clock updated concurrently on a non-ha controller

Bug #1900724 reported by Laurent Sesquès
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Joseph Phillips
Milestone: 2.8.7

Bug Description

This is on a non-ha controller, recently upgraded from 2.7.6 to 2.8.3.

2020-10-12 22:19:05 ERROR juju.core.raftlease store.go:277 command Command(ver: 1, op: setTime, new time: 0002-10-02 08:39:05.716668447 +0000 UTC): clock was updated concurrently, retry
2020-10-13 04:11:06 ERROR juju.core.raftlease store.go:277 command Command(ver: 1, op: setTime, new time: 0002-10-02 14:22:47.694097189 +0000 UTC): clock was updated concurrently, retry

[...]

2020-10-20 14:39:57 ERROR juju.core.raftlease store.go:277 command Command(ver: 1, op: setTime, new time: 0001-01-01 00:00:01.015763373 +0000 UTC): clock was updated concurrently, retry

I think that this should not be possible in a non-ha controller.

description: updated
Pen Gale (pengale)
Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.8.7
Revision history for this message
Pen Gale (pengale) wrote :

Triaging and adding to milestone for further investigation.

It's possible that we are firing off a worker twice, which might be responsible for other raft oddities. This is definitely worth digging into!

Revision history for this message
Joseph Phillips (manadart) wrote :

Got a reproducer upgrading from 2.8 HEAD to develop HEAD.

Changed in juju:
status: Triaged → In Progress
Changed in juju:
status: In Progress → Incomplete
Revision history for this message
Joseph Phillips (manadart) wrote :

This should be a transient error.

When the global clock updater advances time via the Raft store, it sends what it thinks is the old global time in the command.

If this time does not match the Raft FSM last known time, the concurrent update error is returned. The mechanism protects against concurrent updates, which is reflected in the error message, but that doesn't *quite* match what is happening here.

It actually results in the updater syncing its last known time with the FSM and retrying with a back-off.

From what I can tell, this is possible after restarting controller machine agents and restoring the Raft FSM from a snapshot, in which case the updater holds a different last known time than the FSM.
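
To illustrate the mechanism, here is a minimal Go sketch of the updater-side loop, assuming hypothetical store/updater types; it is not the actual juju/juju raftlease code:

package clockupdater

import (
	"errors"
	"fmt"
	"time"
)

// errConcurrentUpdate mirrors the "clock was updated concurrently, retry"
// condition from the logs above.
var errConcurrentUpdate = errors.New("clock was updated concurrently, retry")

// store abstracts the Raft lease store (hypothetical interface): Advance
// applies a setTime command that only succeeds when oldTime matches the
// FSM's last known global time.
type store interface {
	Advance(oldTime, newTime time.Time) error
	GlobalTime() time.Time
}

// updater holds the global clock updater's last known global time.
type updater struct {
	store    store
	lastTime time.Time
}

// advance tries to push the global clock forward by d. On a concurrent
// update error it re-syncs its last known time from the FSM and retries
// with an increasing back-off, which is the transient behaviour described.
func (u *updater) advance(d time.Duration, maxRetries int) error {
	backoff := 50 * time.Millisecond
	for i := 0; i < maxRetries; i++ {
		newTime := u.lastTime.Add(d)
		if err := u.store.Advance(u.lastTime, newTime); err != nil {
			if !errors.Is(err, errConcurrentUpdate) {
				return err
			}
			u.lastTime = u.store.GlobalTime() // sync with the FSM
			time.Sleep(backoff)
			backoff *= 2
			continue
		}
		u.lastTime = newTime
		return nil
	}
	return fmt.Errorf("global clock not advanced after %d attempts", maxRetries)
}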

Changed in juju:
status: Incomplete → In Progress
Revision history for this message
Joseph Phillips (manadart) wrote :

This is a pertinent log capture following a restart.
https://pastebin.canonical.com/p/DjnWdt5rsW/

I can see some issues here.

1. The lease manager makes claims against what appears to be an empty FSM *before* it has completed restoring from its snapshot; you can see the "no leasing found, claiming.." entries.

2. The initial clock advancement, made by a worker instantiated against an empty FSM, uses an old time of 0001-01-01 00:00:00 +0000 UTC, which is incorrect following the FSM restoration mentioned above (see the sketch below).

3. Raft leader election occurs, the relevant workers bounce, and lease claim time-outs follow.

The logs show convergence and success following this jitter, but we don't really want large models spammed with time-out errors every time there is a Raft leader election.
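
To make point 2 concrete, here is a small self-contained Go sketch, with a hypothetical command and FSM shape rather than the actual raftlease implementation, showing how a restored snapshot plus a zero-time updater yields the error:

package main

import (
	"fmt"
	"time"
)

// setTimeCommand carries the sender's believed old time and the new time
// (hypothetical shape for illustration).
type setTimeCommand struct {
	OldTime, NewTime time.Time
}

// fsm holds the replicated global clock.
type fsm struct {
	globalTime time.Time
}

// apply enforces the compare-and-set: the command only lands when the
// sender's old time matches the FSM's last known time.
func (f *fsm) apply(c setTimeCommand) error {
	if !c.OldTime.Equal(f.globalTime) {
		return fmt.Errorf("clock was updated concurrently, retry")
	}
	f.globalTime = c.NewTime
	return nil
}

func main() {
	// Simulate a restart: the FSM is restored from a snapshot with a
	// non-zero global time, while the freshly started updater was built
	// against an empty FSM and still holds the zero time.
	restored := &fsm{globalTime: time.Date(2, time.October, 2, 8, 39, 5, 0, time.UTC)}
	zero := time.Time{} // 0001-01-01 00:00:00 +0000 UTC

	err := restored.apply(setTimeCommand{OldTime: zero, NewTime: zero.Add(time.Second)})
	fmt.Println(err) // clock was updated concurrently, retry
}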

Revision history for this message
Joseph Phillips (manadart) wrote :

I've proposed https://github.com/juju/juju/pull/12248 in light of this ticket.

It doesn't change the fundamentals, but I think it uses better nomenclature.

Ian Booth (wallyworld)
Changed in juju:
assignee: nobody → Joseph Phillips (manadart)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released