Mid-hook lost leadership issues

Bug #1810331 reported by Drew Freiberger
This bug affects 4 people
Affects: Canonical Juju
Status: Fix Released
Importance: Critical
Assigned to: Christian Muirhead
Milestone: 2.5.1

Bug Description

I just hit a situation in a unique environment that has undergone Juju upgrades (1.25 -> 2.4.x -> 2.5-beta3) and a trusty->xenial series upgrade, and is now going through OpenStack charm upgrades. A relation-changed hook in the keystone charm fired during a leadership re-election: the identity-credentials-relation-changed hook was performing leader-set calls because the unit was the elected leader at the time the hook started executing. Somehow the Juju controller lost that state and no keystone unit held leadership. I performed a resolved --no-retry to drop that hook, but the next hook that fired also ran in an is-leader=true context.

Checking juju status, no keystone unit held leadership, and leader-set commands were returning "leadership-election cycling too quickly, try again later" errors.

To resolve it, I stopped the jujud-unit-keystone-X service; once the Juju controllers lost that agent, they elected one of the other units as leader and I was able to restart the stopped agent.

It seems that the hook context somehow retained its leadership knowledge while the Juju controllers no longer agreed that the unit was the leader, yet they did not list any other unit as leader either. It is almost as if the controllers knew this unit was stuck believing it was the leader, but would not let it keep functioning as leader until it could reach a leadership-change hook or an agent-lost state.

Changed in juju-lint:
status: New → Invalid
Revision history for this message
Richard Harding (rharding) wrote :

Do we have any logs of the actual hook execs going on during this time? They'd be really helpful in trying to track the logic and possible repro steps.

Changed in juju:
status: New → Triaged
importance: Undecided → Medium
milestone: none → 2.6-beta1
Revision history for this message
Richard Harding (rharding) wrote :

This was hit in PS 4.5 today; logs of telegraf hitting this: https://pastebin.canonical.com/p/SCZrYHXqt4/

Revision history for this message
William Grant (wgrant) wrote :

The stg-ols-snap-store controller (currently 2.5-rc1, using raft leases) is affected by what seems to be a similar bug. The controller has one non-controller model, the jsft output for which is at https://pastebin.canonical.com/p/3Kyq4fbW4F/. There are a number of services for which juju status doesn't know about a leader, but is-leader is true on exactly one unit.

[STAGING] stg-ols-snap-store@wendigo:~$ juju run --application cassandra is-leader
- Stdout: |
    False
  UnitId: cassandra/3
- Stdout: |
    False
  UnitId: cassandra/4
- Stdout: |
    True
  UnitId: cassandra/5

[STAGING] stg-ols-snap-store@wendigo:~$ jsft | grep ^cassandra
cassandra active 3 cassandra local 1 ubuntu
cassandra/3 active idle 237 10.50.79.95 9042/tcp,9160/tcp Live seed
cassandra/4 active idle 238 10.50.79.96 9042/tcp,9160/tcp Live node
cassandra/5 active idle 239 10.50.79.97 9042/tcp,9160/tcp Live seed

Controller log since the upgrade: https://pastebin.canonical.com/p/3NFNkYcBQ5/

Unit log from the sole unit of an application that has no leader in status:
  Immediately after the upgrade: https://pastebin.canonical.com/p/p8hRsRpqSB/
  All mentions of "leader": https://pastebin.canonical.com/p/v4Jq7QNs7M/

While I was interrogating the controller, it OOMed and restarted. The cassandra application, at least, remains in an identical state: status reports no leader, but is-leader is true only on cassandra/5.

Controller log for the restart: https://pastebin.canonical.com/p/M45gXhhVHM/
cassandra/5 agent log for the restart: https://pastebin.canonical.com/p/JR9dBpWvGj/

Revision history for this message
Ian Booth (wallyworld) wrote :

Here's a snippet found in IRC logs which appears to point at the state part of the code:

2019-01-02 19:42:59 DEBUG identity-service-relation-changed ERROR cannot write leadership settings: cannot write settings: failed to merge leadership settings: state changing too quickly; try again soon
2019-01-02 19:42:59 DEBUG identity-service-relation-changed Traceback (most recent call last):

William Grant (wgrant)
affects: juju-lint → null-and-void
no longer affects: null-and-void
Ian Booth (wallyworld)
Changed in juju:
milestone: 2.6-beta1 → 2.5.1
importance: Medium → Critical
Changed in juju:
assignee: nobody → Joseph Phillips (manadart)
status: Triaged → In Progress
Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1810331] Re: Mid-hook lost leadership issues

I think Christian and I worked this out today. Specifically,

a) Raft keeps an FSM which tracks who the current leader is.
b) When the leader changes, Raft writes the identity of the current lease holder to the leaseholders collection.
c) When making a change to leader-settings content, we ask Raft to check that we are currently the leader.
d) We then create a transaction that asserts the holder (matching b).

However, it turns out that (b) can fail (due to mongo contention, timeouts, etc.), so the database write never actually completes. Raft can't roll back an FSM change, so we end up inconsistent.

The reason you get "state changing too quickly" is that the check in (c) is against memory, while the assert in (d) is against the database.
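
To make the divergence concrete, here is a minimal Go sketch of steps (a)-(d); every identifier below is invented for illustration and none of it is the actual Juju code.

package sketch

import "errors"

var errStateChanging = errors.New("state changing too quickly; try again soon")

// LeaseFSM stands in for the raft FSM's in-memory view of lease holders (a).
type LeaseFSM struct {
    holder map[string]string // lease name -> current holder unit
}

// onLeaderChange is step (b): raft has already applied the FSM change, but the
// follow-up mongo write can fail (contention, i/o timeout), and raft cannot
// roll the FSM back, so memory and database diverge.
func onLeaderChange(fsm *LeaseFSM, writeHolderDoc func(lease, holder string) error, lease, unit string) error {
    fsm.holder[lease] = unit // the FSM copy is already updated via raft
    if err := writeHolderDoc(lease, unit); err != nil {
        // e.g. "couldn't claim lease ...: i/o timeout" in the controller log;
        // the database still records the previous holder.
        return err
    }
    return nil
}

// writeLeaderSettings is steps (c) and (d): the leadership check consults
// memory, while the transaction's assert is effectively against the database.
func writeLeaderSettings(fsm *LeaseFSM, dbHolder func(lease string) string, lease, unit string) error {
    if fsm.holder[lease] != unit { // (c) in-memory check
        return errors.New("not the leader")
    }
    if dbHolder(lease) != unit { // (d) assert against a possibly stale document
        return errStateChanging // what the charm sees in this bug
    }
    return nil // otherwise the leader-settings write proceeds
}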

However, the Raft FSM is intended to be the one true source of truth about leaders; it just happens that it couldn't update the database copy. So during (c) we can check whether the database is consistent with memory and, if not, go update the database.

We're reasonably confident about the source of the errors because looking
in controller logs we can see:
./machine-2.2.log:775239:2019-01-23 05:48:14 ERROR juju.worker.raft.raftforwarder target.go:168 couldn't claim lease "e39da954-406c-4e8d-8da8-4cfd8e979895:application-leadership#landscape-client#" for "landscape-client/0": read tcp 127.0.0.1:34748->127.0.0.1:37017: i/o timeout

And that is exactly the message you get when Raft fails to update Mongo.

As a performance optimization, step (c) can just do the in-memory check on attempt=1, and only go reread and update the database if the first attempt gets aborted. (It's what we do in about 90% of the cases anyway: start with in-memory state, and if the txn fails, reread from the DB and try again.)
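
A rough sketch of that optimization, again with invented names and types (and illustrative attempt numbering), written as the kind of retried transaction-builder closure described above: the first attempt trusts the in-memory FSM; a retry after an aborted transaction rereads the database and, if it disagrees with the FSM, repairs the leaseholders record before asserting on it.

package sketch

import "errors"

// Op stands in for a transaction operation that asserts the lease holder and
// writes the leader settings.
type Op struct{}

// Store abstracts the two copies of leadership state plus the settings write.
type Store interface {
    FSMHolder(lease string) string          // in-memory raft FSM view
    DBHolder(lease string) (string, error)  // leaseholders document in the database
    SetDBHolder(lease, holder string) error // repair the database copy
    SettingsOp(lease, holder string) Op     // op asserting holder and writing settings
}

// buildLeaderSettingsTxn returns a transaction builder that the runner calls
// with an attempt number: the first attempt trusts the in-memory FSM only; a
// retry after an aborted transaction also rereads the database and reconciles
// it with the FSM before asserting on it.
func buildLeaderSettingsTxn(s Store, lease, unit string) func(attempt int) ([]Op, error) {
    return func(attempt int) ([]Op, error) {
        if s.FSMHolder(lease) != unit {
            return nil, errors.New("not the leader")
        }
        if attempt > 0 {
            // A previous attempt aborted: the database copy may be stale.
            holder, err := s.DBHolder(lease)
            if err != nil {
                return nil, err
            }
            if holder != unit {
                // Raft is the source of truth; bring the database back in
                // line so the assert in the settings op can pass.
                if err := s.SetDBHolder(lease, unit); err != nil {
                    return nil, err
                }
            }
        }
        return []Op{s.SettingsOp(lease, unit)}, nil
    }
}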

I believe that Christian is going to be working on this during his morning tomorrow, if we want to have Joe work on something else. Or Joe can just finish the work before Christian starts, and then Christian can work on some of the other things (like not having ClaimLeadership(onetoken) create a map of *all* leaders to answer that question).


Revision history for this message
John A Meinel (jameinel) wrote :

For those following this bug: the 'state changing too quickly' error is a downstream symptom of an earlier "couldn't claim lease" error, which you should see in the controller log if our analysis is correct.

Changed in juju:
assignee: Joseph Phillips (manadart) → Christian Muirhead (2-xtian)
Revision history for this message
Christian Muirhead (2-xtian) wrote :
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released