Mid-hook lost leadership issues

Bug #1810331 reported by Drew Freiberger
This bug affects 4 people
Affects: Canonical Juju
Status: Fix Released
Importance: Critical
Assigned to: Christian Muirhead
Milestone: 2.5.1

Bug Description

I just hit a situation in a unique environment that has undergone Juju upgrades (1.25 -> 2.4.x -> 2.5-beta3) and a trusty->xenial series upgrade, and is now going through OpenStack charm upgrades. A relation-changed hook in the keystone charm fired during a leadership re-election: the identity-credentials-relation-changed hook was performing leader-set calls because the unit was the elected leader at the time the hook started executing. Somehow the Juju controller lost that state and no keystone unit held leadership. I performed a resolved --no-retry to drop that hook, but the next hook that fired also ran in an is-leader=true context.

Checking juju status, no keystone unit held leadership, and leader-set commands were returning "leadership-election cycling too quickly, try again later" errors.

To resolve it, I stopped the jujud-unit-keystone-X service; once the Juju controllers lost that agent, they elected one of the other units as leader and I was able to restart the stopped agent.

It seems that the hook context somehow retained its leadership knowledge while the Juju controllers no longer agreed that the unit was the leader, yet they did not list any other unit as leader either. It is almost as if the controllers knew this unit was stuck believing it was the leader, but would not let it keep functioning as leader until it could reach a leadership-change hook or an agent-lost state.

Changed in juju-lint:
status: New → Invalid
Revision history for this message
Richard Harding (rharding) wrote :

Do we have any logs of the actual hook execs going on during this time? They'd be really helpful in trying to track the logic and possible repro steps.

Changed in juju:
status: New → Triaged
importance: Undecided → Medium
milestone: none → 2.6-beta1
Revision history for this message
Richard Harding (rharding) wrote :

This was hit in PS 4.5 today; logs of telegraf hitting this: https://pastebin.canonical.com/p/SCZrYHXqt4/

Revision history for this message
William Grant (wgrant) wrote :

The stg-ols-snap-store controller (currently 2.5-rc1, using raft leases) is affected by what seems to be a similar bug. The controller has one non-controller model, the jsft output for which is at https://pastebin.canonical.com/p/3Kyq4fbW4F/. There are a number of services for which juju status doesn't know about a leader, but is-leader is true on exactly one unit.

[STAGING] stg-ols-snap-store@wendigo:~$ juju run --application cassandra is-leader
- Stdout: |
    False
  UnitId: cassandra/3
- Stdout: |
    False
  UnitId: cassandra/4
- Stdout: |
    True
  UnitId: cassandra/5

[STAGING] stg-ols-snap-store@wendigo:~$ jsft | grep ^cassandra
cassandra active 3 cassandra local 1 ubuntu
cassandra/3 active idle 237 10.50.79.95 9042/tcp,9160/tcp Live seed
cassandra/4 active idle 238 10.50.79.96 9042/tcp,9160/tcp Live node
cassandra/5 active idle 239 10.50.79.97 9042/tcp,9160/tcp Live seed

Controller log since the upgrade: https://pastebin.canonical.com/p/3NFNkYcBQ5/

Unit log from the sole unit of an application that has no leader in status:
  Immediately after the upgrade: https://pastebin.canonical.com/p/p8hRsRpqSB/
  All mentions of "leader": https://pastebin.canonical.com/p/v4Jq7QNs7M/

While I was interrogating the controller, it OOMed and restarted. The cassandra application, at least, remains in an identical state: status reports no leader, but is-leader is true only on cassandra/5.

Controller log for the restart: https://pastebin.canonical.com/p/M45gXhhVHM/
cassandra/5 agent log for the restart: https://pastebin.canonical.com/p/JR9dBpWvGj/

Revision history for this message
Ian Booth (wallyworld) wrote :

Here's a snippet found in IRC logs which appears to point at the state part of the code:

2019-01-02 19:42:59 DEBUG identity-service-relation-changed ERROR cannot write leadership settings: cannot write settings: failed to merge leadership settings: state changing too quickly; try again soon
2019-01-02 19:42:59 DEBUG identity-service-relation-changed Traceback (most recent call last):

William Grant (wgrant)
affects: juju-lint → null-and-void
no longer affects: null-and-void
Ian Booth (wallyworld)
Changed in juju:
milestone: 2.6-beta1 → 2.5.1
importance: Medium → Critical
Changed in juju:
assignee: nobody → Joseph Phillips (manadart)
status: Triaged → In Progress
Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1810331] Re: Mid-hook lost leadership issues

I think Christian and I worked this out today. Specifically,

a) Raft keeps an FSM which tracks who the current leader is.
b) When the leader changes, Raft writes the identity of the current lease holder to the leaseholders collection.
c) When making a change to leader-settings content, we ask Raft to check that we are currently the leader.
d) We then create a transaction that asserts the holder (matching b).

However, it turns out that (b) can fail (due to mongo contention, timeouts, etc.), so the database write never actually completes. Raft can't roll back an FSM change, so we end up inconsistent.

The reason you get "state changing too quickly" is that the check in (c) is against memory, while the assert in (d) is against the database.
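
To make the divergence concrete, here is a minimal Go sketch of steps (a)-(d); every identifier below is invented for illustration and none of it is the actual Juju code.

package sketch

import "errors"

var errStateChanging = errors.New("state changing too quickly; try again soon")

// LeaseFSM stands in for the raft FSM's in-memory view of lease holders (a).
type LeaseFSM struct {
    holder map[string]string // lease name -> current holder unit
}

// onLeaderChange is step (b): raft has already applied the FSM change, but the
// follow-up mongo write can fail (contention, i/o timeout), and raft cannot
// roll the FSM back, so memory and database diverge.
func onLeaderChange(fsm *LeaseFSM, writeHolderDoc func(lease, holder string) error, lease, unit string) error {
    fsm.holder[lease] = unit // the FSM copy is already updated via raft
    if err := writeHolderDoc(lease, unit); err != nil {
        // e.g. "couldn't claim lease ...: i/o timeout" in the controller log;
        // the database still records the previous holder.
        return err
    }
    return nil
}

// writeLeaderSettings is steps (c) and (d): the leadership check consults
// memory, while the transaction's assert is effectively against the database.
func writeLeaderSettings(fsm *LeaseFSM, dbHolder func(lease string) string, lease, unit string) error {
    if fsm.holder[lease] != unit { // (c) in-memory check
        return errors.New("not the leader")
    }
    if dbHolder(lease) != unit { // (d) assert against a possibly stale document
        return errStateChanging // what the charm sees in this bug
    }
    return nil // otherwise the leader-settings write proceeds
}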

However, the Raft FSM is intended to be the one true source of truth about leaders; it just happens that it couldn't update the database copy. So during (c) we can check whether the database is consistent with memory and, if not, go update the database.

We're reasonably confident about the source of the errors because looking
in controller logs we can see:
./machine-2.2.log:775239:2019-01-23 05:48:14 ERROR juju.worker.raft.raftforwarder target.go:168 couldn't claim lease "e39da954-406c-4e8d-8da8-4cfd8e979895:application-leadership#landscape-client#" for "landscape-client/0": read tcp 127.0.0.1:34748->127.0.0.1:37017: i/o timeout

And that is exactly the message you get when Raft fails to update Mongo.

As a performance optimization, step (c) can just do the in-memory check on attempt=1, and only go reread and update the database if the first attempt gets aborted. (It's what we do in about 90% of the cases anyway: start with in-memory state, and if the txn fails, reread from the DB and try again.)
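
A rough sketch of that optimization, again with invented names and types (and illustrative attempt numbering), written as the kind of retried transaction-builder closure described above: the first attempt trusts the in-memory FSM; a retry after an aborted transaction rereads the database and, if it disagrees with the FSM, repairs the leaseholders record before asserting on it.

package sketch

import "errors"

// Op stands in for a transaction operation that asserts the lease holder and
// writes the leader settings.
type Op struct{}

// Store abstracts the two copies of leadership state plus the settings write.
type Store interface {
    FSMHolder(lease string) string          // in-memory raft FSM view
    DBHolder(lease string) (string, error)  // leaseholders document in the database
    SetDBHolder(lease, holder string) error // repair the database copy
    SettingsOp(lease, holder string) Op     // op asserting holder and writing settings
}

// buildLeaderSettingsTxn returns a transaction builder that the runner calls
// with an attempt number: the first attempt trusts the in-memory FSM only; a
// retry after an aborted transaction also rereads the database and reconciles
// it with the FSM before asserting on it.
func buildLeaderSettingsTxn(s Store, lease, unit string) func(attempt int) ([]Op, error) {
    return func(attempt int) ([]Op, error) {
        if s.FSMHolder(lease) != unit {
            return nil, errors.New("not the leader")
        }
        if attempt > 0 {
            // A previous attempt aborted: the database copy may be stale.
            holder, err := s.DBHolder(lease)
            if err != nil {
                return nil, err
            }
            if holder != unit {
                // Raft is the source of truth; bring the database back in
                // line so the assert in the settings op can pass.
                if err := s.SetDBHolder(lease, unit); err != nil {
                    return nil, err
                }
            }
        }
        return []Op{s.SettingsOp(lease, unit)}, nil
    }
}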

I believe that Christian is going to be working on this during his morning tomorrow, if we want to have Joe work on something else. Or Joe can just finish the work before Christian starts, and then Christian can work on some of the other things (like not having ClaimLeadership(onetoken) create a map of *all* leaders to answer that question).


Revision history for this message
John A Meinel (jameinel) wrote :

For those following this bug: the 'state changing too quickly' error is a downstream symptom of an earlier "couldn't claim lease" error, which you should see in the controller log if our analysis is correct.

Changed in juju:
assignee: Joseph Phillips (manadart) → Christian Muirhead (2-xtian)
Revision history for this message
Christian Muirhead (2-xtian) wrote :
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released