pxc cluster build failed due to leadership change in early unit lifecycle
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Triaged | Low | Unassigned |
Charm Helpers | New | Undecided | Unassigned |
OpenStack Percona Cluster Charm | Triaged | Low | Unassigned |
Bug Description
The mysql/0 unit in my deployment failed a cluster-relation-changed hook with KeyError: 'getpwnam(): name not found: mysql'.
Here's the error:
http://
Indeed, there is no mysql entry in /etc/passwd.
I've attached full logs from the run.
Jason Hobbs (jason-hobbs) wrote : | #1 |
James Page (james-page) wrote : | #2 |
James Page (james-page) wrote : | #3 |
Something wonky went on during early unit lifecycle:
2017-10-27 16:23:08 DEBUG juju.worker.
2017-10-27 16:23:08 DEBUG juju.worker.
2017-10-27 16:23:08 DEBUG juju.worker.
2017-10-27 16:23:14 DEBUG juju.worker.
2017-10-27 16:23:14 DEBUG juju.worker.
2017-10-27 16:23:14 DEBUG juju.worker.
2017-10-27 16:24:52 INFO juju.worker.
2017-10-27 16:26:09 DEBUG juju.worker.
2017-10-27 16:28:13 DEBUG worker.uniter.jujuc server.go:178 running hook tool "leader-get"
2017-10-27 16:28:13 DEBUG worker.uniter.jujuc server.go:178 running hook tool "is-leader"
2017-10-27 16:30:48 DEBUG worker.uniter.jujuc server.go:178 running hook tool "leader-set"
2017-10-27 16:36:10 INFO juju.worker.
2017-10-27 16:36:10 DEBUG juju.worker.
2017-10-27 16:36:10 DEBUG juju.worker.
2017-10-27 16:36:10 DEBUG juju.worker.
2017-10-27 16:36:10 DEBUG juju.worker.
2017-10-27 16:36:10 DEBUG install ERROR cannot write leadership settings: cannot write settings: not the leader
2017-10-27 16:36:10 DEBUG install leader_set({key: _password})
2017-10-27 16:36:10 DEBUG install File "/var/lib/
2017-10-27 16:36:10 DEBUG install subprocess.
James Page (james-page) wrote : | #4 |
and at the point where mysql/2 tried to write to leader storage:
2017-10-27 16:36:10 INFO juju.worker.
summary:
- cluster-relation-changed KeyError: 'getpwnam(): name not found: mysql'
+ pxc cluster build failed due to leadership change in early unit lifecycle
James Page (james-page) wrote : | #5 |
tl;dr: leadership changed during the seeding of the passwords (i.e. between a call to is-leader and leader-set), which the charm does not currently handle, so the cluster never bootstrapped.
I'm guessing this is not that easy to reproduce, but at least the cause is visible from the log data provided; the logs from the controller might tell us more about why leadership changed.
Changed in charm-percona-cluster:
status: New → Triaged
importance: Undecided → Low
James Page (james-page) wrote : | #6 |
Adding a bug task for juju; this is a pretty small code block to have leadership switch between two lines:
_password = leader_get(key)
if not _password and is_leader():
_password = config(key) or pwgen()
return _password
Alex Kavanagh (ajkavanagh) wrote : | #7 |
I think the only way to really control for this error, is wrap every call to leader_set(...) in a try: ... except: as the leadership can change during hook execution. i.e. even if is_leader() -> True, it's still possible for a later leader_set(...) set to fail. It's better to catch that failure, and undo any 'leader' things the hook was doing, and then exit the hook, and the new leader unit to perform the leadership actions instead.
e.g. Unless Juju can provide a guarantee that leadership won't change during a hook execution, then charms are going to have to back out of a leader_set(...) failure gracefully.
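As a rough illustration of that approach, here is a minimal sketch assuming the charmhelpers hookenv/host helpers; seed_password() is a hypothetical name, and it assumes a failed leader-set surfaces as subprocess.CalledProcessError (as the traceback above suggests):

# Sketch only: guard leader_set() against losing leadership mid-hook.
import subprocess

from charmhelpers.core.hookenv import (
    config,
    is_leader,
    leader_get,
    leader_set,
    log,
)
from charmhelpers.core.host import pwgen


def seed_password(key):
    """Return the shared password for 'key', seeding it if we are leader.

    Returns None if we are not (or no longer) the leader and the value
    has not been seeded yet; callers should treat that as "try again on
    a later hook (e.g. leader-elected or leader-settings-changed)".
    """
    _password = leader_get(key)
    if _password or not is_leader():
        return _password

    _password = config(key) or pwgen()
    try:
        leader_set({key: _password})
    except subprocess.CalledProcessError:
        # Leadership changed between is_leader() and leader_set();
        # back out and let the new leader seed the value instead.
        log('Lost leadership before {} could be seeded; deferring'.format(key),
            level='WARNING')
        return None
    return _password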
Tim Penhey (thumper) wrote : | #8 |
Juju needs to confirm whether or not we have leadership bouncing between units.
Under "normal" circumstances, where normal means that we have continued network connectivity, once a unit is a leader, it should stay as leader until the API connection is dropped.
There have been reports before of leadership bouncing between units, and this is something we need to investigate. It is possible that clock skew could have been an issue before, but this is where the recent work has gone in to mitigate that problem.
Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.3.0
assignee: nobody → Andrew Wilkins (axwalk)
John A Meinel (jameinel) wrote : Re: [Bug 1728111] Re: pxc cluster build failed due to leadership change in early unit lifecycle | #9 |
It would be good to know from the logs how long *we* think it took for those two lines to execute. On a heavily loaded system I think we've seen things spike as high as 45s for a query to execute, which chews up most of the lease time. Also if there was something like a controller restart, etc.
IIRC is_leader doesn't do an immediate refresh but just checks the current
status. It might make it more reliable if we just force a refresh at that
point.
John
=:->
John A Meinel (jameinel) wrote : | #10 |
(This is speculation while on a walk, not while reading through the code)
Thinking it through... if is_leader isn't refreshing, and we're only doing our async "every 30s, extend the lease by 1 min" loop, then if something happened to that async loop you could see a case where is_leader returns true but we are failing to actually extend the lease.
Even more true if we are only looking at the agent's local state when answering is_leader. If there is clock skew happening, what happens if we get the leadership token and our clock jumps backward by 1 min? It seems possible that locally we think we're the leader but don't try to refresh the token because our time isn't up yet.
Auditing the code to make sure we're using durations and time.Since rather than absolute times/deadlines would allow the monotonic timer of Go 1.9 to help out.
We also need to make sure we're confident we're not doing something wrong
when time is perfectly stable.
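As a general illustration of the durations-vs-deadlines point (in Python rather than Juju's Go code), here is a minimal sketch comparing a wall-clock deadline with a monotonic-duration check; the class names are invented for the example:

# Illustration only (not Juju code): why lease checks should use
# monotonic durations rather than wall-clock deadlines.
import time

LEASE_DURATION = 60.0  # seconds


class WallClockLease:
    """Breaks if the system clock jumps: a backwards jump makes an
    expired lease look valid (and vice versa)."""

    def __init__(self):
        self.deadline = time.time() + LEASE_DURATION

    def held(self):
        return time.time() < self.deadline


class MonotonicLease:
    """Unaffected by wall-clock adjustments: time.monotonic() never
    goes backwards, so the elapsed duration is always trustworthy."""

    def __init__(self):
        self.granted_at = time.monotonic()

    def held(self):
        return time.monotonic() - self.granted_at < LEASE_DURATION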
John
=:->
Andrew Wilkins (axwalk) wrote : | #11 |
"is-leader" does refresh. You can see the details here: https:/
If the clock was jumping on the controller, then this could be explained. I've looked over the worker/lease and worker/leadership code, and it should now be sound when compiled with Go 1.9+ (which we now do), from Juju 2.3-beta2+ (new lease manager code).
John A Meinel (jameinel) wrote : | #12 |
So digging through the code we call
func (ctx *leadershipContext) ensureLeader() error {
...
success := ctx.tracker.
which submits a claim ticket and waits for it to respond; claim tickets are handled here:
if err := t.resolveClaim(
resolveClaim calls:
if leader, err := t.isLeader(); err != nil {
which then:
func (t *Tracker) isLeader() (bool, error) {
    if !t.isMinion {
        // Last time we looked, we were leader.
        select {
        case <-t.tomb.Dying():
            return false, errors.
        case <-t.renewLease:
            logger.Tracef("%s renewing lease for %s leadership", t.unitName, t.applicationName)
            t.renewLease = nil
            if err := t.refresh(); err != nil {
                return false, errors.Trace(err)
            }
        default:
            logger.Tracef("%s still has %s leadership", t.unitName, t.applicationName)
        }
    }
    return !t.isMinion, nil
}
*that* looks to me like we only renew the lease if we are currently pending a renewal (so on a 1 min lease we only renew on IsLeader if we're past the 30s mark). Otherwise the default ("still leader") branch triggers and we just return true.
So if the timing was:
0s: renew leadership for 60s
25s: call IsLeader (no actual refresh)
There doesn't appear to be any database activity after isLeader returns true.
All that refreshing would do is increase the window, which we could
probably do in a different way (just increase the lease time).
The other curious bit is the timing from the log:
2017-10-27 16:28:13 DEBUG worker.uniter.jujuc server.go:178 running hook tool "leader-get"
2017-10-27 16:28:13 DEBUG worker.uniter.jujuc server.go:178 running hook tool "is-leader"
2017-10-27 16:30:48 DEBUG worker.uniter.jujuc server.go:178 running hook tool "leader-set"
That is a full 2m35s from the time we see "is-leader" being called before
"leader-set" is then called.
Given the comment here:
_password = leader_get(key)
if not _password and is_leader():
_password = config(key) or pwgen()
return _password
Is pwgen() actually quite slow on a heavily loaded machine? Is it grabbing
lots of entropy/reading from /dev/random rather than /dev/urandom and
getting blocked?
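For reference, if /dev/random starvation were the culprit, a generator backed by the kernel's non-blocking CSPRNG avoids the stall; a minimal sketch follows (generate_password() and its alphabet are illustrative, not the charm's actual pwgen):

# Sketch only: a non-blocking password generator. Python's secrets
# module reads from the system CSPRNG (os.urandom/getrandom), which
# does not block the way /dev/random can on an entropy-starved host.
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def generate_password(length=32):
    """Return a random password without touching /dev/random."""
    return ''.join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == '__main__':
    print(generate_password())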
So 2m35s is quite a long time. But also note that other things are surprisingly slow:
2017-10-27 16:30:48 DEBUG worker.uniter.jujuc server.go:178 running hook tool "leader-set"
2017-10-27 16:36:10 INFO juju.worker. leadership for mysql/2 denied
Is it really taking us ~5 minutes to deal with the leader-set call? Or are these 2 separate calls we're dealing with?
I'm assuming mysql/2 is the one running in the "something wonky went on
early".
We see that mysql/2 was set to be the leader at 16:24:
2017-10-27 16:23:14 DEBUG juju.worker. making initial claim for mysql leadership
2017-10-27 16:24:52 INFO juju.worker. promoted to leadership of mysql
At 16:36:10 mysql/2 is told it's no longer the leader, but 16:35:30 is when mysql/0 is told that it is now the leader:
2017-10-27 16:35:30 INFO juju.worker.uniter resolver.go:104 found queued "leader-elected" hook
I'm heading back to the raw logs now, but nearly 3min from a is-lea...
John A Meinel (jameinel) wrote : | #13 |
Side note, we do potentially have a serious issue about responding to relation data and coordination of leadership. Our statement that we guarantee you will have no more than 1 leader at any given time doesn't work well with arbitrary hooks in response to relation data changes.
Here is an example timeline:
0s mysql/0 => becomes the leader (goes unresponsive for a bit)
20s rabbit/0 => joins the relation with mysql and sets data in the relation bucket that only the leader can handle
35s mysql/1 sees rabbits data but is not the leader
35s mysql/2 sees rabbits data but is not the leader
60s mysql/0 demoted, mysql/1 is now the leader
65s mysql/1 sees the relation data from rabbit but is no longer the leader
There is no guarantee that there will be a leader that sees relation change data.
The one backstop would be 'leader-elected', which could go through and re-evaluate if there is anything that the previous leader missed. (look at your existing relations, and see if there was something you didn't handle earlier because you weren't the leader, that the last leader also failed to handle).
All of the above is possible even with nothing wrong with our leader election process. All it takes is for the machine where the leader is currently running to be busy with other hooks (colocated workloads), that it takes too long for what was the leader to actually respond to a relation.
I'd like us to figure out what charmers actually need in order to handle this case. Should there be an idea of "if I become the leader, this is what I would want to do" that gets set aside and presented again as context during leader-elected?
John A Meinel (jameinel) wrote : | #14 |
The logs show that leader-elected isn't implemented, which probably means that you can suffer from comment #13:
2017-10-27 16:35:31 INFO juju-log Unknown hook leader-elected - skipping.
I was discussing with Andrew, and one thing that we are thinking about this cycle is trying to introduce Application <=> Application relation data, rather than just having Unit <=> Application data.
In that context, it would be interesting to consider having a "relation-
The initial scope around Application data bags would not change the hook logic, so it wouldn't actually address this bug, but in the stuff we are calling "charms v2" and trying to change what hooks are fired, we could potentially address it there.
Potentially we could introduce a new hook more easily than deprecating all the existing hooks that we fire, which would allow you to have something like "application-
John A Meinel (jameinel) wrote : | #15 |
Looking at the charm: https:/
It does have a symlink of "leader-elected => percona_hooks.py"
but the Python code itself is hitting this handler:
try:
    hooks.execute(sys.argv)
except UnregisteredHookError as e:
    log('Unknown hook {} - skipping.'.format(e))
So it's more a case that you're not actually responding when leader-elected really is fired.
James Page (james-page) wrote : | #16 |
I think the recommendation in #15 to implement the leader-elected hook, and deal with anything missing at that point in time, makes a lot of sense.
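A minimal sketch of that recommendation in charmhelpers style; the @hooks.hook registration and the hookenv helpers are real charmhelpers API, but the settings keys and the body of leader_elected() are hypothetical placeholders for whatever leader-only work the charm needs to re-check:

# Sketch only: register leader-elected and re-run any leader-only work
# that may have been skipped (or half-done) by a previous leader.
import sys

from charmhelpers.core.hookenv import (
    Hooks,
    UnregisteredHookError,
    is_leader,
    leader_get,
    leader_set,
    log,
)
from charmhelpers.core.host import pwgen

hooks = Hooks()


@hooks.hook('leader-elected')
def leader_elected():
    if not is_leader():
        # Leadership can bounce again; only the current leader seeds.
        return
    # Hypothetical keys: seed anything a previous leader never wrote.
    for key in ('root-password', 'sst-password'):
        if not leader_get(key):
            leader_set({key: pwgen()})
            log('Seeded missing leader setting {}'.format(key))
    # ...then re-evaluate relations the old leader may have missed.


if __name__ == '__main__':
    try:
        hooks.execute(sys.argv)
    except UnregisteredHookError as e:
        log('Unknown hook {} - skipping.'.format(e))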
Tim Penhey (thumper) wrote : | #17 |
I'm going to mark the Juju task invalid for now then based on John's comments above.
Changed in juju:
milestone: 2.3.0 → 2.3-rc1
status: Triaged → Invalid
milestone: 2.3-rc1 → none
assignee: Andrew Wilkins (axwalk) → nobody
James Page (james-page) wrote : | #18 |
Setting Juju bug back to New; we can improve the charm but leader switching mid hook execution makes writing charms harder, so we should see if things can be improved.
Changed in juju:
status: Invalid → New
Ryan Beisner (1chb1n) wrote : | #19 |
Agree with James.
Changing leader whilst a hook is executing on the leader is not something we should expect charms and charmers to trap.
Ryan Beisner (1chb1n) wrote : | #20 |
Can Juju also document the assurances made for leadership election, when/why it is determined to be changed, etc? This would be helpful documentation for charm authors to reference.
John A Meinel (jameinel) wrote : | #21 |
It is taking you 2.5 min to go from "is_leader" until we get to "leader_set". If it is taking that long, your system is under enough load that we apparently are unable to guarantee keep-alives. (We need a refresh of leadership, which is done by the unit agent every 30s and extends the leadership for another 1 minute.)
I don't know what exactly is causing it to take 2.5 min, but if we can't get a network request through once a minute then we would allow leadership to lapse.
Dmitrii Shcherbakov (dmitriis) wrote : | #22 |
Sorry, it's a long message but I've got meaningful stuff there (I think).
https:/
https:/
The behavior I encountered in a duplicate bug got me thinking about how to fix this problem at both Juju and charm levels (both will need modifications).
TL;DR:
Juju: revive "leader-deposed" hook work - actually run that hook instead of a no-op (see https:/
Charmers: Modify charms with service-level leadership (not only Juju-level) to use leader-deposed.
Juju: Document when is_leader no longer returns TRUE, and think about leader transactions (where a leader executes code and cannot be deposed until it finishes execution or its process dies), or document operation interruption semantics (if any).
========
Topic 1.
Description:
For clarity, I will name 2 levels of leadership:
* level 1 (L1): Juju-level per-application unit leadership (a leader unit is an actor here);
* level 2 (L2): application-
What happened (pad.lv/1732257)?
L1 leader got elected and started bootstrapping a cluster so L2 leader got created => L1 leader == L2 leader
L1 minions have not done <peer>-
L1-minion-0 got installed and joined a peer relation with the L1 leader but there are only 2/3 peers (min-cluster-size config option gating) => L2-minion-0 has NOT been set up yet (2/3, not clustered, not an L1 leader - no config rendering, no process running).
L1-leader got deposed, however, did not perform any action to depose L2 leader => **L1-minion-2**
L1-minion-1 became L1-leader and **started** bootstrapping a new cluster => L1 leader != L2 leader => 2 L2 leaders present!
L1-minion-0 started its service and spawned an L2 minion which got cluster state from L1-minion-2 (the old L1 and now contending L2 leader) ***before it got it from L1-leader*** => 2 leaders and 1 minion present - hit a RACE CONDITION on L2
L1-leader (new) set a new bootstrap_uuid leader bucket setting which is inconsistent with L2 UUIDs at L1-minion-0 and L1-minion-2 => hook errors at both L1-minion-0 and L1-minion-2
So in the final state there are no errors on L1-leader (new) as it has bootstrap_uuid that was set by it via leader-set (leader_
2 minions are in a separate L2 cluster and have service-level UUIDs that are inconsistent with the leader setting.
AFAICS Juju already has a somewhat transactional nature for leadership changes - there is a "Resign" operation and a "leader-deposed" hook which apparently is not run (it is a no-op):
https:/
2017-11-14 17:21:32 INFO juju.worker.
2017-11-14 17:21:32 DEBUG juju.worker.
tags: added: cpe-onsite
tags: added: uosci
Changed in juju:
status: New → Triaged
John A Meinel (jameinel) wrote : | #23 |
I'm not sure that there is a logic bug in Juju, but we should understand what is going on in the system that is causing us to not refresh leadership correctly. I think the discussion around leader-elected is still relevant.
I'm not sure how much leader-disposed would actually help in this particular case. If you're in the middle of a hook, and you've ever called is_leader should we kill the execution of that script if we want to depose you?
It might work for some of the other cases where you need to tear things down that only got partially set up. Units still get a leader-
Dmitrii Shcherbakov (dmitriis) wrote : | #24 |
Could we provide a guarantee that no unit of a given application will ever consider itself a leader until a previous leader has been deposed in apiserver's view? Likewise, an apiserver should not give any leader tokens until it receives a confirmation that the previous leader has been deposed and ran that hook.
The latter condition is a strong requirement as if there is a network partition and a unit agent is no longer available, apiserver will never elect a new leader. If we introduce a timeout for that this may result in a split-brain unless a unit agent is required to stop executing further operations if there is a connection loss with the apiserver.
We cannot just stop a hook execution because a charm may inherently spawn threads and processes of its own accord, which may daemonize and do other arbitrary things on a system during hook execution. Any process-tracking mechanisms are operating-system-specific (e.g. cgroups) and they can be escaped, so we shouldn't even look that way.
The complicated part is that a unit <-> apiserver connection may be lost but a service-level network may be fine (i.e. the loss of L1-relevant connectivity doesn't mean services on L2 have the same picture) - this is the case where we have ToR and BoR switches providing service and management networks respectively on different physical media (switch fabrics). This is a common scenario for us (that's why we have network spaces). In other words: there may be an L1-related partition but not L2-related partition.
I think that in this case a partitioned unit should run leader-deposed which may run L2-related checks to see if this is only the unit <-> apiserver connectivity problem. This is an interesting scenario as the unit agent is isolated in this case and cannot get anything from the apiserver (can't do facade RPC). However, I think this is a useful scenario to model.
As an operator, would you do something like that with your system? Probably yes, you would go out-of-band or in-person and check if this problem impacts only Juju-related connectivity and decide upon service-level impact - this is what you should have in the charm in leader-deposed hook.
===
Now, as to having only one per-app leader unit running at a time, I believe this is, at least partially, present in Juju.
https:/
// setMinion arranges for lease acquisition when there's an opportunity.
func (t *Tracker) setMinion() error {
    ...
    t.claimLease = make(chan struct{})
    go func() {
        defer close(t.claimLease)
        logger.
        err := t.claimer.
        if err != nil {
            logger.
        }
The only part I have not found yet is explicit blocks on leader-deposed on the apiserver side.
What I think we need:
1. leadership-tracker tries to renew the lease;
2. fails as the token has expired;
3. runs the leader-deposed hook;
4. meanwhile, apiserver doesn't allow anybody else to claim leadership until it got EXPLICIT notificatio...
Ante Karamatić (ivoks) wrote : | #25 |
This behavior is critical for us.
Tim Penhey (thumper) wrote : | #26 |
A key problem we have here is that Juju really can't give any guarantees. I spent some time last week talking with Ryan about what Juju can and can't say at any particular point in time.
The short answer is no, Juju cannot guarantee that a new leader won't be elected until a leader deposed hook is executed because the old leader might not be communicative. Consider the situation where there is a hardware failure, and the machine just dies. There is no way for it to run the hook, and if we are waiting, no other unit would ever be elected leader. This isn't reasonable.
Considering that we can't make this guarantee, we shouldn't rely on it.
No, AFAIK we don't have any explicit waits on other units running leader-deposed.
Tim Penhey (thumper) wrote : | #27 |
I think a key thing to note here is the term "guarantee". I think I may have been taking too hard a line with guarantee.
The key thing to think about here is that the leader "shouldn't" change under normal circumstances. So the situations that are causing a leadership change should be the exceptional circumstances.
To be clear, as long as the agents are able to communicate, the leadership shouldn't change.
All the sharp edge cases are at the exceptional edge though. Why would communication drop?
* net splits - I'm still not clear on what causes a net split
* hardware failures
* severely overloaded servers - we should work out how to be more aware of this, perhaps via the number of running API calls.
Dmitrii Shcherbakov (dmitriis) wrote : | #28 |
Tim,
net split example: you have Juju controllers and MAAS region controllers sitting on layer 2 networks different from rack controllers and application servers in a data center. E.g. there are 9 racks to manage in different locations within the same DC but you would like to keep the same Juju & MAAS regiond control plane located separately so that you can add more racks. In this case there may be a situation where you lose access to one management network for rack "k" from a Juju controller which is a primary in a replicaset. It's a net split but your applications are unaffected - only machine & unit agents.
I think that what we encounter is mostly deployment-time problems because after a model has converged there is little use for Juju leadership hooks. It may be needed if you need to scale your infrastructure (deployment time again) but by then service-level clustering will have already been done.
Another use-case is rolling upgrades: a single unit should initiate them even if the "rolling" part is managed at the service level. But there are two different types of rolling upgrades:
1. for stateless applications - ordering of operations (by a leader) should be done on the Juju side as this is operator-driven if done manually in many cases. Otherwise we will need a "software-upgrader" application which will have to handle that and maintain the deployment state;
2. stateful applications - service-level quorum awareness is required so a leader unit only initiates an upgrade which is done in software itself.
In the cases I've seen we go through the following logic:
1. a leader unit defines who will bootstrap a service-level cluster;
2. service-level elections are performed (ordered connections to a master, PAXOS, RAFT, Totem RRP etc.);
3. leadership is managed at the service level. Leader settings contain an indication of a completed bootstrap procedure and leadership hooks are no-ops.
A practical example:
1. percona cluster (master bootstraps, slaves join without bootstrapping);
2. new slaves join the quorum;
3. any service-level failure conditions require disaster recovery and manual intervention.
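A minimal sketch of the "leader settings record the completed bootstrap" pattern described in steps 1-3 above; bootstrap_cluster() and the bootstrap-uuid key are hypothetical stand-ins, with only the charmhelpers calls being real API:

# Sketch only: gate service-level bootstrap on a leader setting so that
# a later (re-)elected Juju leader does not bootstrap a second cluster.
import uuid

from charmhelpers.core.hookenv import (
    is_leader,
    leader_get,
    leader_set,
    log,
)


def bootstrap_cluster():
    """Hypothetical service-level bootstrap (e.g. start pxc as master)."""
    raise NotImplementedError


def ensure_bootstrapped():
    if leader_get('bootstrap-uuid'):
        # A previous leader already bootstrapped; just join that cluster.
        return
    if not is_leader():
        # Wait for the leader to bootstrap and publish bootstrap-uuid.
        return
    bootstrap_cluster()
    # Record completion so no future leader bootstraps again.
    leader_set({'bootstrap-uuid': str(uuid.uuid4())})
    log('Recorded cluster bootstrap in leader settings')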
Canonical Juju QA Bot (juju-qa-bot) wrote : | #29 |
This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.
Changed in juju:
importance: High → Low
tags: added: expirebugs-bot
Looking at the mysql log data:
./12/lxd/6/var/log/juju/unit-mysql-2.log
2017-10-27 16:26:09 INFO juju.worker.uniter resolver.go:104 found queued "install" hook
2017-10-27 16:42:24 INFO juju.worker.uniter resolver.go:104 found queued "leader-elected" hook
2017-10-27 16:42:24 DEBUG juju.worker.uniter.operation executor.go:69 running operation run leader-elected hook
2017-10-27 16:42:24 DEBUG juju.worker.uniter.operation executor.go:100 preparing operation "run leader-elected hook"
2017-10-27 16:42:24 DEBUG juju.worker.uniter.operation executor.go:100 executing operation "run leader-elected hook"
2017-10-27 16:42:24 DEBUG juju.worker.uniter agent.go:17 [AGENT-STATUS] executing: running leader-elected hook
2017-10-27 16:42:25 INFO juju-log Unknown hook leader-elected - skipping.
2017-10-27 16:44:04 INFO juju.worker.uniter.operation runhook.go:113 ran "leader-elected" hook
2017-10-27 16:44:04 DEBUG juju.worker.uniter.operation executor.go:100 committing operation "run leader-elected hook"
./0/lxd/6/var/log/juju/unit-mysql-0.log
2017-10-27 16:25:56 INFO juju.worker.uniter resolver.go:104 found queued "install" hook
2017-10-27 16:35:30 INFO juju.worker.uniter resolver.go:104 found queued "leader-elected" hook
2017-10-27 16:35:30 DEBUG juju.worker.uniter.operation executor.go:69 running operation run leader-elected hook
2017-10-27 16:35:30 DEBUG juju.worker.uniter.operation executor.go:100 preparing operation "run leader-elected hook"
2017-10-27 16:35:30 DEBUG juju.worker.uniter.operation executor.go:100 executing operation "run leader-elected hook"
2017-10-27 16:35:30 DEBUG juju.worker.uniter agent.go:17 [AGENT-STATUS] executing: running leader-elected hook
2017-10-27 16:35:31 INFO juju-log Unknown hook leader-elected - skipping.
2017-10-27 16:36:50 INFO juju.worker.uniter.operation runhook.go:113 ran "leader-elected" hook
2017-10-27 16:36:50 DEBUG juju.worker.uniter.operation executor.go:100 committing operation "run leader-elected hook"
2017-10-27 16:43:57 INFO juju.worker.uniter resolver.go:104 found queued "leader-elected" hook
2017-10-27 16:43:57 DEBUG juju.worker.uniter.operation executor.go:69 running operation run leader-elected hook
2017-10-27 16:43:57 DEBUG juju.worker.uniter.operation executor.go:100 preparing operation "run leader-elected hook"
2017-10-27 16:43:57 DEBUG juju.worker.uniter.operation executor.go:100 executing operation "run leader-elected hook"
2017-10-27 16:43:57 DEBUG juju.worker.uniter agent.go:17 [AGENT-STATUS] executing: running leader-elected hook
2017-10-27 16:43:59 INFO juju-log Unknown hook leader-elected - skipping.
2017-10-27 16:44:58 INFO juju.worker.uniter.operation runhook.go:113 ran "leader-elected" hook
2017-10-27 16:44:58 DEBUG juju.worker.uniter.operation executor.go:100 committing operation "run leader-elected hook"
pxc is only installed once the lead unit has actually set the cluster root and SST passwords into leader storage; it would appear that at the time of install, none of the units was the leader, so the data was never seeded into leader storage.