pxc cluster build failed due to leadership change in early unit lifecycle
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Triaged | Low | Unassigned |
Charm Helpers | New | Undecided | Unassigned |
OpenStack Percona Cluster Charm | Triaged | Low | Unassigned |
Bug Description
The mysql/0 unit in my deployment failed a cluster-relation-changed hook with KeyError: 'getpwnam(): name not found: mysql'.
Here's the error:
http://
Indeed, there is no mysql entry in /etc/passwd.
I've attached full logs from the run.
Jason Hobbs (jason-hobbs) wrote : | #1 |
James Page (james-page) wrote : | #2 |
James Page (james-page) wrote : | #3 |
Something wonky went on during early unit lifecycle:
2017-10-27 16:23:08 DEBUG juju.worker.
2017-10-27 16:23:08 DEBUG juju.worker.
2017-10-27 16:23:08 DEBUG juju.worker.
2017-10-27 16:23:14 DEBUG juju.worker.
2017-10-27 16:23:14 DEBUG juju.worker.
2017-10-27 16:23:14 DEBUG juju.worker.
2017-10-27 16:24:52 INFO juju.worker.
2017-10-27 16:26:09 DEBUG juju.worker.
2017-10-27 16:28:13 DEBUG worker.uniter.jujuc server.go:178 running hook tool "leader-get"
2017-10-27 16:28:13 DEBUG worker.uniter.jujuc server.go:178 running hook tool "is-leader"
2017-10-27 16:30:48 DEBUG worker.uniter.jujuc server.go:178 running hook tool "leader-set"
2017-10-27 16:36:10 INFO juju.worker.
2017-10-27 16:36:10 DEBUG juju.worker.
2017-10-27 16:36:10 DEBUG juju.worker.
2017-10-27 16:36:10 DEBUG juju.worker.
2017-10-27 16:36:10 DEBUG juju.worker.
2017-10-27 16:36:10 DEBUG install ERROR cannot write leadership settings: cannot write settings: not the leader
2017-10-27 16:36:10 DEBUG install leader_set({key: _password})
2017-10-27 16:36:10 DEBUG install File "/var/lib/
2017-10-27 16:36:10 DEBUG install subprocess.
James Page (james-page) wrote : | #4 |
and at the point where mysql/2 tried to write to leader storage:
2017-10-27 16:36:10 INFO juju.worker.
summary:
- cluster-relation-changed KeyError: 'getpwnam(): name not found: mysql'
+ pxc cluster build failed due to leadership change in early unit lifecycle
James Page (james-page) wrote : | #5 |
tl;dr: leadership changed during the seeding of the passwords (i.e. between a call to is-leader and leader-set), which the charm does not currently handle, so the cluster never bootstrapped.
I'm guessing this is not that easy to reproduce, but at least the cause is visible from the log data provided; the logs from the controller might tell us more about why leadership changed.
Changed in charm-percona-cluster:
status: New → Triaged
importance: Undecided → Low
James Page (james-page) wrote : | #6 |
Adding a bug task for juju; this is a pretty small code block to have leadership switch between two lines:
_password = leader_get(key)
if not _password and is_leader():
_password = config(key) or pwgen()
return _password
Alex Kavanagh (ajkavanagh) wrote : | #7 |
I think the only way to really control for this error, is wrap every call to leader_set(...) in a try: ... except: as the leadership can change during hook execution. i.e. even if is_leader() -> True, it's still possible for a later leader_set(...) set to fail. It's better to catch that failure, and undo any 'leader' things the hook was doing, and then exit the hook, and the new leader unit to perform the leadership actions instead.
e.g. Unless Juju can provide a guarantee that leadership won't change during a hook execution, then charms are going to have to back out of a leader_set(...) failure gracefully.
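As a rough illustration of that approach, here is a minimal sketch assuming the charmhelpers hookenv/host helpers; seed_password() is a hypothetical name, and it assumes a failed leader-set surfaces as subprocess.CalledProcessError (as the traceback above suggests):

# Sketch only: guard leader_set() against losing leadership mid-hook.
import subprocess

from charmhelpers.core.hookenv import (
    config,
    is_leader,
    leader_get,
    leader_set,
    log,
)
from charmhelpers.core.host import pwgen


def seed_password(key):
    """Return the shared password for 'key', seeding it if we are leader.

    Returns None if we are not (or no longer) the leader and the value
    has not been seeded yet; callers should treat that as "try again on
    a later hook (e.g. leader-elected or leader-settings-changed)".
    """
    _password = leader_get(key)
    if _password or not is_leader():
        return _password

    _password = config(key) or pwgen()
    try:
        leader_set({key: _password})
    except subprocess.CalledProcessError:
        # Leadership changed between is_leader() and leader_set();
        # back out and let the new leader seed the value instead.
        log('Lost leadership before {} could be seeded; deferring'.format(key),
            level='WARNING')
        return None
    return _password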
Tim Penhey (thumper) wrote : | #8 |
Juju needs to confirm whether or not we have leadership bouncing between units.
Under "normal" circumstances, where normal means that we have continued network connectivity, once a unit is a leader, it should stay as leader until the API connection is dropped.
There have been reports before of leadership bouncing between units, and this is something we need to investigate. It is possible that clock skew could have been an issue before, but this is where the recent work has gone in to mitigate that problem.
Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.3.0
assignee: nobody → Andrew Wilkins (axwalk)
John A Meinel (jameinel) wrote : Re: [Bug 1728111] Re: pxc cluster build failed due to leadership change in early unit lifecycle | #9 |
It would be good to know from the logs how long *we* think it took for those two lines to execute. On a heavily loaded system I think we've seen things spike as high as 45s for a query to execute, which chews up most of the lease time. Also if there was something like a controller restart, etc.
IIRC is_leader doesn't do an immediate refresh but just checks the current
status. It might make it more reliable if we just force a refresh at that
point.
John
=:->
John A Meinel (jameinel) wrote : | #10 |
(This is speculation while on a walk, not while reading through the code)
Thinking it through... if is_leader isn't refreshing, and we're only doing our async "every 30s, extend the lease by 1 min" loop, then if something happened to that async loop you could see a case where is_leader returns true but we are failing to actually extend the lease.
Even more true if we are only looking at the agent's local state when answering is_leader. If there is clock skew happening, what happens if we get the leadership token and our clock jumps backward by 1 min? It seems possible that locally we think we're the leader but don't try to refresh the token because our time isn't up yet.
Auditing the code to make sure we're using durations and time.Since rather than absolute times/deadlines would allow the monotonic timer of Go 1.9 to help out.
We also need to make sure we're confident we're not doing something wrong
when time is perfectly stable.
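As a general illustration of the durations-vs-deadlines point (in Python rather than Juju's Go code), here is a minimal sketch comparing a wall-clock deadline with a monotonic-duration check; the class names are invented for the example:

# Illustration only (not Juju code): why lease checks should use
# monotonic durations rather than wall-clock deadlines.
import time

LEASE_DURATION = 60.0  # seconds


class WallClockLease:
    """Breaks if the system clock jumps: a backwards jump makes an
    expired lease look valid (and vice versa)."""

    def __init__(self):
        self.deadline = time.time() + LEASE_DURATION

    def held(self):
        return time.time() < self.deadline


class MonotonicLease:
    """Unaffected by wall-clock adjustments: time.monotonic() never
    goes backwards, so the elapsed duration is always trustworthy."""

    def __init__(self):
        self.granted_at = time.monotonic()

    def held(self):
        return time.monotonic() - self.granted_at < LEASE_DURATION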
John
=:->
Andrew Wilkins (axwalk) wrote : | #11 |
"is-leader" does refresh. You can see the details here: https:/
If the clock was jumping on the controller, then this could be explained. I've looked over the worker/lease and worker/leadership code, and it should now be sound when compiled with Go 1.9+ (which we now do), from Juju 2.3-beta2+ (new lease manager code).
John A Meinel (jameinel) wrote : | #12 |
So digging through the code we call
func (ctx *leadershipContext) ensureLeader() error {
...
success := ctx.tracker.
which submits a claim ticket and waits for it to respond; claim tickets are handled here:
if err := t.resolveClaim(
resolveClaim calls:
if leader, err := t.isLeader(); err != nil {
which then:
func (t *Tracker) isLeader() (bool, error) {
    if !t.isMinion {
        // Last time we looked, we were leader.
        select {
        case <-t.tomb.Dying():
            return false, errors.
        case <-t.renewLease:
            logger.Tracef("%s renewing lease for %s leadership", t.unitName, t.applicationName)
            t.renewLease = nil
            if err := t.refresh(); err != nil {
                return false, errors.Trace(err)
            }
        default:
            logger.Tracef("%s still has %s leadership", t.unitName, t.applicationName)
        }
    }
    return !t.isMinion, nil
}
*that* looks to me like we only renew the lease if we are currently pending a renewal (so on a 1 min lease we only renew on IsLeader if we're past the 30s mark). Otherwise the default ("still leader") branch triggers and we just return true.
So if the timing was:
0s: renew leadership for 60s
25s: call IsLeader (no actual refresh)
There doesn't appear to be any database activity after isLeader returns true.
All that refreshing would do is increase the window, which we could
probably do in a different way (just increase the lease time).
The other curious bit is the timing from the log:
2017-10-27 16:28:13 DEBUG worker.uniter.jujuc server.go:178 running hook tool "leader-get"
2017-10-27 16:28:13 DEBUG worker.uniter.jujuc server.go:178 running hook tool "is-leader"
2017-10-27 16:30:48 DEBUG worker.uniter.jujuc server.go:178 running hook tool "leader-set"
That is a full 2m35s from the time we see "is-leader" being called before
"leader-set" is then called.
Given the comment here:
_password = leader_get(key)
if not _password and is_leader():
_password = config(key) or pwgen()
return _password
Is pwgen() actually quite slow on a heavily loaded machine? Is it grabbing
lots of entropy/reading from /dev/random rather than /dev/urandom and
getting blocked?
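For reference, if /dev/random starvation were the culprit, a generator backed by the kernel's non-blocking CSPRNG avoids the stall; a minimal sketch follows (generate_password() and its alphabet are illustrative, not the charm's actual pwgen):

# Sketch only: a non-blocking password generator. Python's secrets
# module reads from the system CSPRNG (os.urandom/getrandom), which
# does not block the way /dev/random can on an entropy-starved host.
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def generate_password(length=32):
    """Return a random password without touching /dev/random."""
    return ''.join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == '__main__':
    print(generate_password())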
So 2m35s is quite a long time. But also note that other things are surprisingly slow:
2017-10-27 16:30:48 DEBUG worker.uniter.jujuc server.go:178 running hook tool "leader-set"
2017-10-27 16:36:10 INFO juju.worker. leadership for mysql/2 denied
Is it really taking us ~5 minutes to deal with the leader-set call? Or are these 2 separate calls we're dealing with?
I'm assuming mysql/2 is the one running in the "something wonky went on
early".
We see that mysql/2 was set to be the leader at 16:24:
2017-10-27 16:23:14 DEBUG juju.worker. making initial claim for mysql leadership
2017-10-27 16:24:52 INFO juju.worker. promoted to leadership of mysql
At 16:36:10 mysql/2 is told it's no longer the leader, but 16:35:30 is when mysql/0 is told that it is now the leader:
2017-10-27 16:35:30 INFO juju.worker.uniter resolver.go:104 found queued "leader-elected" hook
I'm heading back to the raw logs now, but nearly 3min from a is-lea...
John A Meinel (jameinel) wrote : | #13 |
Side note, we do potentially have a serious issue about responding to relation data and coordination of leadership. Our statement that we guarantee you will have no more than 1 leader at any given time doesn't work well with arbitrary hooks in response to relation data changes.
Here is an example timeline:
0s mysql/0 => becomes the leader (goes unresponsive for a bit)
20s rabbit/0 => joins the relation with mysql and sets data in the relation bucket that only the leader can handle
35s mysql/1 sees rabbits data but is not the leader
35s mysql/2 sees rabbits data but is not the leader
60s mysql/0 demoted, mysql/1 is now the leader
65s mysql/1 sees the relation data from rabbit but is no longer the leader
There is no guarantee that there will be a leader that sees relation change data.
The one backstop would be 'leader-elected', which could go through and re-evaluate if there is anything that the previous leader missed. (look at your existing relations, and see if there was something you didn't handle earlier because you weren't the leader, that the last leader also failed to handle).
All of the above is possible even with nothing wrong with our leader election process. All it takes is for the machine where the leader is currently running to be busy with other hooks (colocated workloads), that it takes too long for what was the leader to actually respond to a relation.
I'd like us to figure out what charmers actually need in order to handle this case. Should there be an idea of "if I become the leader, this is what I would want to do" that gets set aside and presented again as context during leader-elected?
John A Meinel (jameinel) wrote : | #14 |
The logs show that leader-elected isn't implemented, which probably means that you can suffer from comment #13:
2017-10-27 16:35:31 INFO juju-log Unknown hook leader-elected - skipping.
I was discussing with Andrew, and one thing that we are thinking about this cycle is trying to introduce Application <=> Application relation data, rather than just having Unit <=> Application data.
In that context, it would be interesting to consider having a "relation-
The initial scope around Application data bags would not change the hook logic, so it wouldn't actually address this bug, but in the stuff we are calling "charms v2" and trying to change what hooks are fired, we could potentially address it there.
Potentially we could introduce a new hook more easily than deprecating all the existing hooks that we fire, which would allow you to have something like "application-
John A Meinel (jameinel) wrote : | #15 |
Looking at the charm: https:/
It does have a symlink of "leader-elected => percona_hooks.py"
but the Python code itself is hitting this handler:
try:
    hooks.execute(sys.argv)
except UnregisteredHookError as e:
    log('Unknown hook {} - skipping.'.format(e))
So it's more a case that you're not actually responding when leader-elected really is fired.
James Page (james-page) wrote : | #16 |
I think the recommendation in #15 to implement the leader-elected hook, and deal with anything missing at that point in time, makes a lot of sense.
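A minimal sketch of that recommendation in charmhelpers style; the @hooks.hook registration and the hookenv helpers are real charmhelpers API, but the settings keys and the body of leader_elected() are hypothetical placeholders for whatever leader-only work the charm needs to re-check:

# Sketch only: register leader-elected and re-run any leader-only work
# that may have been skipped (or half-done) by a previous leader.
import sys

from charmhelpers.core.hookenv import (
    Hooks,
    UnregisteredHookError,
    is_leader,
    leader_get,
    leader_set,
    log,
)
from charmhelpers.core.host import pwgen

hooks = Hooks()


@hooks.hook('leader-elected')
def leader_elected():
    if not is_leader():
        # Leadership can bounce again; only the current leader seeds.
        return
    # Hypothetical keys: seed anything a previous leader never wrote.
    for key in ('root-password', 'sst-password'):
        if not leader_get(key):
            leader_set({key: pwgen()})
            log('Seeded missing leader setting {}'.format(key))
    # ...then re-evaluate relations the old leader may have missed.


if __name__ == '__main__':
    try:
        hooks.execute(sys.argv)
    except UnregisteredHookError as e:
        log('Unknown hook {} - skipping.'.format(e))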
Tim Penhey (thumper) wrote : | #17 |
I'm going to mark the Juju task invalid for now then based on John's comments above.
Changed in juju:
milestone: 2.3.0 → 2.3-rc1
status: Triaged → Invalid
milestone: 2.3-rc1 → none
assignee: Andrew Wilkins (axwalk) → nobody
James Page (james-page) wrote : | #18 |
Setting Juju bug back to New; we can improve the charm but leader switching mid hook execution makes writing charms harder, so we should see if things can be improved.
Changed in juju:
status: Invalid → New
Ryan Beisner (1chb1n) wrote : | #19 |
Agree with James.
Changing leader whilst a hook is executing on the leader is not something we should expect charms and charmers to trap.
Ryan Beisner (1chb1n) wrote : | #20 |
Can Juju also document the assurances made for leadership election, when/why it is determined to be changed, etc? This would be helpful documentation for charm authors to reference.
John A Meinel (jameinel) wrote : | #21 |
It is taking you 2.5 min to go from "is_leader" until we get to "leader_set". If it is taking that long, your system is under enough load that we apparently are unable to guarantee keep-alives. (We need a refresh of leadership, which is done by the unit agent every 30s and extends the leadership for another 1 minute.)
I don't know what exactly is causing it to take 2.5 min, but if we can't get a network request through once a minute then we would allow leadership to lapse.
Dmitrii Shcherbakov (dmitriis) wrote : | #22 |
Sorry, it's a long message but I've got meaningful stuff there (I think).
https:/
https:/
The behavior I encountered in a duplicate bug got me thinking about how to fix this problem at both Juju and charm levels (both will need modifications).
TL;DR:
Juju: revive "leader-deposed" hook work - actually run that hook instead of a no-op (see https:/
Charmers: Modify charms with service-level leadership (not only Juju-level) to use leader-deposed.
Juju: Document when is_leader no longer returns TRUE, and think about leader transactions (where a leader executes code and cannot be deposed until it finishes execution or its process dies), or document operation interruption semantics (if any).
========
Topic 1.
Description:
For clarity, I will name 2 levels of leadership:
* level 1 (L1): Juju-level per-application unit leadership (a leader unit is an actor here);
* level 2 (L2): application-
What happened (pad.lv/1732257)?
L1 leader got elected and started bootstrapping a cluster so L2 leader got created => L1 leader == L2 leader
L1 minions have not done <peer>-
L1-minion-0 got installed and joined a peer relation with the L1 leader but there are only 2/3 peers (min-cluster-size config option gating) => L2-minion-0 has NOT been set up yet (2/3, not clustered, not an L1 leader - no config rendering, no process running).
L1-leader got deposed, however, did not perform any action to depose L2 leader => **L1-minion-2**
L1-minion-1 became L1-leader and **started** bootstrapping a new cluster => L1 leader != L2 leader => 2 L2 leaders present!
L1-minion-0 started its service and spawned an L2 minion which got cluster state from L1-minion-2 (the old L1 and now contending L2 leader) ***before it got it from L1-leader*** => 2 leaders and 1 minion present - hit a RACE CONDITION on L2
L1-leader (new) set a new bootstrap_uuid leader bucket setting which is inconsistent with L2 UUIDs at L1-minion-0 and L1-minion-2 => hook errors at both L1-minion-0 and L1-minion-2
So in the final state there are no errors on L1-leader (new) as it has bootstrap_uuid that was set by it via leader-set (leader_
2 minions are in a separate L2 cluster and have service-level UUIDs that are inconsistent with the leader setting.
AFAICS Juju already has a somewhat transactional nature for leadership changes - there is a "Resign" operation and a "leader-deposed" hook which apparently is not run (it is a no-op):
https:/
2017-11-14 17:21:32 INFO juju.worker.
2017-11-14 17:21:32 DEBUG juju.worker.
tags: added: cpe-onsite
tags: added: uosci
Changed in juju:
status: New → Triaged
John A Meinel (jameinel) wrote : | #23 |
I'm not sure that there is a logic bug in Juju, but we should understand what is going on in the system that is causing us to not refresh leadership correctly. I think the discussion around leader-elected is still relevant.
I'm not sure how much leader-disposed would actually help in this particular case. If you're in the middle of a hook, and you've ever called is_leader should we kill the execution of that script if we want to depose you?
It might work for some of the other cases where you need to tear things down that only got partially set up. Units still get a leader-
Dmitrii Shcherbakov (dmitriis) wrote : | #24 |
Could we provide a guarantee that no unit of a given application will ever consider itself a leader until a previous leader has been deposed in apiserver's view? Likewise, an apiserver should not give any leader tokens until it receives a confirmation that the previous leader has been deposed and ran that hook.
The latter condition is a strong requirement as if there is a network partition and a unit agent is no longer available, apiserver will never elect a new leader. If we introduce a timeout for that this may result in a split-brain unless a unit agent is required to stop executing further operations if there is a connection loss with the apiserver.
We cannot just stop a hook execution because a charm may inherently spawn threads and processes of its own accord, which may daemonize and do other arbitrary things on a system during hook execution. Any process-tracking mechanisms are operating-system-specific (e.g. cgroups) and they can be escaped, so we shouldn't even look that way.
The complicated part is that a unit <-> apiserver connection may be lost but a service-level network may be fine (i.e. the loss of L1-relevant connectivity doesn't mean services on L2 have the same picture) - this is the case where we have ToR and BoR switches providing service and management networks respectively on different physical media (switch fabrics). This is a common scenario for us (that's why we have network spaces). In other words: there may be an L1-related partition but not L2-related partition.
I think that in this case a partitioned unit should run leader-deposed which may run L2-related checks to see if this is only the unit <-> apiserver connectivity problem. This is an interesting scenario as the unit agent is isolated in this case and cannot get anything from the apiserver (can't do facade RPC). However, I think this is a useful scenario to model.
As an operator, would you do something like that with your system? Probably yes, you would go out-of-band or in-person and check if this problem impacts only Juju-related connectivity and decide upon service-level impact - this is what you should have in the charm in leader-deposed hook.
===
Now, as to having only one per-app leader unit running at a time, I believe this is, at least partially, present in Juju.
https:/
// setMinion arranges for lease acquisition when there's an opportunity.
func (t *Tracker) setMinion() error {
    ...
    t.claimLease = make(chan struct{})
    go func() {
        defer close(t.claimLease)
        logger.
        err := t.claimer.
        if err != nil {
            logger.
        }
The only part I have not found yet is explicit blocks on leader-deposed on the apiserver side.
What I think we need:
1. leadership-tracker tries to renew the lease;
2. fails as the token has expired;
3. runs the leader-deposed hook;
4. meanwhile, apiserver doesn't allow anybody else to claim leadership until it got EXPLICIT notificatio...
Ante Karamatić (ivoks) wrote : | #25 |
This behavior is critical for us.
Tim Penhey (thumper) wrote : | #26 |
A key problem we have here is that Juju really can't give any guarantees. I spent some time last week talking with Ryan about what Juju can and can't say at any particular point in time.
The short answer is no, Juju cannot guarantee that a new leader won't be elected until a leader deposed hook is executed because the old leader might not be communicative. Consider the situation where there is a hardware failure, and the machine just dies. There is no way for it to run the hook, and if we are waiting, no other unit would ever be elected leader. This isn't reasonable.
Considering that we can't make this guarantee, we shouldn't rely on it.
No, AFAIK we don't have any explicit waits on other units running leader-deposed.
Tim Penhey (thumper) wrote : | #27 |
I think a key thing to note here is the term "guarantee". I think I may have been taking too hard a line with guarantee.
The key thing to think about here is that the leader "shouldn't" change under normal circumstances. So the situations that are causing a leadership change should be the exceptional circumstances.
To be clear, as long as the agents are able to communicate, the leadership shouldn't change.
All the sharp edge cases are at the exceptional edge though. Why would communication drop?
* net splits - I'm still not clear on what causes a net split
* hardware failures
* severely overloaded servers - we should work out how to be more aware of this, perhaps via the number of running API calls.
Dmitrii Shcherbakov (dmitriis) wrote : | #28 |
Tim,
net split example: you have Juju controllers and MAAS region controllers sitting on layer 2 networks different from rack controllers and application servers in a data center. E.g. there are 9 racks to manage in different locations within the same DC but you would like to keep the same Juju & MAAS regiond control plane located separately so that you can add more racks. In this case there may be a situation where you lose access to one management network for rack "k" from a Juju controller which is a primary in a replicaset. It's a net split but your applications are unaffected - only machine & unit agents.
I think that what we encounter is mostly deployment-time problems because after a model has converged there is little use for Juju leadership hooks. It may be needed if you need to scale your infrastructure (deployment time again) but by then service-level clustering will have already been done.
Another use-case is rolling upgrades: a single unit should initiate them even if the "rolling" part is managed at the service level. But there are two different types of rolling upgrades:
1. for stateless applications - ordering of operations (by a leader) should be done on the Juju side as this is operator-driven if done manually in many cases. Otherwise we will need a "software-upgrader" application which will have to handle that and maintain the deployment state;
2. stateful applications - service-level quorum awareness is required so a leader unit only initiates an upgrade which is done in software itself.
In the cases I've seen we go through the following logic:
1. a leader unit defines who will bootstrap a service-level cluster;
2. service-level elections are performed (ordered connections to a master, PAXOS, RAFT, Totem RRP etc.);
3. leadership is managed at the service level. Leader settings contain an indication of a completed bootstrap procedure and leadership hooks are no-ops.
A practical example:
1. percona cluster (master bootstraps, slaves join without bootstrapping);
2. new slaves join the quorum;
3. any service-level failure conditions require disaster recovery and manual intervention.
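A minimal sketch of the "leader settings record the completed bootstrap" pattern described in steps 1-3 above; bootstrap_cluster() and the bootstrap-uuid key are hypothetical stand-ins, with only the charmhelpers calls being real API:

# Sketch only: gate service-level bootstrap on a leader setting so that
# a later (re-)elected Juju leader does not bootstrap a second cluster.
import uuid

from charmhelpers.core.hookenv import (
    is_leader,
    leader_get,
    leader_set,
    log,
)


def bootstrap_cluster():
    """Hypothetical service-level bootstrap (e.g. start pxc as master)."""
    raise NotImplementedError


def ensure_bootstrapped():
    if leader_get('bootstrap-uuid'):
        # A previous leader already bootstrapped; just join that cluster.
        return
    if not is_leader():
        # Wait for the leader to bootstrap and publish bootstrap-uuid.
        return
    bootstrap_cluster()
    # Record completion so no future leader bootstraps again.
    leader_set({'bootstrap-uuid': str(uuid.uuid4())})
    log('Recorded cluster bootstrap in leader settings')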
Canonical Juju QA Bot (juju-qa-bot) wrote : | #29 |
This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.
Changed in juju:
importance: High → Low
tags: added: expirebugs-bot
Looking at the mysql log data:
./12/lxd/6/var/log/juju/unit-mysql-2.log
2017-10-27 16:26:09 INFO juju.worker.uniter resolver.go:104 found queued "install" hook
2017-10-27 16:42:24 INFO juju.worker.uniter resolver.go:104 found queued "leader-elected" hook
2017-10-27 16:42:24 DEBUG juju.worker.uniter.operation executor.go:69 running operation run leader-elected hook
2017-10-27 16:42:24 DEBUG juju.worker.uniter.operation executor.go:100 preparing operation "run leader-elected hook"
2017-10-27 16:42:24 DEBUG juju.worker.uniter.operation executor.go:100 executing operation "run leader-elected hook"
2017-10-27 16:42:24 DEBUG juju.worker.uniter agent.go:17 [AGENT-STATUS] executing: running leader-elected hook
2017-10-27 16:42:25 INFO juju-log Unknown hook leader-elected - skipping.
2017-10-27 16:44:04 INFO juju.worker.uniter.operation runhook.go:113 ran "leader-elected" hook
2017-10-27 16:44:04 DEBUG juju.worker.uniter.operation executor.go:100 committing operation "run leader-elected hook"
./0/lxd/6/var/log/juju/unit-mysql-0.log
2017-10-27 16:25:56 INFO juju.worker.uniter resolver.go:104 found queued "install" hook
2017-10-27 16:35:30 INFO juju.worker.uniter resolver.go:104 found queued "leader-elected" hook
2017-10-27 16:35:30 DEBUG juju.worker.uniter.operation executor.go:69 running operation run leader-elected hook
2017-10-27 16:35:30 DEBUG juju.worker.uniter.operation executor.go:100 preparing operation "run leader-elected hook"
2017-10-27 16:35:30 DEBUG juju.worker.uniter.operation executor.go:100 executing operation "run leader-elected hook"
2017-10-27 16:35:30 DEBUG juju.worker.uniter agent.go:17 [AGENT-STATUS] executing: running leader-elected hook
2017-10-27 16:35:31 INFO juju-log Unknown hook leader-elected - skipping.
2017-10-27 16:36:50 INFO juju.worker.uniter.operation runhook.go:113 ran "leader-elected" hook
2017-10-27 16:36:50 DEBUG juju.worker.uniter.operation executor.go:100 committing operation "run leader-elected hook"
2017-10-27 16:43:57 INFO juju.worker.uniter resolver.go:104 found queued "leader-elected" hook
2017-10-27 16:43:57 DEBUG juju.worker.uniter.operation executor.go:69 running operation run leader-elected hook
2017-10-27 16:43:57 DEBUG juju.worker.uniter.operation executor.go:100 preparing operation "run leader-elected hook"
2017-10-27 16:43:57 DEBUG juju.worker.uniter.operation executor.go:100 executing operation "run leader-elected hook"
2017-10-27 16:43:57 DEBUG juju.worker.uniter agent.go:17 [AGENT-STATUS] executing: running leader-elected hook
2017-10-27 16:43:59 INFO juju-log Unknown hook leader-elected - skipping.
2017-10-27 16:44:58 INFO juju.worker.uniter.operation runhook.go:113 ran "leader-elected" hook
2017-10-27 16:44:58 DEBUG juju.worker.uniter.operation executor.go:100 committing operation "run leader-elected hook"
pxc is only installed once the lead unit has actually set the cluster root and SST passwords into leader storage; it would appear that at the time of install, none of the units was the leader, so the data was never seeded into leader storage.