Sorry, it's a long message, but I think there's meaningful stuff in it.

https://bugs.launchpad.net/charm-percona-cluster/+bug/1732257/comments/2
https://bugs.launchpad.net/charm-percona-cluster/+bug/1732257/comments/3

The behavior I encountered in a duplicate bug got me thinking about how to fix this problem at both the Juju and the charm level (both will need modifications).

TL;DR:

* Juju: revive the "leader-deposed" hook work - actually run that hook instead of a no-op (see https://git.io/vF1Jn);
* Charmers: modify charms that have service-level leadership (not only Juju-level) to use leader-deposed;
* Juju: document when is_leader no longer returns True, and think about leader transactions (where a leader executes code and cannot be deposed until it finishes execution or its process dies), or document operation interruption semantics (if any).

========

Topic 1.

Description:

For clarity, I will name 2 levels of leadership:

* level 1 (L1): Juju-level per-application unit leadership (a leader unit is an actor here);
* level 2 (L2): application-specific or service-specific leadership (a percona cluster process in this case; there is no explicit mapping from L2 to L1 or from L1 to L2 leadership).

What happened (pad.lv/1732257)?

* The L1 leader got elected and started bootstrapping a cluster, so an L2 leader got created => L1 leader == L2 leader.
* The L1 minions had not run -relation-joined yet => the L1 leader could not tell them that it is the L2 leader, and there were no L2 minion processes yet => it waited for more -relation-{joined,changed} events.
* L1-minion-0 got installed and joined a peer relation with the L1 leader, but there were only 2/3 peers (min-cluster-size config option gating) => L2-minion-0 had NOT been set up yet (2/3, not clustered, not an L1 leader - no config rendering, no process running).
* The L1 leader got deposed but performed no action to depose the L2 leader => the old leader became **L1-minion-2**.
* L1-minion-1 became the L1 leader and **started** bootstrapping a new cluster => L1 leader != L2 leader => 2 L2 leaders present!
* L1-minion-0 started its service and spawned an L2 minion which got its cluster state from L1-minion-2 (the old L1 leader and now a contending L2 leader) ***before it got it from the new L1 leader*** => 2 leaders and 1 minion present - hit a RACE CONDITION on L2.
* The new L1 leader set a new bootstrap_uuid value in the leader-settings bucket, which is inconsistent with the L2 UUIDs at L1-minion-0 and L1-minion-2 => hook errors at both L1-minion-0 and L1-minion-2.

So in the final state there are no errors on the new L1 leader, since it holds the bootstrap_uuid it set itself via leader-set (leader_settings_bootstrap_uuid == L1-leader_local_service_bootstrap_uuid), while the 2 minions are in a separate L2 cluster and have service-level UUIDs that are inconsistent with the leader setting.
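To make the charm-side part of the TL;DR concrete, here is what a leader-deposed hook could look like for a charm with L2 leadership, once Juju actually runs that hook. This is a minimal sketch only: the marker file, data directory and helper names are assumptions for illustration, not the actual percona-cluster charm code.

#!/usr/bin/env python
# Hypothetical hooks/leader-deposed for a charm with L2 (service-level)
# leadership. Paths and helpers below are illustrative assumptions.

import os
import shutil
import subprocess

SEEDED_MARKER = '/var/lib/mysql/seeded'       # assumed "bootstrap done" marker
DATA_DIR = '/var/lib/percona-xtradb-cluster'  # assumed persistent state dir


def bootstrap_completed():
    """Assume a marker file records a fully completed L2 bootstrap."""
    return os.path.exists(SEEDED_MARKER)


def debootstrap():
    """Kill the L2 leader process and clean its persistent state."""
    subprocess.check_call(['service', 'mysql', 'stop'])
    if os.path.isdir(DATA_DIR):
        shutil.rmtree(DATA_DIR)


if __name__ == '__main__':
    # Only throw away an UNFINISHED bootstrap; a deposed unit that
    # already seeded a healthy cluster must not be debootstrapped.
    if not bootstrap_completed():
        debootstrap()

The guard is the key point: a deposed unit that already seeded a healthy cluster must not be debootstrapped; only an unfinished bootstrap is safe to throw away (more on the Juju side right below).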
AFAICS Juju already has a somewhat transactional nature for leadership changes - there is a "Resign" operation and a "leader-deposed" hook, which apparently is not run (a no-op):
https://github.com/juju/juju/blame/juju-2.2.6/worker/uniter/operation/leader.go#L74-L79

2017-11-14 17:21:32 INFO juju.worker.uniter.operation runhook.go:113 ran "shared-db-relation-changed" hook
2017-11-14 17:21:32 DEBUG juju.worker.uniter.operation executor.go:100 committing operation "run relation-changed (88; neutron-api/1) hook"
2017-11-14 17:21:33 DEBUG juju.worker.uniter.operation executor.go:89 lock released
2017-11-14 17:21:33 DEBUG juju.worker.uniter.operation executor.go:69 running operation resign leadership <---- resign OP
2017-11-14 17:21:33 DEBUG juju.worker.uniter.operation executor.go:100 preparing operation "resign leadership"
2017-11-14 17:21:33 DEBUG juju.worker.uniter.operation executor.go:100 executing operation "resign leadership"
2017-11-14 17:21:33 WARNING juju.worker.uniter.operation leader.go:115 we should run a leader-deposed hook here, but we can't yet <---- leader-deposed
2017-11-14 17:21:33 DEBUG juju.worker.uniter.operation executor.go:100 committing operation "resign leadership"

So, reviving this work would allow us to debootstrap (as in "kill the service and clean persistent state") the old L2 leader **IF** the L2 bootstrap procedure has NOT been completed yet. At the charm level, a charm author has to be careful not to debootstrap on every leader-deposed, of course.

=====

Topic 2.

What I am interested in is: at which point do we start getting is_leader == False?

2017-11-14 17:21:27 INFO juju.worker.leadership tracker.go:208 mysql leadership for mysql/2 denied
2017-11-14 17:21:27 DEBUG juju.worker.leadership tracker.go:230 notifying mysql/2 ticket of impending loss of mysql leadership
2017-11-14 17:21:27 DEBUG juju.worker.leadership tracker.go:215 mysql/2 waiting for mysql leadership release
2017-11-14 17:21:27 DEBUG juju.worker.leadership tracker.go:269 mysql/2 is not mysql leader
2017-11-14 17:21:27 DEBUG juju.worker.uniter.remotestate watcher.go:394 got leadership change: minion
^^^ is it here?
2017-11-14 17:21:27 DEBUG worker.uniter.jujuc server.go:178 running hook tool "is-leader"
^^^ what is the result?

====

Topic 3.

Now for the odd part, with actual data unrelated to leadership (https://bugs.launchpad.net/charm-percona-cluster/+bug/1732257/+attachment/5009043/+files/juju-crashdump-b65366d4-4e66-42b7-98a3-cab719fd1529.tar.gz).

The peer relation for the percona-cluster application is called "cluster", and the application is named mysql in the model.

cluster-relation-joined counts:
unit-mysql-0: 1 # <--- ?
unit-mysql-1: 2
unit-mysql-2: 2

While unit-mysql-0 had only one -joined event, it could still observe *2* other peers.

executor operation related to joining mysql/2 on mysql/0:

2017-11-14 17:22:52 DEBUG juju.worker.uniter.operation executor.go:100 preparing operation "run relation-joined (28; mysql/2) hook"
2017-11-14 17:22:52 DEBUG juju.worker.uniter.operation executor.go:100 executing operation "run relation-joined (28; mysql/2) hook"
2017-11-14 17:22:52 DEBUG juju.worker.uniter agent.go:17 [AGENT-STATUS] executing: running cluster-relation-joined hook

previous executor operation:

2017-11-14 17:26:37 INFO juju-log shared-db:88: Insufficient number of peer units to form cluster (expected=3, got=2)
...
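(Aside: the "Insufficient/Sufficient number of peer units" messages in these excerpts come from min-cluster-size gating of roughly the following shape. This is a sketch built on charm-helpers primitives (relation_ids, related_units, config, log); the actual percona-cluster code may differ in details. One hedged observation relevant to the question in the next excerpt: related_units() is backed by the relation-list hook tool, which reports the relation's current membership rather than a count of -joined hooks delivered to this unit, so the two numbers need not match.)

# Sketch of min-cluster-size gating using charm-helpers primitives;
# the real percona-cluster implementation may differ.

from charmhelpers.core.hookenv import config, log, related_units, relation_ids


def sufficient_peers():
    min_size = int(config('min-cluster-size'))
    peers = []
    for rid in relation_ids('cluster'):
        # Current relation membership, NOT a tally of -joined events.
        peers.extend(related_units(rid))
    got = len(peers) + 1  # related_units() excludes the local unit
    if got < min_size:
        log('Insufficient number of peer units to form cluster'
            ' (expected=%d, got=%d)' % (min_size, got))
        return False
    log('Sufficient number of peer units to form cluster %d' % got)
    return True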
the next operation:

2017-11-14 17:31:05 DEBUG juju.worker.uniter.operation executor.go:100 preparing operation "run relation-changed (28; mysql/2) hook"
2017-11-14 17:31:05 DEBUG juju.worker.uniter.operation executor.go:100 executing operation "run relation-changed (28; mysql/2) hook"
2017-11-14 17:31:05 DEBUG juju.worker.uniter agent.go:17 [AGENT-STATUS] executing: running cluster-relation-changed hook
2017-11-14 17:31:48 INFO juju-log cluster:28: Sufficient number of peer units to form cluster 3
^^^ how is that possible? The logs never reported a join event for mysql/1 - how did it become visible in this state snapshot? ^^^
2017-11-14 17:31:52 DEBUG juju.worker.uniter.operation executor.go:100 committing operation "run relation-changed (28; mysql/2) hook"
2017-11-14 17:31:52 DEBUG juju.worker.uniter.operation executor.go:89 lock released
2017-11-14 17:31:52 DEBUG juju.worker.uniter.operation executor.go:69 running operation run leader-settings-changed hook
2017-11-14 17:31:57 ERROR juju-log bootstrap uuid differs: 0dde54c4-c961-11e7-9775-2f72db67cbda != b3e1b7e8-c95f-11e7-91fa-56c8581eb616
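For context, the final "bootstrap uuid differs" error is a consistency check of roughly this shape: the L1 leader publishes the UUID of the cluster it bootstrapped via leader-set, and each unit compares it with the Galera cluster state UUID its local mysqld actually joined. A minimal sketch, assuming a 'bootstrap-uuid' leader-settings key and passwordless access to the local mysqld (the real charm code differs in details such as credentials and error handling):

# Sketch of the check behind "bootstrap uuid differs": compare the
# leader-published bootstrap UUID with the cluster state UUID the
# local mysqld actually has.

import subprocess

from charmhelpers.core.hookenv import leader_get


def local_cluster_uuid():
    # Galera exposes the cluster state UUID as a status variable.
    out = subprocess.check_output(
        ['mysql', '-NBe', "SHOW STATUS LIKE 'wsrep_cluster_state_uuid'"],
        universal_newlines=True)
    return out.split()[-1]


def check_bootstrap_uuid():
    expected = leader_get('bootstrap-uuid')  # assumed leader-settings key
    actual = local_cluster_uuid()
    if expected and actual != expected:
        # mysql/0 and mysql/2 end up here: they joined the cluster seeded
        # by the old L1 leader, not the one the new L1 leader published.
        raise RuntimeError('bootstrap uuid differs: %s != %s'
                           % (actual, expected))

This also shows why the new L1 leader itself stays error-free: its local UUID is, by construction, the one it published via leader-set.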