enable-ha can end up in an unsolvable state when there is an error during deployment
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Canonical Juju | Fix Released | Medium | John A Meinel | |
Bug Description
Scenario:
$ juju bootstrap maas
$ juju enable-ha -n 3 --to machine1,machine2
While machine1 is allocated but not yet deployed, kill it in MAAS. (This can also be done later; the Juju machine will then show as 'DOWN' rather than 'PENDING'.)
Wait for deployment on machine2 to finish, and then:
$ juju enable-ha -n 3 --to machine1
This will demote machine 1@machine1 (still in pending/down state) and add machine 3@machine1.
After all the machines are deployed and the HA cluster is working (machines 0, 2, 3 in 'ha-enabled' state), we have the following situation:
Machine State DNS Inst id Series AZ Message
0 started 10.2.15.254 ct3y86 xenial default Deployed
1 pending 10.2.0.3 b73t3a xenial default Deploying: ubuntu/
2 started 10.2.0.4 tg34tr xenial default Deployed
3 started 10.2.0.3 b73t3a xenial default Deployed
That is, two Juju machines (1 and 3) are set up on one MAAS machine (b73t3a).
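The duplicate can be spotted mechanically by grouping the status rows by instance id. The following is only a minimal sketch: `find_shared_instances` is a hypothetical helper, not part of Juju, and it assumes the whitespace-separated column layout pasted in this report (machine id first, instance id fourth, DNS column always populated).

```python
from collections import defaultdict


def find_shared_instances(status_text):
    """Return {instance_id: [machine ids]} for instance ids claimed by
    more than one Juju machine.

    Assumes the column layout shown in this report; a row with an empty
    DNS column would shift the fields and be misread.
    """
    machines_by_instance = defaultdict(list)
    for line in status_text.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything too short to be a machine row.
        if len(fields) < 4 or fields[0] == "Machine":
            continue
        machine_id, instance_id = fields[0], fields[3]
        machines_by_instance[instance_id].append(machine_id)
    return {
        inst: ids for inst, ids in machines_by_instance.items() if len(ids) > 1
    }


status = """
Machine State DNS Inst id Series AZ Message
0 started 10.2.15.254 ct3y86 xenial default Deployed
1 pending 10.2.0.3 b73t3a xenial default Deploying: ubuntu/
2 started 10.2.0.4 tg34tr xenial default Deployed
3 started 10.2.0.3 b73t3a xenial default Deployed
"""

print(find_shared_instances(status))  # {'b73t3a': ['1', '3']}
```

Any non-empty result means at least two Juju machines point at the same MAAS node, which is the condition that makes removing one of them dangerous for the other.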
To clean up the HA state:
$ juju enable-ha -n 3
maintaining machines: 0, 2, 3
removing machines: 1
And now we can safely remove machine 1:
$ juju remove-machine 1 --force
According to juju everything is OK:
Machine State DNS Inst id Series AZ Message
0 started 10.2.15.254 ct3y86 xenial default Deployed
2 started 10.2.0.4 tg34tr xenial default Deployed
3 started 10.2.0.3 b73t3a xenial default Deployed
But in fact b73t3a was decommissioned from MAAS by Juju and is really dead (Juju doesn't notice this for quite some time; IMHO it should be more robust).
I haven't found a way to remove this pending/dead 'doppelganger' machine from the machine list. Maybe a '--no-action' switch for remove-machine is needed?
tags: added: 4010
description: updated
Changed in juju:
status: New → Triaged
importance: Undecided → Medium
tags: added: ha polish remove-machine
tags: added: cpe-onsite
Changed in juju:
status: Fix Committed → Fix Released
This is a general recipe for disaster: add a machine, remove it in MAAS (in the real world this would be, e.g., an HDD failure, where all hardware identifiers stay the same but the machine is 'clean'), and then add it again in Juju. We end up with:
Machine State DNS Inst id Series AZ Message
0 down 10.2.0.7 axp6tp xenial default Deployed
1 started 10.2.0.7 axp6tp xenial default Deployed