Remove MAAS machines in Failed_Deployment state
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Canonical Juju | Fix Released | High | Ian Booth | |
| 2.2 | Fix Released | High | Ian Booth | |
Bug Description
Juju version: 2.0.2
MaaS version: 2.1
Juju does not have a unique constraint on MaaS provider instance IDs. In some circumstances this leads Juju to add multiple machines to a model that reference the same physical machine when using the MaaS provider.
This can have serious consequences: a remove-machine operation that appears benign to the operator could have severe unwanted and unexpected side effects.
Excerpt from juju status:
Model Controller Cloud/Region Version
openstack maas-controller maas 2.0.2
App Version Status Scale Charm Store Rev OS
ceph-osd 10.2.5 waiting 17/25 ceph-osd jujucharms 239 ubuntu
ntp 4.2.8p4+dfsg active 17/18 ntp jujucharms 16 ubuntu
rsyslog-
Unit Workload Agent Machine Public address Ports Message
ceph-osd/10 waiting allocating 87 a.b.c.d waiting for machine
ceph-osd/14 active idle 105 a.b.c.d Unit is ready (11 OSD)
ntp/102 active idle a.b.c.d Unit is ready
rsyslog-
Machine State DNS Inst id Series AZ
87 down a.b.c.d df8mwd xenial default
105 started a.b.c.d df8mwd xenial default
As you can imagine, if an operator removes machine 87 from the model to clean up, Juju would tell MaaS to release the very same machine that hosts machine 105 in the model, wiping out its payload.
The model ended up in this state because of the following chain of events.
1) Call to juju add-unit ceph-osd
2) Juju requests a new machine from MaaS, but the machine ends up in the FAILED_DEPLOYMENT state
3) The user releases the machine from MaaS; the machine, now in the down state, is not removed from the Juju model
4) Call to juju add-unit ceph-osd
5) Juju requests a new machine from MaaS and gets the same machine back; this time deployment succeeds (a rough CLI reproduction of this sequence is sketched below)
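The sketch assumes the MAAS 2.1 CLI form for releasing a node; `admin` is a placeholder profile name and `df8mwd` stands in for the node's system ID.

```
# 1/2) Ask Juju for another ceph-osd unit; the provisioner acquires a MAAS node
#      which then fails to deploy, leaving a "down" machine in the model.
juju add-unit ceph-osd
juju status ceph-osd

# 3) Release the node directly in MAAS instead of removing the Juju machine.
#    (Assumed MAAS 2.x CLI form; "admin" is a placeholder profile name.)
maas admin machine release df8mwd

# 4/5) Ask for one more unit; MAAS hands the same node back, deployment succeeds,
#      and two Juju machines now carry the same instance id (87 and 105 above).
juju add-unit ceph-osd
```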
This issue can manifest itself in other ways as well: what happens if the duplicate is in another model? Juju status won't advertise the duplicate, so you could easily terminate someone else's machine.
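Until there is a proper unique constraint, a crude client-side check can at least surface duplicates across models. The sketch below assumes `jq` is available, that `juju models --format=json` exposes a `name` per model, and that `juju status --format=json` reports an `instance-id` per machine (with `pending` as the placeholder for unprovisioned machines).

```
#!/bin/sh
# List every machine on the controller as "<instance-id> <model>/<machine>",
# then flag any instance id that appears more than once.
for model in $(juju models --format=json | jq -r '.models[].name'); do
  juju status -m "$model" --format=json |
    jq -r --arg m "$model" '
      (.machines // {}) | to_entries[]
      | select(.value["instance-id"] != "pending")
      | "\(.value["instance-id"]) \($m)/\(.key)"'
done |
  awk 'seen[$1]++ { print "instance id " $1 " already in use, also claimed by " $2 }'
```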
tags: added: sts
tags: added: maas-provider
Changed in juju:
milestone: 2.3.0 → 2.3-beta2
status: Triaged → Fix Released
assignee: nobody → Ian Booth (wallyworld)
tags: added: 4010
Manually destroying machines under Juju is always going to be a bit problematic. Ideally we wouldn't have an instance id associated with a machine that failed to deploy.
If you didn't manually touch MAAS, could you "juju remove-machine 87" before freeing the machine in MAAS? You might need --force to indicate that it is OK to remove it even though we can't talk to its agent.
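A sketch of that ordering, with the same placeholder MAAS profile and system ID as above:

```
# Remove the dead machine from the Juju model first; --force because the
# machine agent never started and cannot be contacted.
juju remove-machine 87 --force

# Only then release the node in MAAS, so a later add-unit cannot end up
# sharing an instance id with an existing model machine.
maas admin machine release df8mwd
```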
Conceptually it would be nice if we could notice that the link to the machine we asked for was broken.
However, if you consider multiple users and multiple Juju controllers, there is nothing that Juju itself can do to see that the same machine was provisioned for another instance.
One argument was that MAAS IDs should be per-request rather than handing out an identical handle across multiple provision + release + acquire steps.
It might be possible for Juju to notice that a machine we had in failed deployment then moved, out of band, out of its last known state. At best that feels racy, because the machine may transition states faster than we can notice. (It may go back into FAILED_DEPLOYMENT a second time, and then you have two machines that reference the same underlying machine and neither knows about the other.)
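To illustrate why such detection is racy, a naive out-of-band monitor might just poll the MAAS API for the node's status; anything that happens between two polls is invisible. The `machine read` call and `status_name` field are assumptions about the MAAS 2.x CLI.

```
# Poll the node's MAAS status every 30 seconds (placeholder profile/system id).
# A release + re-acquire that completes between two polls is missed entirely,
# which is exactly the race described above.
while true; do
  maas admin machine read df8mwd | jq -r '.status_name'
  sleep 30
done
```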
Is it that much of a problem to "remove-machine" before you release the machine in MAAS directly?