Remove MAAS machines in Failed_Deployment state

Bug #1671588 reported by Frode Nordahl
Affects         Status        Importance  Assigned to  Milestone
Canonical Juju  Fix Released  High        Ian Booth
2.2             Fix Released  High        Ian Booth

Bug Description

Juju version: 2.0.2
MaaS version: 2.1

Juju does not have a unique constraint for MaaS provider instance IDs. In some circumstances this leads Juju to add multiple machines to a model that reference the same physical machine when using the MaaS provider.

This can have severe consequences: a remove-machine operation that appears benign to the operator could have unwanted and unexpected side effects.

Excerpt from juju status:
Model Controller Cloud/Region Version
openstack maas-controller maas 2.0.2

App Version Status Scale Charm Store Rev OS
ceph-osd 10.2.5 waiting 17/25 ceph-osd jujucharms 239 ubuntu
ntp 4.2.8p4+dfsg active 17/18 ntp jujucharms 16 ubuntu
rsyslog-forwarder-ha unknown 17/18 rsyslog-forwarder-ha jujucharms 7 ubuntu

Unit Workload Agent Machine Public address Ports Message
ceph-osd/10 waiting allocating 87 a.b.c.d waiting for machine
ceph-osd/14 active idle 105 a.b.c.d Unit is ready (11 OSD)
  ntp/102 active idle a.b.c.d Unit is ready
  rsyslog-forwarder-ha unknown idle a.b.c.d

Machine State DNS Inst id Series AZ
87 down a.b.c.d df8mwd xenial default
105 started a.b.c.d df8mwd xenial default

As you can imagine, if an operator removes machine 87 from the model to clean up, Juju would tell MaaS to release the very same machine that hosts machine 105 in the model, wiping out its payload.
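
A quick way to spot such duplicates is to group machines by instance ID in the status output. This is only a sketch: it assumes the Juju 2.x "juju status --format=json" layout and that jq is installed.

  # Print any instance ID that is referenced by more than one Juju machine.
  $ juju status --format=json \
      | jq -r '.machines | to_entries[] | "\(.value["instance-id"]) \(.key)"' \
      | sort \
      | awk '{ m[$1] = m[$1] " " $2; c[$1]++ } END { for (id in c) if (c[id] > 1) print id ":" m[id] }'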

The model ended up in this state because of the following chain of events.
1) Call to juju add-unit ceph-osd
2) Juju requests a new machine from MaaS, but the machine ends up in the FAILED_DEPLOYMENT state
3) User releases the machine from MaaS; the machine, now in the down state, is not removed from the Juju model
4) Call to juju add-unit ceph-osd
5) Juju requests a new machine from MaaS and gets the same machine; this time deployment succeeds

This issue can manifest itself in other ways as well: what happens if the duplicate is in another model? Juju status won't advertise the duplicate, so you could easily terminate someone else's machine.

Frode Nordahl (fnordahl)
tags: added: sts
Revision history for this message
John A Meinel (jameinel) wrote :

Manually destroying machines under Juju is always going to be a bit problematic. Ideally we wouldn't have an instance id associated with a machine that failed to deploy.
If you didn't manually touch MAAS, could you "juju remove-machine 87" before freeing the machine on MAAS? You might need --force to indicate that it is ok to remove it even though we can't talk to its agent.

Conceptually it would be nice if we could notice that the link to the machine we asked for was broken.

However, if you consider multiple users and multiple Juju controllers, there is nothing that Juju itself can do to see that the same machine was provisioned for another instance.

One argument was that MAAS IDs should be per request rather than giving an identical handle out across multiple provision + release + acquire steps.

It might be possible for Juju to notice that a machine whose deployment failed then, out of band, moved out of its last known state. At best that feels racy, because the machine may transition states faster than we can notice. (It may go back into FAILED_DEPLOYMENT a second time, and you end up with 2 machines that reference the same underlying machine, with neither knowing about the other.)

Is it that much of a problem to "remove-machine" before you release the machine in MAAS directly?
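
As a concrete sketch of that ordering (machine 87 from the report; the MAAS CLI profile name and node system ID are placeholders, and the release command assumes the MAAS 2.x CLI):

  # Remove the failed machine from the Juju model first; --force is needed
  # because its agent never came up, so Juju can't reach it.
  $ juju remove-machine --force 87

  # Only after the model no longer references the node, release it in MAAS.
  $ maas <profile> machine release <system-id>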

Revision history for this message
John A Meinel (jameinel) wrote :

Given there should be a workaround (remove the failed machine from Juju before releasing it from MAAS), and given that MAAS doesn't really give us an appropriate handle or notification to know that we should be removing the instance ID ourselves, I'm not sure if we can cleanly fix this.
Especially once you extend this to multiple controllers interacting with the same instance ids.

The best I could think of is that if we could tell that a machine had transitioned to a state we didn't ask for, we would assume it was no longer under our control, but that feels like it would ultimately be brittle.

Changed in juju:
importance: Undecided → Medium
status: New → Incomplete
Revision history for this message
Frode Nordahl (fnordahl) wrote :

Multiple Juju controllers talking to the same MaaS was not within the scope of the original report; that would indeed open a different and broader set of problems surrounding the same matter, not necessarily to be solved here. When referring to multiple models I meant multiple models on the same controller.

To illustrate how this problem might manifest itself I will tell a short story.

Let us say that we have an ideal organization, "Fair Clouds". At "Fair Clouds" there are three rather nice people who like Juju and MaaS very much: Alice, Bob, and Charlie.

Alice's passion is developing mobile applications that deliver spot-on content at blazing speed to a large number of followers. To be able to do that she needs a very effective and elastic infrastructure, and she of course leverages Juju and MaaS to do that in the Juju model called "mobile-app" on Juju controller "fair-clouds".

Bob's passion is billing and accounting for the revenue of Alice's advertising customers. Being an executive of the times, he solves a large part of this task with automation, and he too has infrastructure requirements, represented by the Juju model called "show-me-the-money" on the Juju controller "fair-clouds".

Charlie's passion is running a data center, digging out great deals on power, cooling, racks and servers, and making everything run smooth as clockwork. His responsibility is to have, at all times, enough metal ready and in good shape to be consumed by the needs of Alice and Bob.

One day, just before lunch, Alice needs to scale up one of her applications by adding some units. She knows that this will be no problem for her infrastructure to handle so she issues the commands to Juju and runs out to catch her lunch appointment.

Bob's routine is different: he always fills his time before, and well into, lunch with meetings. After a short lunch at his desk he needs to meet all the promises and expectations he has made during the first part of the day. He needs to complete these tasks quickly, and scales up one of his applications with Juju to handle the load.

In the meantime, after Alice left and before Bob finished his lunch, Charlie received an alarm telling him that there was an issue with one of his metal servers and that it, for unknown reasons, had entered the FAILED_DEPLOYMENT state. He quickly moved to diagnose the issue, resolved it, and proudly released the server for re-use and committed service in record time.

Alice's lunch took a long time. When she came back she found that one of the units she had attempted to add earlier in the day had not been successful, so she moved on to clean this up and try again.

The moment she did that, Bob saw that his newly spun up applications stopped responding.

In my mind the simplest solution is to prevent the duplicate from ever entering the database. Is it not possible to solve this by not allowing insertion of two identical instance IDs into Juju's database?

Changed in juju:
status: Incomplete → New
Revision history for this message
Anastasia (anastasia-macmood) wrote :

We have had similar issues on AWS where we were not cleanly removing instances in the stopped state. We have made some improvements in this area under the banner of observability. The same scenario works fine now: stopped instances are removed.

It looks like we need to do a similar polish on the MAAS provider.

Changed in juju:
status: New → Triaged
importance: Medium → High
summary: - Juju does not have unique constraint for MaaS provider instance IDs
+ Remove MAAS machines in Failed_Deployment state
Changed in juju:
milestone: none → 2.2-beta1
milestone: 2.2-beta1 → 2.3.0
Felipe Reyes (freyes)
tags: added: maas-provider
Revision history for this message
Felipe Reyes (freyes) wrote :

We found another environment that had several duplicated machines (one in failed deployment and one deployed OK). The workaround we used is below (a condensed sketch follows the steps):

0) Switch to the model where there are duplicated machines
  $ juju switch SOME_MODEL
1) Identify the number(s) of the machine(s) to be *removed* (e.g. 33)
2) Set the harvest mode to none
  $ juju model-config provisioner-harvest-mode=none
3) Remove the machine previously identified
  $ juju remove-machine --force MACHINE_NUMBER
4) Monitor the progress in juju status and in the MAAS web UI
  - juju status should stop displaying the machine that failed to deploy in MAAS
  - the MAAS web UI should *not* show any change in the system; at this level Juju shouldn't be making any changes
  - wait a few minutes before proceeding to the next step to make sure any background task has completed
5) Repeat steps 3 and 4 for each duplicated machine marked as "failed deployment"
6) Restore the default value of the harvest mode
  $ juju model-config --reset provisioner-harvest-mode
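
The same steps condensed into a shell sketch (machine numbers are placeholders; the sleep is just a crude stand-in for watching juju status and the MAAS UI settle between removals):

  # Keep Juju from releasing/wiping MAAS nodes while we clean up.
  $ juju switch SOME_MODEL
  $ juju model-config provisioner-harvest-mode=none

  # Remove each duplicated machine stuck in "failed deployment".
  $ for m in 33 34; do juju remove-machine --force "$m"; sleep 300; done

  # Restore the default harvest mode once everything has settled.
  $ juju model-config --reset provisioner-harvest-mode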

By luck we found this during a live session and we asked the operator to NOT remove the machines in the failed state; an uninformed user could naively run "juju remove-machine --force X" and end up losing data. Considering there is a risk of losing data, I think this bug should be marked as "Critical".

Revision history for this message
Ian Booth (wallyworld) wrote :

A root cause here appears to be that MAAS initially returns success for a start node API call. The node may go to "failed deployment" after success has already been reported to Juju. This has several implications, since Juju will go ahead and try to use that machine: assign units to it, etc. Juju would need to change to poll the machine status after a start call and delay returning until the machine is reported as fully operational, but as observed with OpenStack, this has a detrimental effect on deployment times (and even more so for MAAS).

What we could look to do is change machine removal in Juju so that if a machine is marked as Down (i.e. it was never recorded as starting up), Juju does not attempt to ask MAAS to stop the instance unless the user forces it.

We could disallow duplicate instance ids in a given controller, but this just masks the issue. There's nothing stopping 2 separate controllers running against a MAAS from experiencing the same issue.

Revision history for this message
Ian Booth (wallyworld) wrote :

Investigating some more, Juju will mark a machine Down if the agent cannot connect for some reason. There's no 100% resilient way to determine if a machine is Down because the agent never started due to a provisioning error like the one described in this bug.

We really need MAAS to grow the capability to inform Juju if the original instance which was anticipated to satisfy the node provisioning request has failed and a new one has been chosen. This could involve a token being returned by start instance instead of an instance id - the token would be used to subsequently query MAAS for the status of the provisioning operation. I believe this might be on the MAAS roadmap for a future release.

We can tag cloud instances with the Juju machine id (we already tag with the controller and model UUID). Then, when a Juju remove-machine operation occurs, if the cloud instance tag did not match the current Juju machine id, the instance would not be stopped. However, the internal Juju APIs would need far-reaching changes to allow this, so it is not feasible for a point release. We'll still look to add this new tag as it's generally useful to be able to correlate a cloud instance back to its Juju representation. Note that some providers like AWS already add an extra tag similar to this.

What we'll do for this release is add an extra flag to juju remove-machine. The "keep-instance" flag will cause the machine to be removed from Juju's model, but the cloud instance will not be stopped. This achieves a similar effect to the workaround in comment #5, but is easier to use, and I think it may be generally useful. Looking at a combination of juju status output plus the new tag mentioned above should make it easy enough to determine whether the --keep-instance flag should be used in a given scenario.
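
For example, cleaning up the duplicate machine 87 from the bug description would then look something like this (sketch only, once the flag is available):

  # Remove machine 87 from the model but leave the MAAS node untouched.
  $ juju remove-machine --keep-instance 87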

Revision history for this message
Ian Booth (wallyworld) wrote :

Marking this as Fix Committed for 2.2.3 since the above PR gives a workaround to destroy a bad machine while leaving it running. Any work to change the MAAS/Juju interaction is more substantial and would need to be done in a future release.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

Thank you for your work on this Ian!

Tim Penhey (thumper)
Changed in juju:
milestone: 2.3.0 → 2.3-beta2
status: Triaged → Fix Released
assignee: nobody → Ian Booth (wallyworld)
Michał Ajduk (majduk)
tags: added: 4010