Remove MAAS machines in Failed_Deployment state

Bug #1671588 reported by Frode Nordahl
Affects         Status        Importance  Assigned to  Milestone
Canonical Juju  Fix Released  High        Ian Booth
2.2             Fix Released  High        Ian Booth

Bug Description

Juju version: 2.0.2
MaaS version: 2.1

Juju does not have a unique constraint for MaaS provider instance IDs. In some circumstances this leads Juju to add multiple machines to a model that reference the same physical machine when using the MaaS provider.

This can have severe consequences: a remove-machine operation that appears benign to the operator could have unwanted and unexpected side effects.

Excerpt from juju status:
Model Controller Cloud/Region Version
openstack maas-controller maas 2.0.2

App Version Status Scale Charm Store Rev OS
ceph-osd 10.2.5 waiting 17/25 ceph-osd jujucharms 239 ubuntu
ntp 4.2.8p4+dfsg active 17/18 ntp jujucharms 16 ubuntu
rsyslog-forwarder-ha unknown 17/18 rsyslog-forwarder-ha jujucharms 7 ubuntu

Unit Workload Agent Machine Public address Ports Message
ceph-osd/10 waiting allocating 87 a.b.c.d waiting for machine
ceph-osd/14 active idle 105 a.b.c.d Unit is ready (11 OSD)
  ntp/102 active idle a.b.c.d Unit is ready
  rsyslog-forwarder-ha unknown idle a.b.c.d

Machine State DNS Inst id Series AZ
87 down a.b.c.d df8mwd xenial default
105 started a.b.c.d df8mwd xenial default

As you can imagine, if an operator removes machine 87 from the model to clean up, Juju would tell MaaS to release the very same machine that hosts machine 105 in the model, wiping out its payload.
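
A quick way to spot such duplicates is to group machines by instance ID in the status output. This is only a sketch: it assumes the Juju 2.x "juju status --format=json" layout and that jq is installed.

  # Print any instance ID that is referenced by more than one Juju machine.
  $ juju status --format=json \
      | jq -r '.machines | to_entries[] | "\(.value["instance-id"]) \(.key)"' \
      | sort \
      | awk '{ m[$1] = m[$1] " " $2; c[$1]++ } END { for (id in c) if (c[id] > 1) print id ":" m[id] }'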

The model ended up in this state because of the following chain of events.
1) Call to juju add-unit ceph-osd
2) Juju requests a new machine from MaaS, but the machine ends up in the FAILED_DEPLOYMENT state
3) User releases the machine from MaaS; the machine, now in the down state, is not removed from the Juju model
4) Call to juju add-unit ceph-osd
5) Juju requests a new machine from MaaS and gets the same machine; this time deployment succeeds

This issue can manifest itself in other ways as well: what happens if the duplicate is in another model? Juju status won't advertise the duplicate, so you could easily terminate someone else's machine.

Frode Nordahl (fnordahl)
tags: added: sts
Revision history for this message
John A Meinel (jameinel) wrote :

Manually destroying machines under Juju is always going to be a bit problematic. Ideally we wouldn't have an instance id associated with a machine that failed to deploy.
If you didn't manually touch MAAS, could you "juju remove-machine 87" before freeing the machine on MAAS? You might need --force to indicate that it is ok to remove it even though we can't talk to its agent.

Conceptually it would be nice if we could notice that the link to the machine we asked for was broken.

However, if you consider multiple users and multiple Juju controllers, there is nothing that Juju itself can do to see that the same machine was provisioned for another instance.

One argument was that MAAS IDs should be per request rather than giving an identical handle out across multiple provision + release + acquire steps.

It might be possible for Juju to notice that a machine whose deployment failed then, out of band, moved out of its last known state. At best that feels racy, because the machine may transition states faster than we can notice. (It may go back into FAILED_DEPLOYMENT a second time, and you end up with 2 machines that reference the same underlying machine, with neither knowing about the other.)

Is it that much of a problem to "remove-machine" before you release the machine in MAAS directly?
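
As a concrete sketch of that ordering (machine 87 from the report; the MAAS CLI profile name and node system ID are placeholders, and the release command assumes the MAAS 2.x CLI):

  # Remove the failed machine from the Juju model first; --force is needed
  # because its agent never came up, so Juju can't reach it.
  $ juju remove-machine --force 87

  # Only after the model no longer references the node, release it in MAAS.
  $ maas <profile> machine release <system-id>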

Revision history for this message
John A Meinel (jameinel) wrote :

Given there should be a workaround (remove the failed machine from Juju before releasing it from MAAS), and given that MAAS doesn't really give us an appropriate handle or notification to know that we should be removing the instance ID ourselves, I'm not sure if we can cleanly fix this.
Especially once you extend this to multiple controllers interacting with the same instance ids.

The best I could think of is that if we could tell that a machine had transitioned to a state we didn't ask for, we would assume it was no longer under our control, but that feels like it would ultimately be brittle.

Changed in juju:
importance: Undecided → Medium
status: New → Incomplete
Revision history for this message
Frode Nordahl (fnordahl) wrote :

Multiple Juju controllers talking to the same MaaS was not within the scope of the original report; that would indeed open a different and broader set of problems surrounding the same matter, not necessarily to be solved here. When referring to multiple models I meant multiple models on the same controller.

To illustrate how this problem might manifest itself I will tell a short story.

Let us say that we have an ideal organization, "Fair Clouds". At "Fair Clouds" there are three rather nice people who like Juju and MaaS very much: Alice, Bob, and Charlie.

Alice's passion is developing mobile applications that deliver spot-on content at blazing speed to a large number of followers. To be able to do that she needs a very effective and elastic infrastructure, and she of course leverages Juju and MaaS to do that in the Juju model called "mobile-app" on Juju controller "fair-clouds".

Bob's passion is billing and accounting for the revenue of Alice's advertising customers. Being an executive of the times, he solves a large part of this task with automation, and he too has infrastructure requirements, represented by the Juju model called "show-me-the-money" on the Juju controller "fair-clouds".

Charlie's passion is running a data center, digging out great deals on power, cooling, racks and servers, and making everything run smooth as clockwork. His responsibility is to have, at all times, enough metal ready and in good shape to be consumed by the needs of Alice and Bob.

One day, just before lunch, Alice needs to scale up one of her applications by adding some units. She knows that this will be no problem for her infrastructure to handle so she issues the commands to Juju and runs out to catch her lunch appointment.

Bob's routine is different: he always fills his time before, and well into, lunch with meetings. After a short lunch at his desk he needs to meet all the promises and expectations he has made during the first part of the day. He needs to complete these tasks quickly, and scales up one of his applications with Juju to handle the load.

In the meantime, after Alice left and before Bob finished his lunch, Charlie received an alarm telling him that there was an issue with one of his metal servers and that it, for unknown reasons, had entered the FAILED_DEPLOYMENT state. He quickly moved to diagnose the issue, resolved it, and proudly released the server for re-use and committed service in record time.

Alice's lunch took a long time. When she came back she found that one of the units she had attempted to add earlier in the day had not been successful, so she moved on to clean this up and try again.

The moment she did that, Bob saw that his newly spun up applications stopped responding.

In my mind the simplest solution is to prevent the duplicate from ever entering the database. Is it not possible to solve this by not allowing insertion of two identical instance IDs into Juju's database?

Changed in juju:
status: Incomplete → New
Revision history for this message
Anastasia (anastasia-macmood) wrote :

We have had similar issues on AWS where we were not cleanly removing instances in the stopped state. We have made some improvements in this area under the banner of observability. The same scenario works fine now: stopped instances are removed.

It looks like we need to do a similar polish on the MAAS provider.

Changed in juju:
status: New → Triaged
importance: Medium → High
summary: - Juju does not have unique constraint for MaaS provider instance IDs
+ Remove MAAS machines in Failed_Deployment state
Changed in juju:
milestone: none → 2.2-beta1
milestone: 2.2-beta1 → 2.3.0
Felipe Reyes (freyes)
tags: added: maas-provider
Revision history for this message
Felipe Reyes (freyes) wrote :

We found another environment that had several duplicated machines (one in failed deployment and one deployed OK). The workaround we used is below (a condensed sketch follows the steps):

0) Switch to the model where there are duplicated machines
  $ juju switch SOME_MODEL
1) Identify the number(s) of the machine(s) to be *removed* (e.g. 33)
2) Set the harvest mode to none
  $ juju model-config provisioner-harvest-mode=none
3) Remove the machine previously identified
  $ juju remove-machine --force MACHINE_NUMBER
4) Monitor the progress in juju status and in the MAAS web UI
  - juju status should stop displaying the machine that failed to deploy in MAAS
  - the MAAS web UI should *not* show any change in the system; at this level Juju shouldn't be making any changes
  - wait a few minutes before proceeding to the next step to make sure any background task has completed
5) Repeat steps 3 and 4 for each duplicated machine marked as "failed deployment"
6) Restore the default value of the harvest mode
  $ juju model-config --reset provisioner-harvest-mode
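
The same steps condensed into a shell sketch (machine numbers are placeholders; the sleep is just a crude stand-in for watching juju status and the MAAS UI settle between removals):

  # Keep Juju from releasing/wiping MAAS nodes while we clean up.
  $ juju switch SOME_MODEL
  $ juju model-config provisioner-harvest-mode=none

  # Remove each duplicated machine stuck in "failed deployment".
  $ for m in 33 34; do juju remove-machine --force "$m"; sleep 300; done

  # Restore the default harvest mode once everything has settled.
  $ juju model-config --reset provisioner-harvest-mode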

By luck we found this during a live session and we asked the operator to NOT remove the machines in the failed state; an uninformed user could naively run "juju remove-machine --force X" and end up losing data. Considering there is a risk of losing data, I think this bug should be marked as "Critical".

Revision history for this message
Ian Booth (wallyworld) wrote :

A root cause here appears to be that MAAS initially returns success for a start node API call. The node may go to "failed deployment" after success has already been reported to Juju. This has several implications, since Juju will go ahead and try to use that machine: assign units to it, etc. Juju would need to change to poll the machine status after a start call and delay returning until the machine is reported as fully operational, but as observed with OpenStack, this has a detrimental effect on deployment times (and even more so for MAAS).

What we could look to do is change machine removal in Juju so that if a machine is marked as Down (i.e. it was never recorded as starting up), Juju does not attempt to ask MAAS to stop the instance unless the user forces it.

We could disallow duplicate instance ids in a given controller, but this just masks the issue. There's nothing stopping 2 separate controllers running against a MAAS from experiencing the same issue.

Revision history for this message
Ian Booth (wallyworld) wrote :

Investigating some more, Juju will mark a machine Down if the agent cannot connect for some reason. There's no 100% resilient way to determine if a machine is Down because the agent never started due to a provisioning error like the one described in this bug.

We really need MAAS to grow the capability to inform Juju if the original instance which was anticipated to satisfy the node provisioning request has failed and a new one has been chosen. This could involve a token being returned by start instance instead of an instance id - the token would be used to subsequently query MAAS for the status of the provisioning operation. I believe this might be on the MAAS roadmap for a future release.

We can tag cloud instances with the Juju machine id (we already tag with the controller and model UUID). Then, when a Juju remove-machine operation occurs, if the cloud instance tag did not match the current Juju machine id, the instance would not be stopped. However, the internal Juju APIs would need far-reaching changes to allow this, so it is not feasible for a point release. We'll still look to add this new tag as it's generally useful to be able to correlate a cloud instance back to its Juju representation. Note that some providers like AWS already add an extra tag similar to this.

What we'll do for this release is add an extra flag to juju remove-machine. The "keep-instance" flag will cause the machine to be removed from Juju's model, but the cloud instance will not be stopped. This achieves a similar effect to the workaround in comment #5, but is easier to use, and I think it may be generally useful. Looking at a combination of juju status output plus the new tag mentioned above should make it easy enough to determine whether the --keep-instance flag should be used in a given scenario.
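
For example, cleaning up the duplicate machine 87 from the bug description would then look something like this (sketch only, once the flag is available):

  # Remove machine 87 from the model but leave the MAAS node untouched.
  $ juju remove-machine --keep-instance 87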

Revision history for this message
Ian Booth (wallyworld) wrote :

Marking this as Fix Committed for 2.2.3 since the above PR gives a workaround to destroy a bad machine while leaving it running. Any work to change the MAAS/Juju interaction is more substantial and would need to be done in a future release.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

Thank you for your work on this Ian!

Tim Penhey (thumper)
Changed in juju:
milestone: 2.3.0 → 2.3-beta2
status: Triaged → Fix Released
assignee: nobody → Ian Booth (wallyworld)
Michał Ajduk (majduk)
tags: added: 4010