retry-provisioning doesn't retry failed deployments on MAAS

Bug #1645422 reported by Adam Collard
This bug affects 20 people
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned
Milestone: none

Bug Description

Using MAAS 2.1.2 (bzr 5555) and Juju 2.0.1:

I tried deploying 6 units of Ubuntu, each with an LXD container also running Ubuntu. Two of the machines failed to deploy (because of bug 1635560, but the cause is unimportant here; just note that it is transient). When I tried retry-provisioning, nothing happened.

⟫ juju status
Model Controller Cloud/Region Version
default hare hare 2.0.1

App Version Status Scale Charm Store Rev OS Notes
ubuntu 16.04 waiting 8/12 ubuntu jujucharms 8 ubuntu

Unit Workload Agent Machine Public address Ports Message
ubuntu/0 active idle 0 10.2.0.54 ready
ubuntu/1* active idle 1 10.2.0.55 ready
ubuntu/2 active idle 2 10.2.0.56 ready
ubuntu/3 active idle 3 10.2.0.57 ready
ubuntu/4 waiting allocating 4 10.2.0.52 waiting for machine
ubuntu/5 waiting allocating 5 10.2.0.53 waiting for machine
ubuntu/6 active idle 0/lxd/0 10.2.0.61 ready
ubuntu/7 active idle 1/lxd/0 10.2.0.58 ready
ubuntu/8 active idle 2/lxd/0 10.2.0.60 ready
ubuntu/9 active idle 3/lxd/0 10.2.0.59 ready
ubuntu/10 waiting allocating 4/lxd/0 waiting for machine
ubuntu/11 waiting allocating 5/lxd/0 waiting for machine

Machine State DNS Inst id Series AZ
0 started 10.2.0.54 4y3hbp xenial Raphael
0/lxd/0 started 10.2.0.61 juju-d0b4d0-0-lxd-0 xenial
1 started 10.2.0.55 4y3hbq xenial default
1/lxd/0 started 10.2.0.58 juju-d0b4d0-1-lxd-0 xenial
2 started 10.2.0.56 abnf8x xenial Raphael
2/lxd/0 started 10.2.0.60 juju-d0b4d0-2-lxd-0 xenial
3 started 10.2.0.57 x7nfeg xenial default
3/lxd/0 started 10.2.0.59 juju-d0b4d0-3-lxd-0 xenial
4 down 10.2.0.52 4y3h7x xenial Raphael
4/lxd/0 pending pending xenial
5 down 10.2.0.53 4y3h7y xenial default
5/lxd/0 pending pending xenial

⟫ juju retry-provisioning 5 --debug
18:07:46 INFO juju.cmd supercommand.go:63 running juju [2.0.1 gc go1.6.2]
18:07:46 DEBUG juju.cmd supercommand.go:64 args: []string{"juju", "retry-provisioning", "5", "--debug"}
18:07:46 INFO juju.juju api.go:72 connecting to API addresses: [10.2.0.51:17070]
18:07:46 INFO juju.api apiclient.go:530 dialing "wss://10.2.0.51:17070/model/5a113b53-5bf4-42cd-8d8f-4dd933d0b4d0/api"
18:07:47 INFO juju.api apiclient.go:466 connection established to "wss://10.2.0.51:17070/model/5a113b53-5bf4-42cd-8d8f-4dd933d0b4d0/api"
18:07:47 DEBUG juju.juju api.go:263 API hostnames unchanged - not resolving
18:07:47 INFO cmd supercommand.go:465 command finished

Changed in juju:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 2.1.0
Revision history for this message
Curtis Hovey (sinzui) wrote :

Why can't Juju retry provisioning automatically? It knows about many cases where provisioning has failed. Juju is retrying hooks automatically now; users rarely need to retry by hand.

Curtis Hovey (sinzui)
tags: added: maas-provider retry-privisioning
Changed in juju:
importance: Critical → High
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Removing 2.1 milestone as we will not be addressing this issue in 2.1.

tags: added: retry-provisioning
removed: retry-privisioning
Changed in juju:
milestone: 2.1-rc2 → none
Revision history for this message
Sandor Zeestraten (szeestraten) wrote :

I hit this today on Juju 2.1.1 and MAAS 2.1.3.
retry-provisioning does nothing and the machine is just down/pending.

Revision history for this message
John A Meinel (jameinel) wrote :

I believe the underlying issue is that MAAS has handed us an 'instance-id', which means that we think we have a concrete instance that is running. That is different from failing to get an instance at all. It's possible retry-provisioning handles the latter but not the former.
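
For illustration only: one way to see that Juju already considers the machine "provisioned" is that it holds a concrete MAAS system ID even while the machine is down. Using machine 5 / system ID 4y3h7y from the status output above, and assuming a MAAS CLI profile named "admin":

⟫ juju show-machine 5              # still reports the instance-id (4y3h7y) for the failed machine
⟫ maas admin machine read 4y3h7y   # inspect the same node on the MAAS side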

Revision history for this message
John A Meinel (jameinel) wrote :

I should also note that MAAS doesn't hand back 'an instance for the request you made'; it always hands back the exact identifier of a specific machine. So we have to be a bit careful that 'retry-provisioning' properly decommissions the existing instance ID, and can cope with being given back exactly the same instance ID a second time, this time meaning that it is being tried again.

tags: added: 4010
tags: added: cdo-qa foundation-engine
tags: added: foundations-engine
removed: foundation-engine
tags: removed: foundations-engine
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

The same happens for tags updated after 'juju deploy'.

In my view, retry-provisioning should re-query machine metadata when told to do so. It is a manual action, and you presumably know what you are doing.

Instead, one has to remove-machine --force and add-unit again.
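
For reference, the current manual workaround looks roughly like this (machine 5 and the ubuntu application are taken from the status output above and are only illustrative):

⟫ juju remove-machine 5 --force
⟫ juju add-unit ubuntu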

tags: added: cpe-onsite
Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1645422] Re: retry-provisioning doesn't retry failed deployments on MAAS

FWIW, I think the internal issue is that MAAS has already given us an instance-id, so we think the machine is provisioned. Normally, for providers, 'juju retry-provisioning' probably does do some of what you want, but only when an instance hasn't yet been assigned.

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

I think we just need a better definition of what it means to "provision".

Conceptually, I would use the following definition:

provisioning = <matching a machine by constraints & other criteria> + <successfully deploying once and installing a machine agent>

At least for MAAS it is intuitive in my view.

If I have to reconfigure a machine, doing retry-provisioning also makes sense but with the following logic:

1. Juju gets a machine ID;
2. the deployment fails, either automatically or via a manual action, before the machine/unit agents have started;
3. the user releases the machine in MAAS;
4. the user reconfigures the machine, swaps out hardware, etc.;
5. a manual retry-provisioning detects that the given ID is no longer allocated and tries to allocate a new one.

The target idea here is that one could write an orchestrator/automation that talks to Juju, sees that a deployment has failed, checks MAAS to determine whether the failure is recoverable, and retries provisioning without affecting the Juju model unit-wise or application-wise.

If a node is not suitable, the orchestrator would mark it as broken in MAAS and a different node would be picked, without needing the remove-machine --force && add-unit steps.
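
As a very rough sketch of that orchestrator idea (the JSON field names, the MAAS profile name "admin", and the machine/system IDs here are assumptions for illustration, not a tested recipe):

# find machines whose agent never came up
⟫ juju status --format=json | jq -r '.machines | to_entries[] | select(.value["juju-status"].current != "started") | .key'
# decide on the MAAS side whether the node is recoverable or should be marked broken
⟫ maas admin machine read 4y3h7y
# if recoverable, release it and ask Juju to provision that same machine slot again
⟫ maas admin machine release 4y3h7y
⟫ juju retry-provisioning 5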

tags: added: canonical-bootstack
Revision history for this message
Frode Nordahl (fnordahl) wrote :

This is still an issue with Juju 2.7.8 and MAAS 2.8.2.

My occurrence is a MAAS deployment that failed transiently because of *reasons*, and I want Juju to retry so that I can get a working machine.

I see from the bug discussion history that there is some disagreement about what retry-provisioning means or does. I'll add to the scale: I expected it to mean that Juju could re-use the machine slot it has in its model and either fill it with a new instance, or reach out to MAAS and do a release+deploy dance with the instance ID it already has.

Right now nothing happens and there is zero feedback to the user.
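
For concreteness, the "release+deploy dance" above is roughly what one can already do by hand against MAAS (the profile name "admin", the system ID, and the series are illustrative; the missing piece is Juju driving this itself and re-attaching its machine agent afterwards):

⟫ maas admin machine release 4y3h7y
⟫ maas admin machine deploy 4y3h7y distro_series=xenial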

Frode Nordahl (fnordahl)
tags: added: ps5
Revision history for this message
Frode Nordahl (fnordahl) wrote :

Typo in comment #9: the Juju version is 2.8.7.

Revision history for this message
Pen Gale (pengale) wrote :

Moving importance to Medium to accurately reflect that this is a legitimate issue, but one that is not in scope for the current roadmap.

(I agree that it would be very nice to fix.)

Changed in juju:
importance: High → Medium
Revision history for this message
Boris Lukashev (rageltman) wrote :

This is a legitimate issue for us as well (currently contributing to a descent into madness). Without it, failed nodes get re-numbered, and targeted placements of units aiming for those nodes get wonky despite the --map-machines flag on iterative overlays (up to 5 here, for OpenStack HA with Vault and a number of other things).
The problem also exists with LXD units: Juju has no way to retry those, and that is entirely within its scope of control.

Revision history for this message
Simon Déziel (sdeziel) wrote :

I can confirm this with juju 2.9.11 interacting with MAAS 3.1.0~alpha1.

Simon Déziel (sdeziel)
tags: added: lxd-cloud
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This Medium-priority bug has not been updated in 60 days, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Medium → Low
tags: added: expirebugs-bot