Deployer fails because juju thinks it is upgrading

Bug #1460171 reported by Curtis Hovey
This bug affects 1 person
Affects            Status        Importance  Assigned to   Milestone
juju-ci-tools      Fix Released  Critical    Curtis Hovey
juju-core          Fix Released  High        Ian Booth
python-jujuclient  Fix Released  High        Ian Booth

Bug Description

The maas deployer job is failing. Deployer is blocked/disconnected because juju says it is upgrading.
    http://reports.vapour.ws/releases/2709/job/maas-1_7-deployer/attempt/397

Juju cannot be upgrading, or at least, it is not possible to upgrade in this case because the version in test is 1.24-beta6, which is the newest version in the test streams. We see a download of
    https://swift.canonistack.canonical.com/v1/AUTH_526ad877f3e3464589dc1145dfeaac60/juju-dist/testing/tools/releases/juju-1.24-beta6-trusty-amd64.tgz
and there is no greater version in the streams that were created, as confirmed in the log output.

This regression may be caused by...which would be ironic
Commit 2b71c0d Merge pull request #2441 from wallyworld/tools-upgrade-before-api …

Revision history for this message
Ian Booth (wallyworld) wrote :

The term "upgrade" may mean one of two things:

1. Juju running upgrade steps to upgrade an older environment
2. Juju upgrading agent tools

The message in this bug refers to item 1 but, with the change in behaviour of Juju bootstrap, it is now poorly worded.
What happens now is:

1. Juju bootstraps and starts the machine agent on the bootstrap node
2. the machine agent delays activating the full api until:
  a. any upgrade steps are run
  b. it has determined that no agent upgrades are needed <-- this is new
3. once all upgrade (agent or steps) related tasks are finished, the full api is enabled

So if a deploy is attempted before the full api is enabled, the "upgrade in progress" error is returned.
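
The gating described above can be sketched roughly as follows. This is only an illustration in Python, not Juju's actual implementation; the class, attribute, and call names are hypothetical, and the assumption that a status call is part of the limited API is inferred from the discussion of juju status in this report.

```python
class UpgradeInProgressError(Exception):
    """Raised when a call is rejected because the full API is not enabled yet."""


class GatedAPI:
    # Hypothetical set of calls allowed while the API is still restricted.
    LIMITED_CALLS = frozenset({"Status"})

    def __init__(self):
        # Both conditions must hold before the full API is enabled.
        self.upgrade_steps_done = False
        self.agent_upgrade_check_done = False

    @property
    def full_api_enabled(self):
        return self.upgrade_steps_done and self.agent_upgrade_check_done

    def call(self, name):
        # Reject work up front instead of accepting it and disconnecting later.
        if not self.full_api_enabled and name not in self.LIMITED_CALLS:
            raise UpgradeInProgressError("upgrade in progress")
        return f"{name}: ok"


api = GatedAPI()
try:
    api.call("Deploy")
except UpgradeInProgressError as err:
    print("rejected:", err)

api.upgrade_steps_done = True
api.agent_upgrade_check_done = True
print(api.call("Deploy"))
```

A client that sees the rejection can simply try the same call again later, which is why an explicit error is more useful than a dropped connection.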

Before the above change, the deployer would connect immediately after bootstrap and if an implicit upgrade were done, the deployer would be disconnected part way through its deployment process.

Now what happens is more correct - any attempt to do work with Juju while the state server is not ready is rejected up front, rather than accepting a connection and then disconnecting.

The same response in this bug would happen if the user typed fast and did a deploy immediately after bootstrap - they would be told to try again in a short time.

Ideally here the deployer would "do the right thing" and retry.

Ian Booth (wallyworld) wrote :

I question whether this is a regression - the same "error" or behaviour would always have been possible/likely if the user did:

juju upgrade-juju && juju-deploy

The deployer would receive the message about an upgrade being in progress.

Without the change in behaviour

juju bootstrap && juju-deploy

also failed, but in a way that is less obvious and useful - an unexpected disconnect.

Now the state server doesn't even accept the deploy request in the first place until it is ready to act on it, and instead returns an error which the deployer can retry on.

Ian Booth (wallyworld) wrote :

A solution would be to delay availability of the API, limited or otherwise, until after the agent upgrade check. But this would have 2 issues:

1. juju status would take slightly longer to start working
2. it would still not solve the agent upgrade error that the deployer gets when an agent upgrade is running

The limited API would still be available while the upgrade steps are running, but not until after the agent upgrade check completes.

Ian Booth (wallyworld) wrote :

Interestingly, it doesn't always fail. On AWS:

juju bootstrap --upload-tools && juju --debug deployer --deploy-delay 10 --config ~/landscape-scalable.yaml

works fine.

Another option is to teach the juju-deployer how to retry on upgrade errors. This would be much easier to implement.

Ian Booth (wallyworld) wrote :

The python-jujuclient code has been modified so that if any RPC call results in an "upgrade in progress" error, then the call will be retried.

This change improves the robustness of the deployer overall.

https://code.launchpad.net/~wallyworld/python-jujuclient/retry-on-upgrade/+merge/260658
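
For illustration, retrying a call on an "upgrade in progress" error could look roughly like the sketch below. This is not the code from the merge proposal above; the exception class, the function name, and the retry parameters are all hypothetical stand-ins.

```python
import time


class UpgradeInProgress(Exception):
    """Hypothetical stand-in for an "upgrade in progress" RPC error."""


def call_with_retry(rpc, attempts=10, delay=1.0, sleep=time.sleep):
    """Call rpc() and retry while it reports that an upgrade is in progress."""
    for attempt in range(1, attempts + 1):
        try:
            return rpc()
        except UpgradeInProgress:
            if attempt == attempts:
                raise  # give up and surface the error to the caller
            sleep(delay)


# A fake RPC that is rejected twice before the server is ready.
responses = iter([UpgradeInProgress(), UpgradeInProgress(), {"Response": "deployed"}])


def fake_rpc():
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


print(call_with_retry(fake_rpc, delay=0))
```

Bounding the number of attempts keeps a genuinely stuck upgrade from hanging the client forever; any other error is passed through untouched.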

Changed in juju-core:
assignee: nobody → Ian Booth (wallyworld)
status: Triaged → In Progress
Changed in python-jujuclient:
assignee: nobody → Ian Booth (wallyworld)
status: New → In Progress
David Britton (dpb)
Changed in python-jujuclient:
status: In Progress → Fix Committed
David Britton (dpb)
Changed in python-jujuclient:
importance: Undecided → High
Curtis Hovey (sinzui) wrote :

I changed the deployer test to call client.wait_for_started() after bootstrap. We can see that 17 seconds can pass before the bootstrapped state-server is ready.
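
As a rough illustration of this kind of wait (not the actual juju-ci-tools implementation of wait_for_started), a polling loop with a timeout might look like the following. The function name, the get_agent_states callable, and the "started" state string are assumptions made for the sketch.

```python
import time


def wait_until_started(get_agent_states, timeout=60.0, interval=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll until every agent reports "started", or raise after the timeout.

    get_agent_states is a hypothetical callable returning a dict that maps
    an agent name to its current state string.
    """
    deadline = clock() + timeout
    while True:
        states = get_agent_states()
        if states and all(state == "started" for state in states.values()):
            return states
        if clock() >= deadline:
            raise TimeoutError(f"agents not started after {timeout}s: {states}")
        sleep(interval)


# Fake status source: the agent is pending on the first poll, started on the second.
snapshots = iter([{"0": "pending"}, {"0": "started"}])
print(wait_until_started(lambda: next(snapshots), interval=0))
```

The timeout should comfortably exceed the delay observed in practice, such as the 17 seconds mentioned above.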

Changed in juju-ci-tools:
assignee: nobody → Curtis Hovey (sinzui)
importance: Undecided → Critical
status: New → Fix Released
Curtis Hovey (sinzui)
Changed in juju-core:
importance: Critical → High
Ian Booth (wallyworld) wrote :

I added code to juju bootstrap to delay the exit of the bootstrap command until the API is fully available. This will also alleviate the problem, without the need for a delay in deployer.

Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
tags: added: tech-debt
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
Curtis Hovey (sinzui)
Changed in python-jujuclient:
status: Fix Committed → Fix Released