juju-check-wait waits forever if juju doesn't reach stable state

Bug #1694745 reported by Daniel Manrique
This bug affects 2 people
Affects: Mojo: Continuous Delivery for Juju
Status: Triaged
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

Several times we've observed that if a charm gets "stuck" in an intermediate state (e.g. allocating/waiting for machine), a spec using juju-check-wait will wait forever; we've seen waits of 6-10 hours before manually killing things. This happens even though the juju-check-wait phase should have a default timeout of 30 minutes.

Example juju output of a "stuck" run:

https://pastebin.canonical.com/189657/ (Apologies for the private link).

The stuck applications are listed below. These are subordinates which ran into trouble, but we've also seen this with an ordinary charm that was stuck waiting to contact an NTP server or something similar:

  telegraf/2 waiting allocating 10.25.61.190 waiting for machine
  telegraf/5 waiting allocating 10.25.61.192 waiting for machine
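
When triaging a run like this, the units that keep the environment from settling can be picked out of the status text. The following is a minimal sketch, assuming each line follows the layout shown above (unit name, workload status, agent status, address, message). The function name `stuck_units` and the choice of "idle" as the only settled agent state are assumptions made for illustration; neither is part of mojo or juju.

```python
# Hypothetical triage helper; not part of mojo or juju.
# Assumption: an agent status of "idle" means the unit has settled.
SETTLED_AGENT_STATES = {"idle"}


def stuck_units(status_text):
    """Return (unit, workload, agent) tuples for units that have not settled."""
    stuck = []
    for line in status_text.splitlines():
        fields = line.split()
        # Skip blank lines and anything that does not start with "app/N".
        if len(fields) < 3 or "/" not in fields[0]:
            continue
        unit, workload, agent = fields[0], fields[1], fields[2]
        if agent not in SETTLED_AGENT_STATES:
            stuck.append((unit, workload, agent))
    return stuck


sample = """\
  telegraf/2 waiting allocating 10.25.61.190 waiting for machine
  telegraf/5 waiting allocating 10.25.61.192 waiting for machine
"""

for unit, workload, agent in stuck_units(sample):
    print(f"{unit}: workload={workload} agent={agent}")
```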

The spec in question does something like:

deploy config=go-telegraf/services target=go-telegraf delay=0
juju-check-wait
script config=go-telegraf/add-relations
juju-check-wait

The run log for the spec shows the following (some possibly sensitive information obfuscated):

2017-05-31 08:35:42 [INFO] deployer.import: Deploying applications...
2017-05-31 08:35:43 [INFO] deployer.import: Deploying application telegraf using /some/charm/telegraf
2017-05-31 08:35:51 [DEBUG] deployer.import: Adding units...
2017-05-31 08:35:51 [WARNING] deployer.import: Config specifies num units for subordinate: telegraf
2017-05-31 08:35:51 [DEBUG] deployer.import: Waiting for units before adding relations
2017-05-31 08:35:51 [DEBUG] deployer.env: Connecting to my-juju-controller...
2017-05-31 08:35:52 [DEBUG] deployer.env: Connected.
2017-05-31 08:35:52 [INFO] deployer.import: Adding relations...
2017-05-31 08:35:52 [INFO] deployer.cli: Deployment complete in 10.52 seconds
2017-05-31 08:35:52 [INFO] Checking Juju status (timeout=1800)
2017-05-31 08:35:56 [INFO] Running script go-telegraf/add-relations
2017-05-31 08:36:01 [INFO] Adding relation for telegraf with the following services:
my-app-lb
my-cache-lb
my-rabbitmq
my-dp-fe
my-app
my-memcached
my-prometheus
my-cache

2017-05-31 08:36:01 [INFO] Checking Juju status
2017-05-31 08:36:05 [INFO] Waiting for environment to reach steady state

The spec has been in the above state for over 6 hours.

This has also been observed with a script that runs "mojo juju-check-wait" at the end, so the problem appears to be in the way mojo handles the juju-check-wait phase.
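
For reference, the expected behaviour (give up once the timeout has elapsed) can be sketched as a polling loop that re-checks a single deadline on every iteration. This is an illustration only, not mojo's actual implementation, and it does not claim to locate the bug. The names `wait_for_steady_state`, `SteadyStateTimeout`, and `is_steady` are hypothetical; the 1800-second default mirrors the `timeout=1800` seen in the run log.

```python
import time


class SteadyStateTimeout(Exception):
    """Raised when the environment does not settle before the deadline."""


def wait_for_steady_state(is_steady, timeout=1800, poll_interval=5,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll is_steady() until it returns True or `timeout` seconds pass.

    `is_steady` is a caller-supplied callable (hypothetical here) that
    inspects the environment and returns True once it has settled.
    """
    # A monotonic clock is unaffected by wall-clock adjustments.
    deadline = clock() + timeout
    while True:
        if is_steady():
            return
        remaining = deadline - clock()
        if remaining <= 0:
            raise SteadyStateTimeout(
                f"environment not steady after {timeout} seconds")
        # Never sleep past the deadline.
        sleep(min(poll_interval, remaining))
```

The point of the design is that the deadline is computed once and covers the whole wait, so a unit that never leaves an intermediate state ends in an error rather than an open-ended wait.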

Daniel Manrique (roadmr) wrote :

Just observed this with a juju2 environment on which agents were in a "lost" state.

Running mojo juju-check-wait at 2017-05-31 09:46:06.011687
mojo juju-check-wait completed successfully at 2017-05-31 18:53:02.976279

The check completed after I went in and kicked all the agents, but the wait time was over 9 hours and I guarantee we have no timeout=36000 anywhere in our spec :)
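
Until the underlying problem is fixed, a script that calls "mojo juju-check-wait" at the end could enforce its own limit from the outside. The sketch below is one possible workaround, assuming the step is run as a child process; `run_with_deadline` is a hypothetical helper name, and the demo uses a sleeping Python process as a stand-in for the real command.

```python
import subprocess
import sys


def run_with_deadline(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it was killed at the deadline."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # if it has not finished within timeout_s seconds.
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None


if __name__ == "__main__":
    # Stand-in for a stuck step. A real script might instead call
    # run_with_deadline(["mojo", "juju-check-wait"], 1800).
    stuck = [sys.executable, "-c", "import time; time.sleep(5)"]
    if run_with_deadline(stuck, 0.5) is None:
        print("step exceeded its deadline and was killed")
```

Note that this only terminates the direct child process; anything that child has spawned may need separate cleanup.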

Junien Fridrick (axino)
Changed in mojo:
status: New → Triaged
importance: Undecided → Medium