juju wait can wait forever

Bug #1496130 reported by Ryan Beisner
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Juju Wait Plugin
Fix Released
Undecided
Stuart Bishop
Mojo: Continuous Delivery for Juju
Fix Released
High
Tom Haddon

Bug Description

Starting with mojo 0.1.14, our mojo runs are eternally hanging. Last known good was @ mojo 0.1.12.

I believe this is due to juju wait logic, which I also believe to be ineffective on juju versions with leadership election.

# Passing:
00:02:04.475 Mojo verison:
00:02:04.757 0.1.12
...
00:50:16.199 2015-09-14 16:15:53 [INFO] deployer.cli: Deployment complete in 1646.93 seconds
  ^ consistently completes OK

# Failing:
00:02:37.107 Mojo verison:
00:02:37.509 0.1.14
...
2015-09-15 18:22:08 [INFO] Waiting for environment to reach steady state
 ^ stuck there for 4+hrs
 ^ consistently hangs (3 jobs in a row)

juju:
  Installed: 1.24.5-0ubuntu1~14.04.1~juju1
  Candidate: 1.24.5-0ubuntu1~14.04.1~juju1

juju-core:
  Installed: 1.24.5-0ubuntu1~14.04.1~juju1
  Candidate: 1.24.5-0ubuntu1~14.04.1~juju1

juju-deployer:
  Installed: 0.5.1-3
  Candidate: 0.5.1-3

Related branches

Revision history for this message
David Ames (thedac) wrote :

Tom,

Can we make the default for the new deployer config, wait, False? With juju >= 1.24 and leadership election juju wait does not work because leadership election is continually pinging the logs and causes the above hanging.

Also FYI, Ryan is on the front lines of testing and has found that juju wait even with juju < 1.24 and no leadership election still has race conditions.

The ultimate solution will be charms using workgroup status to declare when they are "Ready".

Changed in mojo:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Tom Haddon (mthaddon)
Revision history for this message
Stuart Bishop (stub) wrote :

I have not seen this problem with the algorithm and Juju 1.24. If you have a snippet from your logs, I'd be interested in seeing what it looks like.

I think I understand what is happening. The wait statement does two complete passes over each unit, using juju run to collect the last line of the juju log file. If the last lines have not been modified, no hooks have run on any unit and we know as best we can that the environment is stable. If not, we keep doing the check until two complete passes report no changes. If two complete passes cannot complete in 30 seconds, then whatever keeps the leadership lease alive might spam the juju logs breaking the process.

I think this is fixable by ignoring these particular log messages, which is probably just a case of adding a 'grep -v' to the juju run command.

I also have a branch that uses Juju 1.24's unit status, but I am suspicious of races. It won't help for Juju 1.23, which has leadership but not unit status.

Please support Bug #1488777, which means we can drop this hack altogether. Its been acknowledged as needed since forever, but never gets scheduled.

Revision history for this message
Stuart Bishop (stub) wrote :

@thedac - charms cannot declare themselves ready, because they have no way of knowing when if there are hooks still scheduled to run, or when the operator (in this case, mojo) has stopped issuing commands.

For others who have had trouble with juju-wait, I'd love a bug report. I have *never* seen this algorithm fail in my tests, and have never had a report of it failing. I have only ever received a single bug report, and that was from running multiple 'juju run' commands in parallel (it confuses juju). I have heard of people suspicious about it, because it is taking a long time (and turns out, correctly waiting as the system is still busy running hooks despite it not being visible to the end user). Its been used in my amulet tests for ages, and I keep recommending it whenever I see charmers tripping over wait problems. So if it doesn't work, please tell somebody. Now we need to fix this bug twice (once in mojo, once in juju-wait).

Revision history for this message
Tom Haddon (mthaddon) wrote :

We've released 0.1.14-1 to the PPA which sets wait as False by default, but would still very much like to see logs as Stuart mentions above.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Note that this spec is deploying a full OpenStack (~23 units), and there is likely more leadershipy noise than might be observed with other types of deployments.

Unfortunately, when the jenkins build is manually terminated, the log-collection piece doesn't run (I have no juju logs for the previous runs). I've re-triggered a new run, and will manually grab logs. Thank you.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

I re-ran and collected more info. The unit logs do not appear to quiesce as the juju wait plugin expects.

# Mojo version
mojo:
  Installed: 0.1.14
  Candidate: 0.1.14-1

# Started waiting @:
2015-09-16 11:29:19 [INFO] Waiting for environment to reach steady state

# Still waiting as of:
jenkins@juju-osci-machine-10:~$ date
Wed Sep 16 12:39:44 UTC 2015

# Observations:
 * juju status output appears to be normal/idle.
 * all-machines.log shows continuing leadership log activity from units.

More details @ http://paste.ubuntu.com/12426595/

Also attached machine 0's /var/log.

Revision history for this message
Stuart Bishop (stub) wrote :

There are a lot of messages here that I never see. Here is what it looks like in my local environment. In particular, there is no spam about renewing leases and lease confirmation every 30 seconds, which will certainly mess up the wait mechanism. Is this release juju 1.24, or a debug build?:

root@aargh:/var/log/juju-stub-local# grep 'leadership' *.log
all-machines.log:machine-0: 2015-09-16 09:10:49 INFO juju.state state.go:194 starting leadership manager
machine-0.log:2015-09-16 09:10:49 INFO juju.state state.go:194 starting leadership manager
unit-cassandra-0.log:2015-09-16 10:05:09 DEBUG juju.worker.leadership tracker.go:123 cassandra/0 making initial claim for cassandra leadership
unit-cassandra-0.log:2015-09-16 10:05:09 DEBUG juju.worker.leadership tracker.go:165 checking cassandra/0 for cassandra leadership
unit-cassandra-1.log:2015-09-16 10:08:50 DEBUG juju.worker.leadership tracker.go:123 cassandra/1 making initial claim for cassandra leadership
unit-cassandra-1.log:2015-09-16 10:08:50 DEBUG juju.worker.leadership tracker.go:165 checking cassandra/1 for cassandra leadership
unit-cassandra-1.log:2015-09-16 10:08:50 INFO juju.worker.leadership tracker.go:201 cassandra leadership for cassandra/1 denied
unit-cassandra-1.log:2015-09-16 10:08:50 DEBUG juju.worker.leadership tracker.go:208 cassandra/1 waiting for cassandra leadership release
unit-cassandra-2.log:2015-09-16 10:26:37 DEBUG juju.worker.leadership tracker.go:123 cassandra/2 making initial claim for cassandra leadership
unit-cassandra-2.log:2015-09-16 10:26:37 DEBUG juju.worker.leadership tracker.go:165 checking cassandra/2 for cassandra leadership
unit-cassandra-3.log:2015-09-16 10:46:55 DEBUG juju.worker.leadership tracker.go:123 cassandra/3 making initial claim for cassandra leadership
unit-cassandra-3.log:2015-09-16 10:46:55 DEBUG juju.worker.leadership tracker.go:165 checking cassandra/3 for cassandra leadership
unit-cassandra-3.log:2015-09-16 10:46:55 INFO juju.worker.leadership tracker.go:201 cassandra leadership for cassandra/3 denied
unit-cassandra-3.log:2015-09-16 10:46:55 DEBUG juju.worker.leadership tracker.go:208 cassandra/3 waiting for cassandra leadership release

Revision history for this message
Stuart Bishop (stub) wrote :

The work around should should be as simple as changing the 'tail -1 /var/log/juju/unit-{}.log' to 'grep -v juju.worker.leadership /var/log/juju/unit-{}.log | tail -1'

Revision history for this message
Ryan Beisner (1chb1n) wrote :

jenkins@juju-osci-machine-10:~$ apt-cache policy juju juju-deployer
juju:
  Installed: 1.24.5-0ubuntu1~14.04.1~juju1
  Candidate: 1.24.5-0ubuntu1~14.04.1~juju1
  Version table:
 *** 1.24.5-0ubuntu1~14.04.1~juju1 0
        500 http://ppa.launchpad.net/juju/stable/ubuntu/ trusty/main amd64 Packages
        100 /var/lib/dpkg/status
     1.22.6-0ubuntu1~14.04.1 0
        500 http://nova.clouds.archive.ubuntu.com/ubuntu/ trusty-updates/universe amd64 Packages
     1.18.1-0ubuntu1 0
        500 http://nova.clouds.archive.ubuntu.com/ubuntu/ trusty/universe amd64 Packages
juju-deployer:
  Installed: 0.5.1-3
  Candidate: 0.5.1-3
  Version table:
 *** 0.5.1-3 0
        500 http://ppa.launchpad.net/juju/stable/ubuntu/ trusty/main amd64 Packages
        100 /var/lib/dpkg/status
     0.3.6-0ubuntu2 0
        500 http://nova.clouds.archive.ubuntu.com/ubuntu/ trusty/universe amd64 Packages
jenkins@juju-osci-machine-10:~$

summary: - 0.1.14 deployments eternally hanging
+ juju wait can wait forever
Revision history for this message
Ryan Beisner (1chb1n) wrote :

FYI - It's Juju 1.24.5 from ppa:juju/stable.

Revision history for this message
Stuart Bishop (stub) wrote :

lp:~stub/mojo/devel contains an attempt to work around these messages.

Stuart Bishop (stub)
Changed in juju-wait:
status: New → Fix Released
assignee: nobody → Stuart Bishop (stub)
Revision history for this message
Stuart Bishop (stub) wrote :

The juju wait plugin now resets logging to default 'juju stable' settings for Juju 1.23 and earlier (juju set-environment logging-config=juju=WARNING;unit=INFO). This seems the only sane way of handling the older environments.

For Juju 1.24 and later environments, the new agent-status is used with some extra logic to avoid the worst race conditions. I think most race conditions are covered, such as the potential lull between dropping a leader and a new one being appointed. It is still wise to add a sleep after an 'upgrade-charm', as it can take a lengthy amount of time for a large charm to be distributed to the units and during this period the environment appears idle.

The plugin now also correctly handles subordinates. Testing the much faster new algorithm made this bug apparent.

Revision history for this message
Tom Haddon (mthaddon) wrote :

Hi Stub,

2015-09-28 17:30:45 [INFO] Waiting for environment to reach steady state
2015-09-28 17:30:45 [ERROR] Unknown error
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/mojo-0.1.14.post1-py2.7.egg/mojo/cli.py", line 668, in run_with_args
    args.func(args)
  File "/usr/local/lib/python2.7/dist-packages/mojo-0.1.14.post1-py2.7.egg/mojo/cli.py", line 409, in run_from_manifest
    manifest.run(project, workspace, series, stage)
  File "/usr/local/lib/python2.7/dist-packages/mojo-0.1.14.post1-py2.7.egg/mojo/__init__.py", line 262, in run
    auto_secrets=auto_secrets)
  File "/usr/local/lib/python2.7/dist-packages/mojo-0.1.14.post1-py2.7.egg/mojo/phase.py", line 558, in run
    wait()
TypeError: wait() takes exactly 1 argument (0 given)

Looks like wait is expecting the 'log' variable?

Tom Haddon (mthaddon)
Changed in mojo:
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.