Comment 2 for bug 1488777

Ryan Beisner (1chb1n) wrote :

Opinionated assertions follow. :-)

I think this is important in at least two arenas:

 - Production deployment automation where post-deployment activities need to take place; and

 - Test automation, where the deployed model needs to be inspected post-deployment.

The impact of not having some sort of approach in place is that testing begins too early, introducing race conditions where a test may pass when the weather is good, but may start to fail when the substrate is under load, or the internet is slower, or any other variable impacts the timing of things.

We too (OpenStack Engineering) have struggled with this over time, and believe we have conquered the art of systematically detecting when a Juju deployment is actually "done." We did this by implementing extended status messaging in the charms, allowing each charm to declare itself "ready." When all units in a model have done so, we proceed. We've found that all other approaches leave varying degrees of raciness.
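To make the "all units declare ready, then proceed" idea concrete, here is a minimal sketch, not our actual tooling. It assumes the status has already been parsed into a dict (for example from `juju status --format=json`), and it assumes a layout of `services` -> `units` -> `workload-status` -> `current`, with subordinates nested under their principal unit. Check those key names and the "active" state against what your Juju version actually prints; `all_units_ready` is a hypothetical helper name.

```python
def all_units_ready(status, ready_states=("active",)):
    """Return True only if every unit (and subordinate) reports a ready state.

    ASSUMPTION: `status` is a dict shaped like
    services -> units -> workload-status -> current, with subordinates
    nested under their principal unit. Verify against your Juju version.
    """
    units = []
    for service in (status.get("services") or {}).values():
        for unit in (service.get("units") or {}).values():
            units.append(unit)
            # Subordinates are assumed to be listed under the principal unit.
            units.extend((unit.get("subordinates") or {}).values())

    # No units at all means nothing has declared itself ready yet.
    if not units:
        return False

    # A unit with no workload status is treated as not ready.
    return all(
        (unit.get("workload-status") or {}).get("current") in ready_states
        for unit in units
    )
```

A unit that reports nothing counts as not ready, so a charm without extended status will never satisfy this check on its own.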

We've found that in cases where not all charms possess intentional extended status advertising, the juju-wait plugin almost always reliably waits the necessary amount of time for the deployment to complete. Even with that, we've found a couple of gaps and currently have a juju-wait fork and corresponding merge proposal to address those.

Unfortunately, at this time there is no global, uniform extended status message to watch for across all charms. We've implemented a predictable extended status message in the OpenStack charms, but that message may differ from what other charms use. This means that currently, an approach that works for one type of workload may not translate well to another.

Some observations and opinions, mixed with some facts:

`juju deploy foo` exits 0 nearly immediately, so that cannot be used to block/wait.

juju-deployer exits 0 many minutes (2 to 35 minutes in practice) before the OpenStack deployments are actually done.

Amulet's wait logic has grown to be closer to perfect in waiting for things to wrap up, but we still observe test races when using that alone, especially when subordinates are involved.

Amulet's "wait for extended status" logic is solid, if the charm is written to declare itself ready.

For legacy charms which do not / will not have extended status, juju-wait is nearly perfect in predicting readiness.

...

To sum up:

1. juju-wait is the closest thing to a global, generically usable way to block/wait for deployment readiness that I have seen and used.

2. Extended status is the only guaranteed way to block/wait for a deployment to complete, presuming that the charm author's logic checks that the required relations are met, deployed services are up, processes are running and any expected network sockets are bound, listening and responsive -- before declaring itself ready.

3. Since a typical OpenStack deployment includes two charms which do not have extended status (mysql and mongodb), we do both items 1 and 2 above, and that successfully avoids the "Am-I-really-ready?" races we used to battle. Now we are free to chase more meaningful races in the deployed workloads. ;-)

4. Sleep not. If you find yourself about to add time.sleep() anywhere in anything outside of a retry loop, you probably shouldn't. It will eventually race.
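
To illustrate the difference between a bare sleep and a sleep inside a retry loop, here is a minimal sketch. `wait_until` is a hypothetical helper, not part of juju-wait or Amulet; the predicate is any zero-argument callable that returns True once the condition you care about holds, and the default timeout and interval are arbitrary placeholders.

```python
import time


def wait_until(predicate, timeout=1800.0, interval=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it returns True or the timeout expires.

    The only sleep is the pause between checks inside the retry loop;
    the loop exits as soon as the condition holds, and gives up with
    False once the deadline passes instead of hoping a fixed delay
    was long enough.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Usage would look like `if not wait_until(deployment_is_ready): raise RuntimeError("deployment never became ready")`, where `deployment_is_ready` stands in for whatever readiness check you have. The `sleep` and `clock` parameters exist only so the loop can be tested without real delays.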

Note: this is all in the context of Juju 1.x and related tooling as of this date. We're still evaluating how it changes, if at all, in the new and exciting Juju 2.x world.