unit destruction depends on unit agents
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
juju-core | Fix Released | Critical | William Reade | |
Bug Description
The command sequence:
$ juju deploy foo
$ juju destroy-service foo
...can have surprising consequences, as follows:
* foo/0 will persist, apparently "alive", until it's deployed; only at that point will it
be destroyed (because the unit agent checks the service).
* if a new machine were created to hold foo/0, and provisioning failed for that machine,
the unit will never be destroyed (except manually, via `destroy-unit`).
* if the unit agent was previously running, but the machine agent went away unexpectedly,
the unit can never be destroyed at all (lp:1089289).
In all these cases, the impact is that the foo service gets "stuck" for longer than it should, waiting on the unit agent. By slightly tweaking service destruction, we can automatically destroy all units; this will trigger existing short-circuit paths and resolve the first two consequences, and move us a step towards a simple fix for the third.
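The proposed tweak can be sketched roughly as follows. This is a simplified model, not juju-core's actual state API: the `Unit`, `Service`, `AgentStarted`, and `Reap` names are hypothetical, and the short-circuit rule (a unit whose agent never started can go straight to Dead) is an assumption standing in for the existing short-circuit paths mentioned above.

```go
package main

import "fmt"

// Life is a simplified lifecycle state (hypothetical; the real state package is richer).
type Life int

const (
	Alive Life = iota
	Dying
	Dead
)

func (l Life) String() string {
	return [...]string{"Alive", "Dying", "Dead"}[l]
}

// Unit is a hypothetical stand-in for a service unit.
type Unit struct {
	Name         string
	Life         Life
	AgentStarted bool // assumption: true once a unit agent has taken responsibility
}

// Destroy marks the unit Dying. If no agent ever started, nothing else would
// advance the lifecycle, so the unit short-circuits straight to Dead.
func (u *Unit) Destroy() {
	if u.Life != Alive {
		return
	}
	if !u.AgentStarted {
		u.Life = Dead
		return
	}
	u.Life = Dying
}

// Service is a hypothetical stand-in for a deployed service.
type Service struct {
	Name  string
	Life  Life
	Units []*Unit
}

// Destroy marks the service Dying and destroys every unit directly,
// instead of waiting for each unit agent to notice the service's state.
func (s *Service) Destroy() {
	if s.Life != Alive {
		return
	}
	s.Life = Dying
	for _, u := range s.Units {
		u.Destroy()
	}
	s.Reap()
}

// Reap drops Dead units and marks a Dying service Dead once no units remain.
func (s *Service) Reap() {
	live := s.Units[:0]
	for _, u := range s.Units {
		if u.Life != Dead {
			live = append(live, u)
		}
	}
	s.Units = live
	if s.Life == Dying && len(s.Units) == 0 {
		s.Life = Dead
	}
}

func main() {
	pending := &Unit{Name: "foo/0"}                     // never deployed
	running := &Unit{Name: "foo/1", AgentStarted: true} // agent is running
	svc := &Service{Name: "foo", Units: []*Unit{pending, running}}

	svc.Destroy()
	fmt.Println(svc.Life, len(svc.Units), running.Life)

	running.Life = Dead // the unit agent finishes its own teardown
	svc.Reap()
	fmt.Println(svc.Life, len(svc.Units))
}
```

In this model the never-deployed unit is cleaned up immediately, while a unit whose agent did start still waits on that agent. That corresponds to the third consequence above, which this change alone does not resolve.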
Related branches
- Juju Engineering: Pending requested
- Diff: 185 lines (+107/-18), 2 files modified
  cmd/jujud/machine.go (+31/-0)
  cmd/jujud/machine_test.go (+76/-18)
description: | updated |
Changed in juju-core: | |
status: | Confirmed → Triaged |
importance: | Undecided → Critical |
assignee: | nobody → William Reade (fwereade) |
description: | updated |
Changed in juju-core: | |
status: | Triaged → In Progress |
Changed in juju-core: | |
milestone: | none → dev-docs |
Changed in juju-core: | |
milestone: | dev-docs → 1.11.3 |
Changed in juju-core: | |
milestone: | 1.11.3 → 1.11.4 |
Changed in juju-core: | |
milestone: | 1.11.4 → 1.11.5 |
Changed in juju-core: | |
status: | In Progress → Fix Committed |
Changed in juju-core: | |
status: | Fix Committed → Fix Released |
tags: | added: landscape |
This is partly a communication issue -- it's intending to say something like "I didn't do anything, because the flag I'd be setting is already set"; and the problem is that the unit agent, because it's not running, can't respond to that flag and advance the lifecycle.
So, that's definitely a problem, and we need --force flags on destroy-machine and destroy-unit (lp:1089291 and lp:1089289), that will cause some other part of the system to take over the appropriate responsibilities and tidy up the entities correctly.
Longer-term, this issue emphasizes the value of a storage management system that could let us migrate unit and machine state onto fresh hardware; but that's not on the cards in the immediate future.
It is correct that, once the instance is unrecoverable (what happened to it, btw?), the only way to remove that machine and unit (and the unit's service, and any relations the unit had joined...) is to destroy the whole environment. But in practice the *environment* itself should not be in trouble -- unless you lose the bootstrap instance, ofc -- and you should be able to continue to interact with other entities without difficulty. I presume the biggest problem is being unable to reuse service names, but I may be misunderstanding your use case... or unaware of additional problems triggered by this situation?