removed model can cause allmodelwatcher to die permanently
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Fix Released | High | Anastasia |
2.3 | Fix Released | Critical | Anastasia |
Bug Description
Every so often, the all-model watcher becomes unavailable and only becomes
available again when the controller machine agent is restarted.
One cause of this is a dead model. We see log messages like this,
repeated over and over again:
2018-01-24 15:02:32 INFO juju.state multiwatcher.go:214 store manager loop failed: model c037a410-
2018-01-24 15:02:32 INFO juju.worker runner.go:483 stopped "allmodelmanager", err: model c037a410-
2018-01-24 15:02:32 ERROR juju.worker runner.go:392 exited "allmodelmanager": model c037a410-
2018-01-24 15:02:32 INFO juju.worker runner.go:467 restarting "allmodelmanager" in 1s
When we look at the models with "juju show-models", we see that the model with that UUID
does exist, but is dead:
{
"agent-version": "2.2.9",
"cloud": "aws",
"controller-
"controller-
"life": "dead",
"model-uuid": "c037a410-
"name": "redacted@
"owner": "redacted@
"region": "eu-west-1",
"short-name": "redacted",
"sla": "unsupported",
"status": {
"current": "destroying",
"message": "tearing down cloud environment",
"since": "just now"
},
"type": "ec2",
"users": {
"admin": {
"access": "admin",
"display-name": "admin",
"last-
},
"redacted@
"access": "admin",
"display-name": "redacted",
"last-
}
}
}
It seems that allModelWatcher is
returning the error when State.Get is called.
A simple fix might be to return a nil error when the cause is ErrNotFound.
It seems like the Pool retains some state on each model in memory
(PoolItem.remove) and this would explain why restarting the machine
agent fixes the issue.
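To illustrate the "return a nil error when the cause is ErrNotFound" idea, here is a minimal sketch. It is not the Juju code: `errNotFound`, `modelInfo`, `getModel`, and `collectModels` are hypothetical stand-ins, and the standard library `errors` package is used in place of Juju's own error helpers. It only shows the shape of the suggestion, where a model that has disappeared is skipped rather than failing the whole loop.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical stand-in for the ErrNotFound cause
// mentioned in the report.
var errNotFound = errors.New("model not found")

// modelInfo is a hypothetical, trimmed-down model record.
type modelInfo struct {
	UUID string
	Life string
}

// getModel is a hypothetical lookup function standing in for State.Get.
type getModel func(uuid string) (*modelInfo, error)

// collectModels gathers details for each UUID. A model that is reported
// as not found is skipped; any other error aborts the whole collection.
func collectModels(uuids []string, get getModel) ([]*modelInfo, error) {
	var out []*modelInfo
	for _, uuid := range uuids {
		m, err := get(uuid)
		if errors.Is(err, errNotFound) {
			// The model was removed in the meantime: ignore it.
			continue
		}
		if err != nil {
			return nil, fmt.Errorf("getting model %q: %w", uuid, err)
		}
		out = append(out, m)
	}
	return out, nil
}

func main() {
	store := map[string]*modelInfo{
		"a": {UUID: "a", Life: "alive"},
	}
	get := func(uuid string) (*modelInfo, error) {
		if m, ok := store[uuid]; ok {
			return m, nil
		}
		return nil, fmt.Errorf("model %q: %w", uuid, errNotFound)
	}
	models, err := collectModels([]string{"a", "gone"}, get)
	fmt.Println(len(models), err)
}
```

With this shape, a single missing model no longer makes the caller fail and restart in a loop, while unrelated errors are still reported.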
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
@Roger Peppe,
Thank you for investigating this failure and for such a detailed analysis in this report \o/
I completely agree with your reasoning, and I believe that a more comprehensive fix would be to filter out dead models from the initial query. This will prevent State.Get from trying to populate model details and will also fix other places that potentially cannot handle dead models. Any handling of dead models should be exceptional and decided case by case: we should never get dead models in that list.
I'll propose against the 2.3 branch first and forward-port to develop (heading into 2.4) once the patch lands.
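As a rough illustration of the "filter out dead models" approach, the sketch below drops dead models from a list before anything tries to load their details. It is an assumption-laden stand-in, not the actual patch: `life`, `modelDoc`, and `withoutDead` are hypothetical names, and the filtering is done in memory here, whereas the comment above proposes doing it in the initial query.

```go
package main

import "fmt"

// life is a hypothetical lifecycle value for a model.
type life string

const (
	alive life = "alive"
	dying life = "dying"
	dead  life = "dead"
)

// modelDoc is a hypothetical, trimmed-down model record.
type modelDoc struct {
	UUID string
	Life life
}

// withoutDead returns the models that are not dead, preserving order.
// Dying models are kept, since they still exist and can be loaded.
func withoutDead(models []modelDoc) []modelDoc {
	kept := make([]modelDoc, 0, len(models))
	for _, m := range models {
		if m.Life != dead {
			kept = append(kept, m)
		}
	}
	return kept
}

func main() {
	models := []modelDoc{
		{UUID: "m1", Life: alive},
		{UUID: "m2", Life: dead},
		{UUID: "m3", Life: dying},
	}
	for _, m := range withoutDead(models) {
		fmt.Println(m.UUID, m.Life)
	}
}
```

Filtering at the source means later code never sees a dead model, so it does not need its own not-found handling for this case.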