juju agent needs to be more resilient
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Canonical Juju | Triaged | Low | Unassigned | | |
Bug Description
Multiple bugs (most recently lp: #1810712) have resulted in juju agents becoming wedged and failing to operate. In particular, they either wedge while establishing a connection to the API server or fail to reconnect to it. In any real deployment this is a significant problem and annoyance.
The current solution is to manually restart hundreds or thousands of juju agents across hundreds of machines. In other situations the operators of the controller cannot force agents to restart, and instead have to wait for the owners of those machines to notice problems and force restarts themselves.
The juju agent should have a watchdog and restart the jujud process if the API server is unreachable for an extended period of time (e.g. 2-6 hours). This could be implemented either as an external sitter process, or as a watchdog inside the jujud process. The end result should be that the process is sent a SIGTERM, followed by a SIGKILL if it has not terminated within a reasonable time (e.g. 30 seconds).
Ideally the current juju_goroutines output would be opportunistically stored, the problem logged and metrics recorded, so that this can be detected via monitoring and analysed.
Thanks, I can't disagree with what you've got here. It's something we should get into some discussions and see about plotting out a proper path for some self-healing around this for sure.