api-server and http-server get stuck in "state: stopping"
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Canonical Juju | Fix Released | Critical | Ian Booth | | |
Bug Description
In a large 2.5.0 controller, the agents can get themselves into a state where they are running but the API is unavailable. There are lots of log messages complaining that workers in the controller can't connect to the local API:
2019-01-23 08:23:09 ERROR juju.worker.
2019-01-23 08:23:09 ERROR juju.worker.
2019-01-23 08:23:10 ERROR juju.worker.
Running juju_engine_report on the controller shows both http-server and api-server have "state: stopping", which doesn't change.
Before the system goes into this bad state we can see this error in the log:
2019-01-23 07:46:17 ERROR juju.worker.
This error is caused by mongo load, particularly at startup. It causes the state worker to restart, which in turn causes other workers, including http-server and api-server, to restart. They then get stuck in "stopping" for reasons we haven't determined yet.
At the moment the mitigation for this is:
* firewall the controllers off (by blocking port 17070 from non-controller machines)
* restart the controllers
* once the connections and goroutines have stabilised on the controllers, open the ports on the controller machines that aren't the mongo primary (in the hope of not causing the mongo timeouts that trigger the problem).
Changed in juju: | |
assignee: | nobody → Ian Booth (wallyworld) |
status: | Triaged → In Progress |
Changed in juju: | |
status: | Fix Committed → Fix Released |
We've also seen situations where only the http-server was stuck in stopping. That seems to be caused by the raft-transport bouncing - the api-server doesn't depend on it, so it wasn't restarting, which meant that the http-server would mux.Wait indefinitely.
This PR fixes that problem (but not the api-server issue yet):
https://github.com/juju/juju/pull/9675