Canonical Juju

api-server and http-server get stuck in "state: stopping"

Bug #1813261 reported by Christian Muirhead on 2019-01-25

This bug affects 3 people

Affects		Status	Importance	Assigned to	Milestone
	Canonical Juju	Fix Released	Critical	Ian Booth	Canonical Juju 2.5.1

Bug Description

In a large 2.5.0 controller, the agents can get themselves into a state where they are running but the API is unavailable. There are lots of log messages complaining that workers in the controller can't connect to the local API:

2019-01-23 08:23:09 ERROR juju.worker.dependency engine.go:632 "api-caller" manifold worker returned unexpected error: [bfe052] "machine-2" cannot open api: unable to connect to API: read tcp 127.0.0.1:38658->127.0.0.1:17071: i/o timeout
2019-01-23 08:23:09 ERROR juju.worker.dependency engine.go:632 "api-caller" manifold worker returned unexpected error: [f4830b] "machine-2" cannot open api: unable to connect to API: read tcp 127.0.0.1:38660->127.0.0.1:17071: i/o timeout
2019-01-23 08:23:10 ERROR juju.worker.dependency engine.go:632 "api-caller" manifold worker returned unexpected error: [129152] "machine-2" cannot open api: unable to connect to API: read tcp 127.0.0.1:38666->127.0.0.1:17071: i/o timeout

Running juju_engine_report on the controller shows both http-server and api-server have "state: stopping", which doesn't change.

Before the system goes into this bad state we can see this error in the log:
2019-01-23 07:46:17 ERROR juju.worker.dependency engine.go:632 "state" manifold worker returned unexpected error: state ping failed: read tcp 127.0.0.1:59338->127.0.0.1:37017: i/o timeout

This error is caused by mongo load, particularly at startup. It causes the state worker to restart, which in turn causes other workers including http-server and api-server. They get stuck at stopping for reasons we haven't determined yet.

At the moment the mitigation for this is:
* firewall the controllers off (by blocking 17070 from non-controller machines)
* restarting the controllers
* once the connections and goroutines have stabilised for the controllers, then we can open the ports for the controller machines that aren't the mongo primary (in the hope of not causing mongo timeouts to trigger the problem).

Revision history for this message

Christian Muirhead (2-xtian) wrote on 2019-01-28:

We've also seen situations where only the http-server was stuck in stopping. That seems to be caused by the raft-transport bouncing - the api-server doesn't depend on it, so it wasn't restarting, which meant that the http-server would mux.Wait indefinitely.

This PR fixes that problem (but not the api-server issue yet).
https://github.com/juju/juju/pull/9675

Revision history for this message

Uros Jovanovic (uros-jovanovic) wrote on 2019-01-28:

Same error here: ERROR juju.worker.dependency engine.go:632 "api-caller" manifold worker returned unexpected error: [a7f617] "machine-2" cannot open api: unable to connect to API: dial tcp 127.0.0.1:443: connect: connection refused

First it was just the mongodb connection failure, but restarting jujud eneded up with the error message above.

Note that this controller was mostly idle, not much traffic at all.

Ian Booth (wallyworld) on 2019-01-29

Changed in juju:
assignee:	nobody → Ian Booth (wallyworld)
status:	Triaged → In Progress

Revision history for this message

Ian Booth (wallyworld) wrote on 2019-01-29:

Hard to track down the cause - all workers which use the http server mux which could block exit are supposed to close when the apiserver dying channel is closed. Perhaps some workers are blocked on reads.

We'll land a PR to exit the http server after 1 minute of waitong with the hope that closing the sockets will unblock any readers and they will also exit. In such a case, we dump the goroutines for further analysis.

https://github.com/juju/juju/pull/9684

Revision history for this message

Ian Booth (wallyworld) wrote on 2019-01-29:

I'll mark as fix committed pending an observed re-occurrence now that a stuck http-server work is terminated.

Changed in juju:
status:	In Progress → Fix Committed

Revision history for this message

Christian Muirhead (2-xtian) wrote on 2019-02-11:

It seems like this could be caused by this bug: https://bugs.launchpad.net/juju/+bug/1815397

If so it is fixed in: https://github.com/juju/juju/pull/9730

Canonical Juju QA Bot (juju-qa-bot) on 2019-02-15

Changed in juju:
status:	Fix Committed → Fix Released

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.