Looking at the logs, the controller was *very* unhappy before the upgrade triggered. The log file starts with:
machine-log pre upgrade
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/972ef575-2d51-4a97-85c6-34b9bfc272ac/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://10.25.2.111:17070/model/d8d6172f-2e4b-4456-8feb-3def756c2300/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/403fcc26-10a9-4a01-8853-6b2d2dc170b6/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/393e0299-2f17-4608-8e66-5c3bdf41d47e/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://10.25.2.111:17070/model/f7dd66cc-63ab-4373-8b9c-ef4450bba54c/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://10.25.2.111:17070/model/f524d1a3-2c43-4fda-81b0-a90839c413bf/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/664c4806-8e91-4a36-832f-97e31679d697/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/34445fff-8911-4fb3-879e-484e21ef4352/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://10.25.2.109:17070/model/0554bee5-31b3-4dd3-8404-db1b0e68398c/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/69f7e364-670e-44ef-898b-b5492ea4f308/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/4d8861b0-adf6-4010-82e0-978c8fc58dc5/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/1ce99b16-50be-4930-8221-e782cc942463/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://10.25.2.109:17070/model/84538afb-d225-43a2-8fd9-76c5c6661d03/api"
...
That looks like a lot of workers all connecting to the controller at once, indicating they had either recently started or been restarted.
It then follows with a lot of messages like:
2018-01-11 02:19:46 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try again (try again)
2018-01-11 02:19:46 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try again (try again)
That would hint that *some* of these are getting rate limited.
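As a rough illustration of what I'd expect workers to do with that "try again" response, here's a minimal backoff sketch; openAPI and isTryAgain are hypothetical stand-ins, not the real api-caller code:

package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// openAPI stands in for the real connection attempt; here it always
// reports the rate-limit error so the backoff loop is exercised.
func openAPI() error { return errors.New("cannot open api: try again") }

// isTryAgain checks for the controller's "try again" rate-limit reply.
func isTryAgain(err error) bool {
	return err != nil && strings.Contains(err.Error(), "try again")
}

// openWithBackoff retries the connection with exponential backoff so a
// herd of restarted workers doesn't keep hammering the controller.
func openWithBackoff(attempts int, delay time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = openAPI(); !isTryAgain(err) {
			return err // success, or an error that isn't rate limiting
		}
		fmt.Printf("attempt %d rate limited, sleeping %s\n", i+1, delay)
		time.Sleep(delay)
		delay *= 2
	}
	return err
}

func main() {
	if err := openWithBackoff(3, 100*time.Millisecond); err != nil {
		fmt.Println("gave up:", err)
	}
}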
Then we see a huge number of these failures:
2018-01-11 02:19:49 ERROR juju.api.watcher watcher.go:90 error trying to stop watcher: txn watcher restarted
The core txn watcher restarting is a rather serious event, and probably does cause a lot of internal churn.
The "failed to stop watcher" feels very weird. Why would we get an error failing to stop the watcher? Maybe our teardown is getting an error while tearing down and failing to finish cleaning up what it can?
A couple of minutes later we're still getting rate limited:
2018-01-11 02:21:30 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try again (try again)
And then we get the first "Closed explicitly":
2018-01-11 02:21:31 ERROR juju.worker.dependency engine.go:551 "compute-provisioner" manifold worker returned unexpected error: could not get controller info: cannot get controllers document: Closed explicitly
Mostly, the controller appears to be thrashing itself with all this tear-down activity, which may be why it wasn't able to notice that an upgrade was pending. I'm not sure, though, because it does register the upgrade at the very end of the file:
2018-01-11 02:25:52 INFO juju.worker.upgrader error.go:33 upgraded from 2.2.8-xenial-amd64 to 2.2.9-xenial-amd64 ("https://10.25.2.109:17070/model/4da59b22-9710-4e69-840a-be49ee864a97/tools/2.2.9-xenial-amd64")
ERROR must restart: an agent upgrade is available
It may also be that machines 0 and 2 noticed earlier that an upgrade had been triggered, and the workers on machine-1 that were connected to machines 0 and 2 then got restarted, generating load and preventing machine-1 from upgrading itself.
We should rethink how model agents work, because they shouldn't be connecting to other controllers. If a worker for model X is running on machine-1, it should only use its connection to localhost to process that model.
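Something along these lines is what I have in mind; pickAPIAddr is a hypothetical helper, not an existing Juju function, but it shows the idea of preferring the localhost endpoint when one is available:

package main

import (
	"fmt"
	"strings"
)

// pickAPIAddr prefers the local controller endpoint over peer
// addresses, falling back to the first address otherwise.
func pickAPIAddr(addrs []string) string {
	for _, a := range addrs {
		if strings.HasPrefix(a, "localhost:") || strings.HasPrefix(a, "127.0.0.1:") {
			return a
		}
	}
	if len(addrs) > 0 {
		return addrs[0]
	}
	return ""
}

func main() {
	addrs := []string{"10.25.2.109:17070", "localhost:17070", "10.25.2.111:17070"}
	// Prints "dialing localhost:17070" rather than a peer's address.
	fmt.Println("dialing", pickAPIAddr(addrs))
}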