Looking at the logs, the controller was *very* unhappy before the upgrade triggered. The log file starts with:
machine-log pre upgrade
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/972ef575-2d51-4a97-85c6-34b9bfc272ac/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://10.25.2.111:17070/model/d8d6172f-2e4b-4456-8feb-3def756c2300/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/403fcc26-10a9-4a01-8853-6b2d2dc170b6/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/393e0299-2f17-4608-8e66-5c3bdf41d47e/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://10.25.2.111:17070/model/f7dd66cc-63ab-4373-8b9c-ef4450bba54c/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://10.25.2.111:17070/model/f524d1a3-2c43-4fda-81b0-a90839c413bf/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/664c4806-8e91-4a36-832f-97e31679d697/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/34445fff-8911-4fb3-879e-484e21ef4352/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://10.25.2.109:17070/model/0554bee5-31b3-4dd3-8404-db1b0e68398c/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/69f7e364-670e-44ef-898b-b5492ea4f308/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/4d8861b0-adf6-4010-82e0-978c8fc58dc5/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/1ce99b16-50be-4930-8221-e782cc942463/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://10.25.2.109:17070/model/84538afb-d225-43a2-8fd9-76c5c6661d03/api"
...
That looks like a lot of workers all connecting to the controller at once, indicating they had either recently started or been restarted.
It then follows with a lot of messages like:
2018-01-11 02:19:46 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try again (try again)
2018-01-11 02:19:46 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try again (try again)
That would hint that *some* of these are getting rate limited.
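As a rough illustration of what I'd expect workers to do with that "try again" response, here's a minimal backoff sketch; openAPI and isTryAgain are hypothetical stand-ins, not the real api-caller code:

package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// openAPI stands in for the real connection attempt; here it always
// reports the rate-limit error so the backoff loop is exercised.
func openAPI() error { return errors.New("cannot open api: try again") }

// isTryAgain checks for the controller's "try again" rate-limit reply.
func isTryAgain(err error) bool {
	return err != nil && strings.Contains(err.Error(), "try again")
}

// openWithBackoff retries the connection with exponential backoff so a
// herd of restarted workers doesn't keep hammering the controller.
func openWithBackoff(attempts int, delay time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = openAPI(); !isTryAgain(err) {
			return err // success, or an error that isn't rate limiting
		}
		fmt.Printf("attempt %d rate limited, sleeping %s\n", i+1, delay)
		time.Sleep(delay)
		delay *= 2
	}
	return err
}

func main() {
	if err := openWithBackoff(3, 100*time.Millisecond); err != nil {
		fmt.Println("gave up:", err)
	}
}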
Then we see a huge number of these failures:
2018-01-11 02:19:49 ERROR juju.api.watcher watcher.go:90 error trying to stop watcher: txn watcher restarted
The core txn watcher restarting is a rather serious event, and probably does cause a lot of internal churn.
The "failed to stop watcher" feels very weird. Why would we get an error failing to stop the watcher? Maybe our teardown is getting an error while tearing down and failing to finish cleaning up what it can?
A couple of minutes later we're still getting rate limited:
2018-01-11 02:21:30 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try again (try again)
And then we get the first "Closed explicitly":
2018-01-11 02:21:31 ERROR juju.worker.dependency engine.go:551 "compute-provisioner" manifold worker returned unexpected error: could not get controller info: cannot get controllers document: Closed explicitly
Mostly, the controller appears to be thrashing itself with all this tear-down activity, which may be why it wasn't able to notice that an upgrade was pending. I'm not sure, though, because it does register the upgrade at the very end of the file:
2018-01-11 02:25:52 INFO juju.worker.upgrader error.go:33 upgraded from 2.2.8-xenial-amd64 to 2.2.9-xenial-amd64 ("https://10.25.2.109:17070/model/4da59b22-9710-4e69-840a-be49ee864a97/tools/2.2.9-xenial-amd64")
ERROR must restart: an agent upgrade is available
It may also be that machines 0 and 2 noticed earlier that an upgrade had been triggered, and the workers on machine-1 that were connected to machines 0 and 2 then got restarted, generating load and preventing machine-1 from upgrading itself.
We should rethink how model agents work, because they shouldn't be connecting to other controllers. If a worker for model X is running on machine-1, it should only use its connection to localhost to process that model.
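Something along these lines is what I have in mind; pickAPIAddr is a hypothetical helper, not an existing Juju function, but it shows the idea of preferring the localhost endpoint when one is available:

package main

import (
	"fmt"
	"strings"
)

// pickAPIAddr prefers the local controller endpoint over peer
// addresses, falling back to the first address otherwise.
func pickAPIAddr(addrs []string) string {
	for _, a := range addrs {
		if strings.HasPrefix(a, "localhost:") || strings.HasPrefix(a, "127.0.0.1:") {
			return a
		}
	}
	if len(addrs) > 0 {
		return addrs[0]
	}
	return ""
}

func main() {
	addrs := []string{"10.25.2.109:17070", "localhost:17070", "10.25.2.111:17070"}
	// Prints "dialing localhost:17070" rather than a peer's address.
	fmt.Println("dialing", pickAPIAddr(addrs))
}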