So Ian and I spent some time digging through the logs again this morning. I think we've worked out that the issue is likely because of a charm requesting a reboot, and things like the raft errors are a red herring. Specifically, these log lines:

2018-12-21 06:07:48 INFO juju.cmd supercommand.go:57 running jujud [2.5-rc1 gc go1.11.3]
2018-12-21 06:07:48 DEBUG juju.cmd supercommand.go:58 args: []string{"/var/lib/juju/tools/machine-0/jujud", "machine", "--data-dir", "/var/lib/juju", "--machine-id", "0", "--debug"}
...
2018-12-21 06:07:50 INFO juju.worker.httpserver worker.go:258 listening on "[::]:17070"
2018-12-21 06:07:51 INFO juju.apiserver.connection request_notifier.go:96 agent login: machine-14 for bbb742e7-2832-457c-8dac-a713e42780ee
2018-12-21 06:07:51 INFO juju.apiserver.common password.go:105 setting password for "machine-14"
2018-12-21 06:07:51 INFO juju.apiserver.connection request_notifier.go:96 agent login: unit-unattended-0 for bbb742e7-2832-457c-8dac-a713e42780ee
2018-12-21 06:07:51 INFO juju.provider.openstack provider.go:162 opening model "ntp-lcy01"
2018-12-21 06:07:51 INFO juju.apiserver.common password.go:105 setting password for "unit-unattended-0"
...
2018-12-21 06:07:54 INFO juju.worker.apicaller connect.go:158 [bbb742] "machine-0" successfully connected to "localhost:17070"
2018-12-21 06:07:54 INFO juju.provider.openstack provider.go:162 opening model "ntp-lcy01"
2018-12-21 06:07:54 INFO juju.provider.openstack provider.go:162 opening model "canonistack1"
2018-12-21 06:07:54 WARNING juju.environs.config config.go:1570 unknown config field "rsyslog-ca-cert"
2018-12-21 06:07:54 WARNING juju.environs.config config.go:1570 unknown config field "rsyslog-ca-key"
2018-12-21 06:07:54 INFO juju.worker.stateconfigwatcher manifold.go:119 tomb dying
2018-12-21 06:07:54 INFO juju.apiserver.connection request_notifier.go:125 agent disconnected: unit-unattended-0 for bbb742e7-2832-457c-8dac-a713e42780ee
2018-12-21 06:07:54 INFO juju.worker.apicaller connect.go:158 [1ec998] "machine-0" successfully connected to "localhost:17070"
...
2018-12-21 06:07:54 INFO juju.apiserver.connection request_notifier.go:125 agent disconnected: unit-unattended-0 for bbb742e7-2832-457c-8dac-a713e42780ee
2018-12-21 06:07:54 INFO juju.worker.apicaller connect.go:158 [1ec998] "machine-0" successfully connected to "localhost:17070"
2018-12-21 06:07:54 INFO juju.provider.openstack provider.go:162 opening model "controller"
2018-12-21 06:07:54 INFO juju.worker.authenticationworker worker.go:103 "machine-0" key updater worker started
2018-12-21 06:07:54 INFO juju.worker.upgradeseries worker.go:164 no series upgrade lock present
2018-12-21 06:07:54 ERROR juju.worker.logger logger.go:63 connection is shut down
...
2018-12-21 06:07:54 INFO juju.apiserver.connection request_notifier.go:96 agent login: unit-ubuntu-lite-0 for 1fc7e89f-9afa-4559-8d15-a03cc9835901
2018-12-21 06:07:54 INFO juju.apiserver.connection request_notifier.go:96 agent login: unit-landscape-client-5 for 1ec99855-c41e-44c7-8e9c-b51e4cdcb3d3
2018-12-21 06:07:54 INFO juju.apiserver.connection request_notifier.go:125 agent disconnected: unit-ubuntu-0 for bbb742e7-2832-457c-8dac-a713e42780ee
2018-12-21 06:07:54 INFO juju.apiserver.connection request_notifier.go:125 agent disconnected: unit-landscape-client-14 for 1ec99855-c41e-44c7-8e9c-b51e4cdcb3d3
2018-12-21 06:07:54 INFO juju.apiserver.connection request_notifier.go:125 agent disconnected: unit-telegraf-2 for 1ec99855-c41e-44c7-8e9c-b51e4cdcb3d3
...
2018-12-21 06:07:56 INFO juju.worker.certupdater certupdater.go:180 controller certificate addresses updated to ["10.55.60.14" "127.0.0.1" "::1" "anything" "juju-apiserver" "juju-mongodb" "localhost"]
2018-12-21 06:07:56 INFO juju.apiserver.connection request_notifier.go:125 agent disconnected: machine-0 for 1fc7e89f-9afa-4559-8d15-a03cc9835901
2018-12-21 06:07:56 INFO juju.worker.httpserver worker.go:164 shutting down HTTP server
2018-12-21 06:07:57 INFO juju.core.raftlease store.go:229 timeout
2018-12-21 06:07:57 WARNING juju.worker.lease.raft manager.go:244 [3b4318] retrying timed out while handling claim
2018-12-21 06:07:57 INFO juju.core.raftlease store.go:229 timeout
2018-12-21 06:07:57 WARNING juju.worker.lease.raft manager.go:244 [3b4318] retrying timed out while handling claim
2018-12-21 06:07:57 INFO juju.core.raftlease store.go:229 timeout
2018-12-21 06:07:57 WARNING juju.worker.lease.raft manager.go:244 [3b4318] retrying timed out while handling claim
2018-12-21 06:07:57 INFO juju.core.raftlease store.go:229 timeout
2018-12-21 06:07:57 WARNING juju.worker.lease.raft manager.go:244 [3b4318] retrying timed out while handling claim
2018-12-21 06:07:59 INFO juju.core.raftlease store.go:229 timeout
2018-12-21 06:07:59 WARNING juju.worker.lease.raft manager.go:244 [3b4318] retrying timed out while handling claim
2018-12-21 06:07:59 INFO juju.cmd.jujud machine.go:492 Caught reboot error
2018-12-21 06:07:59 INFO juju.cmd.jujud machine.go:621 Reboot: Error connecting to state
ERROR dial tcp 127.0.0.1:17070: connect: connection refused

The key is that we're running a "unit-unattended-*" unit, which makes me think we have unattended upgrades on the machine and a charm that says "if apt says you need to reboot, trigger a juju-reboot". And the bug is in those last 3 lines. Namely, it looks like the Juju code that handles rebooting needs to connect to the Juju controller (to unset the reboot flag right before it triggers the machine reboot).

cmd/jujud/agent/machine.go:614:

    func (a *MachineAgent) executeRebootOrShutdown(action params.RebootAction) error {
        // At this stage, all API connections would have been closed
        // We need to reopen the API to clear the reboot flag after
        // scheduling the reboot. It may be cleaner to do this in the reboot
        // worker, before returning the ErrRebootMachine.
        conn, err := apicaller.OnlyConnect(a, api.Open)
        if err != nil {
            logger.Infof("Reboot: Error connecting to state")
            return errors.Trace(err)
        }

My guess is that we messed up the ordering with the changes to have the HTTP worker shut down cleanly. Specifically, ErrRebootMachine is considered "Fatal", which means we shut down everything. However, the code that handles it assumes the API is still available for us to connect and let it know that we're rebooting. Since we've already shut down the API server/HTTP handler by the time we get there, the worker only triggers a restart of the jujud agent, not a full reboot of the machine. (Certainly the timestamps say that if we are rebooting, we're doing so in less than 1s.)

And because we didn't restart the machine, the unattended charm comes back up, and once it fires it tells us again that we need to reboot, and we end up in an infinite loop.

We believe that simply issuing "sudo shutdown -r now" on the machine will get it back to being happy, because that will cause the charm to stop thinking it needs a reboot. We still need to fix the reboot logic as well, but the manual reboot should be a quick way to get the machine working again, so we can test whether there is anything else wrong.
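To make that concrete, here's roughly the shape of the fix I have in mind. This is only a sketch against the excerpt above, not a tested patch (it reuses the identifiers shown there: apicaller.OnlyConnect, api.Open, logger, errors; the flag-clearing and reboot-scheduling steps are placeholders). The alternative, as the existing comment in the code already hints, is to clear the flag in the reboot worker before the fatal-error teardown starts.

    // Sketch only, not a tested patch.
    func (a *MachineAgent) executeRebootOrShutdown(action params.RebootAction) error {
        conn, err := apicaller.OnlyConnect(a, api.Open)
        if err != nil {
            // By this point the "fatal error" teardown may already have
            // stopped the local API server, so a refused connection is
            // expected. Log it and carry on with the reboot instead of
            // returning, so the machine actually restarts; the reboot
            // flag can be cleared when the agent comes back up.
            logger.Warningf("Reboot: could not connect to state, rebooting anyway: %v", err)
        } else {
            defer conn.Close()
            // ... clear the reboot flag over the API as before ...
        }
        // ... schedule the actual machine reboot/shutdown for `action` ...
        return nil
    }

Whichever way we do it, the important property is that a refused connection at this point must not prevent the actual machine reboot.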
The "Raft" errors we are seeing are simply because we just shut down the HTTP server as part of restarting, and the raft engine notices that it can't talk to the HTTP server anymore. We should probably have a cleaner shutdown: if we're the ones initiating it, we shouldn't put those same errors/warnings into the log.
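As a generic illustration of what "cleaner" could look like (made-up names, not Juju's actual lease/raft worker code), the retry path could check a "we are deliberately shutting down" flag before emitting the warning:

    package main

    import (
        "errors"
        "log"
        "sync/atomic"
    )

    // leaseRetrier is a made-up stand-in for the worker that emits the
    // "retrying timed out while handling claim" warnings.
    type leaseRetrier struct {
        shuttingDown int32 // set to 1 once a deliberate shutdown has begun
    }

    // Shutdown marks the worker as intentionally stopping, so later
    // failures are treated as expected rather than warned about.
    func (r *leaseRetrier) Shutdown() {
        atomic.StoreInt32(&r.shuttingDown, 1)
        // ... stop the underlying loop, close transports, etc. ...
    }

    // reportTimeout decides how loudly to log a lease-claim timeout.
    func (r *leaseRetrier) reportTimeout(err error) {
        if atomic.LoadInt32(&r.shuttingDown) == 1 {
            // Expected while we are tearing down the HTTP server/raft
            // transport ourselves; keep it out of the WARNING stream.
            log.Printf("DEBUG ignoring lease timeout during shutdown: %v", err)
            return
        }
        log.Printf("WARNING retrying timed out while handling claim: %v", err)
    }

    func main() {
        r := &leaseRetrier{}
        r.reportTimeout(errors.New("lease claim timed out")) // logged as WARNING
        r.Shutdown()
        r.reportTimeout(errors.New("lease claim timed out")) // demoted to DEBUG
    }

That keeps genuine lease problems visible at WARNING while demoting the expected teardown noise to DEBUG.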