[3.1.6] Non leader controller agents lost unexpectedly
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Canonical Juju | Triaged | High | Unassigned | |
Bug Description
Running Juju 3.1.6, all the Juju controllers start in HA and the non-leader units come up. After about 2 hours, though, leader election starts to fail and the 2 non-leader units go down (and stay down).
From the leader unit we see:
2023-09-25 08:20:30 ERROR juju.worker.
2023-09-25 08:20:30 INFO juju.worker.uniter uniter.go:347 unit "controller/0" shutting down: catacomb 0xc004cf4480 is dying
2023-09-25 08:20:30 DEBUG juju.worker.uniter runlistener.go:130 juju-exec listener stopping
2023-09-25 08:20:30 DEBUG juju.worker.uniter runlistener.go:149 juju-exec listener stopped
2023-09-25 08:20:30 DEBUG juju.worker.
2023-09-25 08:20:30 DEBUG juju.worker.
2023-09-25 08:20:33 DEBUG juju.worker.
2023-09-25 08:20:33 DEBUG juju.worker.
2023-09-25 08:20:33 DEBUG juju.worker.uniter uniter.go:932 starting local juju-exec listener on {unix /var/lib/
2023-09-25 08:20:33 INFO juju.worker.uniter uniter.go:363 unit "controller/0" started
2023-09-25 08:20:33 DEBUG juju.worker.uniter runlistener.go:117 juju-exec listener running
2023-09-25 08:20:33 INFO juju.worker.uniter uniter.go:389 hooks are retried false
2023-09-25 08:21:23 DEBUG juju.worker.
2023-09-25 08:21:23 DEBUG juju.worker.
2023-09-25 08:21:23 DEBUG juju.worker.
2023-09-25 08:21:23 DEBUG juju.worker.
stack trace:
lease operation timed out
github.
github.
At the same time, the other controllers say:
2023-09-25 08:17:48 DEBUG juju.worker.
stack trace:
github.
github.
github.
I haven't been able to pick out what caused the worker to stop and fail to start again.
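One way to narrow this down could be to pull the uniter shutdown and start events out of a unit log and line them up against the lease timeouts. The following is only a minimal sketch in Python, assuming the log line layout seen in the excerpts (timestamp, level, module, file:line, message). The function name `uniter_events` and the `sample` list are illustrative and not part of Juju; truncated lines that do not fit the layout are simply skipped.

```python
import re
from datetime import datetime

# Assumed layout, based on the excerpts in this report:
#   <date> <time> <LEVEL> <module> <file.go:line> <message>
LINE_RE = re.compile(
    r'^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) '
    r'(?P<level>[A-Z]+) '
    r'(?P<module>\S+) '
    r'(?P<src>\S+:\d+) '
    r'(?P<msg>.*)$'
)

# Matches the two uniter messages of interest.
UNIT_RE = re.compile(r'unit "([^"]+)" (shutting down|started)')


def uniter_events(lines):
    """Yield (timestamp, kind, unit) for uniter shutdown/start messages.

    kind is "stop" for a "shutting down" message and "start" for "started".
    Lines that do not match the assumed layout are ignored.
    """
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("module") != "juju.worker.uniter":
            continue
        unit = UNIT_RE.match(m.group("msg"))
        if unit:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
            kind = "stop" if unit.group(2) == "shutting down" else "start"
            yield ts, kind, unit.group(1)


# Two lines copied from the leader unit excerpt, used as sample input.
sample = [
    '2023-09-25 08:20:30 INFO juju.worker.uniter uniter.go:347 '
    'unit "controller/0" shutting down: catacomb 0xc004cf4480 is dying',
    '2023-09-25 08:20:33 INFO juju.worker.uniter uniter.go:363 '
    'unit "controller/0" started',
]

events = list(uniter_events(sample))
for ts, kind, unit in events:
    print(ts, kind, unit)
```

Run against the full unit logs from each controller, the same function would list every restart of the uniter, which might make it easier to see whether the non-leader units ever attempt to start again after going down.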
The test run can be found at:
https:/
and the controller crashdump can be found at:
https:/
summary:
- [3.1.6] Non leader controller agents lost as soon as they start
+ [3.1.6] Non leader controller agents lost unexpectedly
All failed runs can be found at https://solutions.qa.canonical.com/bugs/2037308