After getting access to the bootstack staging environment I was able to replicate the issue. Once the state server machine agents were stuck I could see that most workers had shut down and the logs were completely quiet, although strace revealed some occasional activity.
Sending a SIGQUIT to the stuck jujud process caused the Go runtime to dump a stack trace of every goroutine. After some investigation I noticed that there were 9 goroutines waiting for a channel to close in leadershipService.BlockUntilLeadershipReleased, and exactly the same number of API server handlers trying to shut down, waiting for those requests to complete.
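For anyone who hasn't used this trick before, here's a minimal sketch (not Juju code; the names are made up) showing how SIGQUIT exposes goroutines parked on a channel that will never be closed. Run it, then kill -QUIT the pid, and the runtime prints the stack of the goroutine stuck in blockUntilReleased:

    package main

    import (
        "fmt"
        "os"
    )

    // blockUntilReleased parks until the channel is closed -- the same shape
    // as the goroutines seen stuck in BlockUntilLeadershipReleased.
    func blockUntilReleased(released <-chan struct{}) {
        <-released
    }

    func main() {
        ch := make(chan struct{}) // never closed, so the goroutine stays parked
        go blockUntilReleased(ch)
        fmt.Printf("pid %d: send SIGQUIT (kill -QUIT %d) to dump all goroutine stacks\n",
            os.Getpid(), os.Getpid())
        select {} // keep the process alive
    }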
After further digging I found that whenever the lease manager exited due to an error (a frequent occurrence in a busy environment, it seems) it wouldn't close its release notification channels, leaving clients blocked forever.
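The shape of the bug looks roughly like this (a hedged sketch with hypothetical names, not the actual lease manager code): the manager hands out channels to blocked clients, and if its main loop returns on error without closing them, every waiter is stranded. Closing the outstanding channels on the way out, for example in a defer, is the essence of the fix:

    package main

    import (
        "errors"
        "fmt"
        "sync"
    )

    // manager is a stand-in for the lease manager: clients block on channels
    // it hands out, expecting them to be closed when leadership is released.
    type manager struct {
        mu      sync.Mutex
        waiters []chan struct{}
    }

    // blockUntilReleased returns a channel the caller waits on, mirroring the
    // shape of BlockUntilLeadershipReleased.
    func (m *manager) blockUntilReleased() <-chan struct{} {
        ch := make(chan struct{})
        m.mu.Lock()
        m.waiters = append(m.waiters, ch)
        m.mu.Unlock()
        return ch
    }

    // loop is the manager's main loop. The crucial part is the deferred close
    // of all outstanding waiter channels, so an error exit doesn't leave
    // clients blocked forever.
    func (m *manager) loop() error {
        defer func() {
            m.mu.Lock()
            defer m.mu.Unlock()
            for _, ch := range m.waiters {
                close(ch)
            }
            m.waiters = nil
        }()
        // ... real work would happen here; simulate an error exit.
        return errors.New("lease manager failed")
    }

    func main() {
        m := &manager{}
        released := m.blockUntilReleased()
        go func() { _ = m.loop() }() // exits with an error, but still closes waiters
        <-released                   // unblocks instead of hanging
        fmt.Println("waiter released despite manager error")
    }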
The fix is here: https://github.com/mjs/juju/commit/14446749410ad8f8d7c83dcdc503b681e83b8884
It needs unit tests but has been manually verified by upgrading a large bootstack environment.