Comment 11 for bug 1468653

Menno Finlay-Smits (menno.smits) wrote:

After getting access to the bootstack staging environment I was able to replicate the issue. Once the state server machine agents were stuck, I could see that most workers had shut down and the logs were completely quiet, but strace revealed occasional activity.

Sending a SIGQUIT to the stuck jujud process caused the Go runtime to dump a stack trace for every goroutine. After some investigation I noticed that there were 9 goroutines waiting for a channel to close in leadershipService.BlockUntilLeadershipReleased, and exactly the same number of API server handlers trying to shut down, each waiting for its request to complete.
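
For anyone reading along, here is a minimal standalone sketch (not juju code; the names are made up) of the kind of blocked goroutine that shows up in such a dump. Running it and sending SIGQUIT (kill -QUIT <pid>, or Ctrl-\ in a terminal) makes the Go runtime print every goroutine's stack; the stuck ones appear as "goroutine N [chan receive]":

    // blocked.go -- illustrative only, not juju code.
    package main

    import (
        "fmt"
        "os"
        "time"
    )

    // blockUntilReleased stands in for a call like
    // leadershipService.BlockUntilLeadershipReleased: it waits for the
    // supplied channel to be closed.
    func blockUntilReleased(released <-chan struct{}) {
        <-released // blocks forever if the channel is never closed
    }

    func main() {
        released := make(chan struct{}) // never closed in this sketch

        for i := 0; i < 9; i++ {
            go blockUntilReleased(released)
        }

        fmt.Printf("pid %d: send SIGQUIT to dump all goroutine stacks\n", os.Getpid())
        time.Sleep(time.Hour)
    }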

After further digging I found that whenever the lease manager exited due to an error (a frequent occurrence in a busy environment, it seems) it wouldn't close its release notification channels, leaving clients blocked forever.
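
Roughly, the fix follows this pattern (an illustrative sketch with made-up names, not the actual juju implementation; see the commit below for the real change): when the manager's main loop exits for any reason, every outstanding release-notification channel is closed, so callers blocked in BlockUntilLeadershipReleased-style waits are woken up instead of hanging forever.

    // Sketch of the cleanup pattern; names are hypothetical.
    package lease

    import "sync"

    type manager struct {
        mu      sync.Mutex
        waiting []chan struct{}   // release-notification channels handed to clients
        work    chan func() error // the manager's queue of operations
    }

    // blockChannel returns a channel that is closed when the lease is
    // released -- or, after the fix, when the manager itself dies.
    func (m *manager) blockChannel() <-chan struct{} {
        ch := make(chan struct{})
        m.mu.Lock()
        m.waiting = append(m.waiting, ch)
        m.mu.Unlock()
        return ch
    }

    // loop is the manager's main loop. The deferred cleanup is the essence
    // of the fix: previously an error return left the waiting channels open
    // forever, blocking their readers.
    func (m *manager) loop() error {
        defer m.closeAllWaiters()
        for f := range m.work {
            if err := f(); err != nil {
                return err // manager dies; waiters still unblock via the defer
            }
        }
        return nil
    }

    func (m *manager) closeAllWaiters() {
        m.mu.Lock()
        defer m.mu.Unlock()
        for _, ch := range m.waiting {
            close(ch)
        }
        m.waiting = nil
    }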

The fix is here: https://github.com/mjs/juju/commit/14446749410ad8f8d7c83dcdc503b681e83b8884

It still needs unit tests, but the fix has been manually verified by upgrading a large bootstack environment.