Busy controllers seem to become unstable at intervals, requiring an mgopurge run to recover
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Canonical Juju |
Confirmed
|
Undecided
|
Unassigned |
Bug Description
At intervals (in our case, roughly every two months), we're seeing degraded service on an HA three-node controller cluster.
Internal API services are seen to shut down on the controller machine agents, they generally come back after a manual restart but then fail again later.
Taking the controllers offline to run a full `mgopurge` restores stable service.
This bug is to track figuring out exactly what mgopurge is removing/repairing from the MongoDB store when this happens, so that we can hopefully stop it from occurring in the first place.
Feel free to mark this Incomplete for now.
The most recent such incident was a month ago, and we don't have backups to compare the state of mongo before and after. But we'll add them to this bug and reopen if/when it happens again.
Marking incomplete, as per description. Definitely interested to see the logs when we get them!