swift_replicator_health check needs to handle recovery case
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Swift Storage Charm | New | Undecided | Unassigned |
Bug Description
We're encountering alerts on a customer cloud because nodes have partially caught up on replication.
The current methodology of this check (note: I have not reviewed the sources) appears to be to scan syslog for the "replicated" string and to fire warning/critical alerts if few or no matches have been seen in the last 15 minutes. (The window is configurable.)
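For illustration only, here is a minimal sketch of a syslog-based check along the lines described above. It is not the charm's actual code; the log path, match pattern, window, and thresholds are all assumptions.

```python
#!/usr/bin/env python3
"""Sketch of a syslog-based replication check as described above.

Assumptions (not taken from the charm sources): log file path, match
pattern, 15-minute window, and warning/critical thresholds.
"""
import re
import sys
from datetime import datetime, timedelta

LOG_FILE = "/var/log/syslog"              # assumed log location
WINDOW = timedelta(minutes=15)            # alert window from the description
WARN_BELOW = 3                            # "few matches" -> warning (assumed)
PATTERN = re.compile(r"object-replicator.*replicated")


def recent_matches(now):
    """Count matching syslog lines whose timestamp falls inside WINDOW."""
    count = 0
    with open(LOG_FILE) as fh:
        for line in fh:
            if not PATTERN.search(line):
                continue
            # Classic syslog lines start with e.g. "Mar  3 12:34:56"; the
            # year is not logged, so assume the current one.
            try:
                ts = datetime.strptime(line[:15], "%b %d %H:%M:%S")
                ts = ts.replace(year=now.year)
            except ValueError:
                continue
            if now - ts <= WINDOW:
                count += 1
    return count


if __name__ == "__main__":
    matches = recent_matches(datetime.now())
    if matches == 0:
        print("CRITICAL: no 'replicated' lines seen in the last 15 minutes")
        sys.exit(2)
    if matches < WARN_BELOW:
        print(f"WARNING: only {matches} 'replicated' lines in the last 15 minutes")
        sys.exit(1)
    print(f"OK: {matches} 'replicated' lines in the last 15 minutes")
    sys.exit(0)
```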
The problem is: once replication catches up, we seem to stop seeing this message. We see a final "Object replication complete" message and, unless additional replication becomes necessary, we don't see further messages which would keep this alert from firing.
A more sophisticated check may be required, one that determines whether replication is actually ongoing and does not fire warnings/alerts unless the node is known to be actively replicating.
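One possible direction (my suggestion only, not something the charm does today) would be to read the replicator's recon cache instead of syslog: the object replicator records the completion time of its last pass there, so the check could alert on staleness regardless of whether new "replicated" lines are still being logged. A minimal sketch, assuming the typical Swift recon cache path and key names:

```python
#!/usr/bin/env python3
"""Sketch of a recon-based replication check.

Assumes the replicator's recon cache is enabled; the cache path and JSON
key below are typical Swift defaults, not values verified against this
charm, and the staleness threshold is an assumption.
"""
import json
import sys
import time

RECON_CACHE = "/var/cache/swift/object.recon"  # typical default location
MAX_AGE = 35 * 60  # seconds: 5-minute audit interval + run_pause, plus margin


def last_replication_age():
    """Return seconds since the last completed object replication pass."""
    with open(RECON_CACHE) as fh:
        recon = json.load(fh)
    # object_replication_last is a unix timestamp written at the end of
    # each replication pass.
    return time.time() - recon["object_replication_last"]


if __name__ == "__main__":
    try:
        age = last_replication_age()
    except (OSError, KeyError, ValueError) as exc:
        print(f"UNKNOWN: could not read recon cache: {exc}")
        sys.exit(3)
    if age > MAX_AGE:
        print(f"CRITICAL: last replication pass completed {age:.0f}s ago")
        sys.exit(2)
    print(f"OK: last replication pass completed {age:.0f}s ago")
    sys.exit(0)
```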
The default swift-object-replicator mechanism runs a daemonized loop. Once a run completes at 100%, the run_pause configuration is consulted for how long to sleep before the next pass, but there will always be a replication loop that at least audits each partition on the host to determine whether any partitions have changed since the last sync, and it will log a completion or an update within 5 minutes.
So the longest you should go without a "replicated" audit line in syslog from swift-object-replicator is 5 minutes plus $run_pause. With the default run_pause of 30 seconds, that is about 5.5 minutes, well inside the 15-minute alert window.
Here's the default from the swift config documentation:
run_pause = 30 | Time in seconds to wait between replication passes
https://docs.openstack.org/mitaka/config-reference/object-storage/object-server.html
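For reference, a sketch of how that option would typically appear in /etc/swift/object-server.conf; the value shown is the documented default, and the exact layout on a charm-deployed node may differ:

```ini
# /etc/swift/object-server.conf (excerpt)
[object-replicator]
# Time in seconds to wait between replication passes (documented default)
run_pause = 30
```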
When investigating the status of the node that sparked this bug, I found the replicator had not been functioning for 7+ days, hung attempting to select() on a no-longer-running child process.
The check can be configured to be disabled if you don't run the replicator as a daemon, but running it non-daemonized is not a charm-supported option, so I think this is an invalid bug.