Lenovo IPMI intermittent access issues should be able to be silenced
Bug #1876931 reported by Drew Freiberger
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Charm Helpers | New | Undecided | Unassigned |
hw-health-charm | Won't Fix | Medium | Unassigned |
Bug Description
On some Lenovo hardware, we are seeing IPMI connectivity from the host drop out temporarily, for 30 minutes to 2 hours at a time. The resulting service output is:
UNKNOWN: ipmi_sdr_
We might want to add a charm setting that ignores "internal IPMI error" responses related to ipmi_sdr_cache_open calls. This is not a hardware issue reported through an IPMI log or query, but a failure to query the IPMI interface of the BMC itself, and it can be noisy in some environments depending on the hardware/firmware combination.
Thank you, Drew. I agree this issue causes alert fatigue, and the operator should be able to change the alert threshold as they see fit.
When such an error occurs, it seems wrong to return:
1) OK: there is something wrong if it lasts forever
2) WARNING: it is not related to the hardware but to the IPMI interface
I would suggest implementing a clock (duration in seconds, configurable via Juju) that tracks how long the "internal IPMI error" message has been returned. It could then be acceptable for the check to return OK for up to 2 hours in a row, but anything longer would trigger:
"""
UNKNOWN: Repeated for {{time}} seconds. ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-${hostname_fqdn}: internal IPMI error
"""
Whenever a different message is returned, the clock is reset.
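A minimal sketch of how such a clock could work, assuming the check persists the time it first saw the error in a small state file between runs. The function name `apply_grace_period`, the state file, and the `grace_seconds` parameter are hypothetical and not taken from hw-health-charm. The `message` argument is assumed to be the check output without its status prefix.

```python
import time
from pathlib import Path

# Hypothetical names for illustration only; not part of hw-health-charm.
TRANSIENT_MARKER = "internal IPMI error"
OK, UNKNOWN = 0, 3  # standard Nagios plugin exit codes


def apply_grace_period(code, message, state_file, grace_seconds, now=None):
    """Tolerate the transient IPMI error for up to grace_seconds.

    code/message: result of the underlying IPMI check.
    state_file: Path used to remember when the error was first seen.
    grace_seconds: tolerated duration (assumed to come from charm config).
    """
    now = time.time() if now is None else now

    if TRANSIENT_MARKER not in message:
        # Any different message resets the clock and passes through as-is.
        state_file.unlink(missing_ok=True)
        return code, message

    try:
        first_seen = float(state_file.read_text())
    except (FileNotFoundError, ValueError):
        # First occurrence (or unreadable state): start the clock now.
        first_seen = now
        state_file.write_text(str(first_seen))

    elapsed = int(now - first_seen)
    if elapsed <= grace_seconds:
        return OK, f"OK: IPMI interface error tolerated for {elapsed} seconds"
    return UNKNOWN, f"UNKNOWN: Repeated for {elapsed} seconds. {message}"
```

Keeping the timestamp in a file rather than in memory matters because Nagios-style checks usually run as short-lived processes, so nothing survives between invocations otherwise.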
Would you agree on this approach?