Some IPMI implementations report intermittent failures
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
hw-health-charm | Fix Released | Undecided | Unassigned | | |
Bug Description
Some sites running the IPMI SEL checks alert frequently for errors that then self-resolve.
`ipmi-sel -i` reports a number of entries in the log; however, to read them you need to run something like `ipmi-sel --tail 50`, which shows the last 50. `ipmi-sel` on its own shows nothing apart from the last log clear.
The Nagios check for ipmi logs runs something along the lines of:
/usr/local/
If we add a -v, and run it frequently enough, we can catch some of the issues, e.g.:
IPMI Status: Critical [16 system event log (SEL) entries present - details: (System Board SEL Status = Warning, Event Logging Disabled, SEL Almost Full), (Add-in Card 102 PS Redundancy = Critical, Power Supply, Redundancy Lost), (System Board Riser1 Card = N/A, Add In Card, Device Inserted/Device Present), (System Board RAID Presence = N/A, Add In Card, Device Inserted/Device Present), (System Board LCD Presence = N/A, Terminator, Device Removed/Device Absent), (System Board BMC Boot Up = N/A, Microcontroller
If I then run the same check immediately after, it comes back fine.
This would be OK if NRPE re-ran the check every time it was asked; however, this check is run via cron and only runs once every 5 minutes. If the check returns critical, it is possible (likely) that NRPE will re-read the same file contents several times and then fire an alert, which resolves only the next time the cron job runs.
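One way to reduce this kind of flapping is to retry the check inside the cron job before writing the result file, so that a single transient failure is not what NRPE keeps re-reading for the next five minutes. The following is a minimal sketch of that idea only; it is not taken from the linked branch, and `run_with_retries`, `write_status`, the attempt count, and the delay are all hypothetical names and values chosen for illustration.

```python
import os
import subprocess
import time


def run_with_retries(cmd, attempts=3, delay=5.0):
    """Run cmd up to `attempts` times.

    Returns (returncode, output) of the first successful run, or of the
    last failed run if every attempt fails. `attempts` and `delay` are
    illustrative defaults, not values from the charm.
    """
    result = None
    for attempt in range(attempts):
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        if result.returncode == 0:
            break
        # Pause before retrying, but not after the final attempt.
        if attempt < attempts - 1:
            time.sleep(delay)
    return result.returncode, result.stdout


def write_status(path, output):
    """Write the check output atomically so a reader never sees a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        handle.write(output)
    os.replace(tmp_path, path)
```

A cron wrapper could then call `run_with_retries` with the real check command and pass the output to `write_status`, so that only a failure that persists across all attempts reaches the file NRPE reads. The trade-off is that a genuine fault is reported a few retries later than it otherwise would be.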
Related branches
- 🤖 prod-jenkaas-bootstack: Approve (continuous-integration)
- BootStack Reviewers: Pending requested
- BootStack Reviewers: Pending requested
Diff: 47 lines (+14/-8), 1 file modified: src/files/ipmi/cron_ipmi_sensors.py (+14/-8)
Changed in charm-hw-health:
status: New → Fix Committed
milestone: none → 21.10
Changed in charm-hw-health:
status: Fix Committed → Fix Released