hw-health-charm

Some IPMI implementations report intermittent failures

Bug #1945151 reported by Xav Paice on 2021-09-27

This bug affects 1 person

Affects          Status        Importance  Assigned to  Milestone
hw-health-charm  Fix Released  Undecided   Unassigned   hw-health-charm 21.10

Bug Description

Some sites running the IPMI SEL checks alert frequently for errors that then self-resolve.

`ipmi-sel -i` reports a number of entries in the log. However, to read them you need to run something like `ipmi-sel --tail 50`, which shows the last 50. `ipmi-sel` alone doesn't show anything apart from the last log clear.

The Nagios check for ipmi logs runs something along the lines of:

/usr/local/lib/nagios/plugins/check_ipmi_sensor --sexclude /var/lib/nagios/ipmi_exclude --selexclude /var/lib/nagios/sel_exclude

If we add a -v, and run it frequently enough, we can catch some of the issues, e.g.:

IPMI Status: Critical [16 system event log (SEL) entries present - details: (System Board SEL Status = Warning, Event Logging Disabled, SEL Almost Full), (Add-in Card 102 PS Redundancy = Critical, Power Supply, Redundancy Lost), (System Board Riser1 Card = N/A, Add In Card, Device Inserted/Device Present), (System Board RAID Presence = N/A, Add In Card, Device Inserted/Device Present), (System Board LCD Presence = N/A, Terminator, Device Removed/Device Absent), (System Board BMC Boot Up = N/A, Microcontroller/Coprocessor, Device Enabled), (System Board NIC1 Presence = N/A, Add In Card, Device Inserted/Device Present), (Connectivity Switch 97 Port2 Link Down = Warning, Slot/Connector, Slot is Disabled), (Connectivity Switch 97 Port3 Link Down = Warning, Slot/Connector, Slot is Disabled), (Connectivity Switch 97 Port4 Link Down = Warning, Slot/Connector, Slot is Disabled), (Connectivity Switch 96 LOM P1 Link Down = Warning, Slot/Connector, Slot is Disabled), (Connectivity Switch 96 LOM P2 Link Down = Warning, Slot/Connector, Slot is Disabled), (Connectivity Switch 96 LOM P4 Link Down = Warning, Slot/Connector, Slot is Disabled), (Connectivity Switch 97 Port1 Link Down = Warning, Slot/Connector, Slot is Disabled), (Connectivity Switch 97 Port1 Link Down = Warning, Slot/Connector, Slot is Disabled), (Connectivity Switch 97 Port1 Link Down = Warning, Slot/Connector, Slot is Disabled) - fix the reported issues and clear your SEL or exclude specific SEL entries using the -sx or -xST option] | 'Inlet Temp'=23.00;~:46.00;~:48.00 'Outlet Temp'=35.00;~:75.00; 'PCH Temp'=59.00;~:86.00; 'CPU1 Core Rem'=35.00 'CPU2 Core Rem'=37.00 'CPU1 DTS'=-60.00;~:-1.00; 'CPU2 DTS'=-57.00;~:-1.00; 'Cpu1 Margin'=-50.00 'Cpu2 Margin'=-48.00 'CPU1 MEM Temp'=35.00;~:95.00; 'CPU2 MEM Temp'=37.00;~:95.00; 'SYS 3.3V'=3.32;;2.96:3.62 'SYS 5V'=5.13;;4.50:5.49 'SYS 12V_1'=12.24;;10.80:13.20 'SYS 12V_2'=12.24;;10.80:13.20 'CPU1 DDR VPP1'=2.58;;2.24:2.74 'CPU1 DDR VPP2'=2.56;;2.24:2.74 'CPU2 DDR VPP1'=2.58;;2.24:2.74 'CPU2 DDR VPP2'=2.56;;2.24:2.74 'FAN1 Speed'=3600.00 'FAN2 Speed'=3600.00 'FAN3 Speed'=3480.00 'FAN4 Speed'=3480.00 'Power'=228.00 'Disks Temp'=34.00 'RAID Temp'=63.00;~:105.00; 'Raid BBU Temp'=27.00;~:65.00; 'Power1'=132.00 'PS1 VIN'=52.00 'PS1 Inlet Temp'=33.00 'Power2'=108.00 'PS2 VIN'=52.00 'PS2 Inlet Temp'=32.00 'CPU1 VCore'=1.79;;1.23:2.04 'CPU2 VCore'=1.79;;1.23:2.04 'CPU1 DDR VDDQ'=1.22;;1.14:1.26 'CPU1 DDR VDDQ2'=1.22;;1.14:1.26 'CPU2 DDR VDDQ'=1.22;;1.14:1.26 'CPU2 DDR VDDQ2'=1.22;;1.14:1.26 'CPU1 VDDQ Temp'=32.00;~:120.00; 'CPU2 VDDQ Temp'=36.00;~:120.00; 'CPU1 VRD Temp'=37.00;~:120.00; 'CPU2 VRD Temp'=37.00;~:120.00; 'CPU1 VSA'=0.87;;0.45:1.21 'CPU2 VSA'=0.87;;0.45:1.21 'CPU1 VCCIO'=0.99;;0.84:1.16 'CPU2 VCCIO'=0.99;;0.84:1.16 'PCH VPVNN'=0.99;;0.73:1.15 'PCH PRIM 1V05'=1.04;;0.91:1.19 'PCIe RAID1 Temp'=63.00;~:105.00;

If I then run the same check immediately after, it comes back fine.
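
To compare a failing run against a passing one, it may help to split the plugin output into its parts. The following is only a rough sketch, assuming the output keeps the shape shown above (status text, then ` | `, then perfdata, with SEL details as parenthesised `name = state, ...` groups). The function name `parse_plugin_output` is made up for illustration and is not part of check_ipmi_sensor or the charm.

```python
import re


def parse_plugin_output(output):
    """Split one line of plugin output into (status_text, sel_entries, perfdata).

    Hypothetical helper: assumes the 'text | perfdata' layout seen in this report.
    """
    text, _, perf = output.partition(" | ")

    # SEL details look like "(name = state, type, event)". Requiring an "="
    # inside the parentheses skips unrelated groups such as "(SEL)".
    entries = []
    for group in re.findall(r"\(([^()]*=[^()]*)\)", text):
        name, _, rest = group.partition(" = ")
        fields = [field.strip() for field in rest.split(",")]
        entries.append({"name": name.strip(), "state": fields[0], "detail": fields[1:]})

    # Perfdata items look like 'label'=value[;thresholds]; keep only the value.
    perfdata = {
        label: float(value)
        for label, value in re.findall(r"'([^']+)'=(-?\d+(?:\.\d+)?)", perf)
    }
    return text.strip(), entries, perfdata
```

Feeding it the output from a critical run and from the following OK run would show which SEL entries were present only in the failing run.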

This would be OK if NRPE were to recheck every time it's asked. However, this check is run via cron and only runs once every 5 minutes. If the check returns critical, it's possible (likely) that NRPE will re-read the same file info several times and then fire an alert, which only resolves the next time the cron job runs.
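
One possible mitigation, sketched here purely as an illustration and not necessarily what was released in hw-health-charm 21.10, is for the cron job to retry a non-OK result a few times before writing the file NRPE reads. The names `run_check` and `check_with_retries`, and the default attempt count and delay, are assumptions made for this example.

```python
import subprocess
import time


def run_check(cmd):
    """Run a Nagios plugin command (a list of arguments); return (exit_code, output)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout.strip()


def check_with_retries(runner, attempts=3, delay=10.0, sleep=time.sleep):
    """Call runner() until it reports OK (exit code 0) or attempts run out.

    Returns the last (exit_code, output) pair, so a persistent failure is
    still reported while a one-off glitch is not.
    """
    code, output = runner()
    for _ in range(attempts - 1):
        if code == 0:
            break
        sleep(delay)  # give the BMC a moment before asking again
        code, output = runner()
    return code, output


# Example wiring (not executed here), using the command from this report:
# cmd = ["/usr/local/lib/nagios/plugins/check_ipmi_sensor",
#        "--sexclude", "/var/lib/nagios/ipmi_exclude",
#        "--selexclude", "/var/lib/nagios/sel_exclude"]
# code, output = check_with_retries(lambda: run_check(cmd))
```

Only the final result would then be written out for NRPE, so a failure that clears on the immediate re-run never reaches the alerting path. The trade-off is that the cron job takes longer whenever a check fails.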