Charm Helpers

Lenovo IPMI intermittent access issues should be able to be silenced

Bug #1876931 reported by Drew Freiberger on 2020-05-05

This bug affects 1 person

Affects            Status      Importance   Assigned to   Milestone
Charm Helpers      New         Undecided    Unassigned
hw-health-charm    Won't Fix   Medium       Unassigned

Bug Description

On some Lenovo hardware, we are seeing temporary outages of 30 minutes to 2 hours for IPMI connectivity from the host. The resultant service output is:

UNKNOWN: ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-${hostname_fqdn}: internal IPMI error

We might want to give the charm a setting to ignore "internal IPMI error" responses from ipmi_sdr_cache_open calls. This is not a fault reported through IPMI logs or queries, but a failure to query the BMC's IPMI interface itself, and it can be noisy with certain hardware/firmware combinations in some environments.

Drew Freiberger (afreiberger) on 2020-05-05: description updated

Alvaro Uria (aluria) wrote on 2020-05-05:

Thank you Drew. I agree this issue causes alert fatigue, and the operator could decide to change the threshold of the alert.

When such an error occurs, it seems wrong to return:
1) OK: there is something wrong if it lasts forever
2) WARNING: it is not related to the hardware but to the IPMI interface

I would suggest implementing a clock (duration in seconds, configurable via Juju) that tracks how long the "internal IPMI error" message has been returned. The check could then keep returning OK while the error repeats for up to, say, 2 hours in a row, but anything longer would trigger:
"""
UNKNOWN: Repeated for {{time}} seconds. ipmi_sdr_cache_open: /root/.freeipmi/sdr-cache/sdr-cache-${hostname_fqdn}: internal IPMI error
"""

Whenever a different message is returned, the clock is reset.

Would you agree on this approach?
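
A minimal sketch of the clock idea above, assuming the check wrapper keeps its state in a small JSON file between runs. The function name, the state-file layout, and the `grace_seconds` parameter are hypothetical and not part of the charm; only the "internal IPMI error" marker and the UNKNOWN message format come from this report.

```python
import json
import time
from pathlib import Path

# The marker string comes from the bug report; everything else here
# (names, state-file format) is an illustrative assumption.
TRANSIENT_MARKER = "internal IPMI error"
OK, UNKNOWN = 0, 3  # standard Nagios plugin exit codes


def evaluate(output, state_file, grace_seconds, now=None):
    """Return (exit_code, message) for one check run.

    The first time the transient error is seen, its timestamp is stored.
    While the error keeps repeating within grace_seconds the check stays
    OK; after that it reports UNKNOWN. Any other output resets the clock
    and is handed back with exit_code None for normal processing.
    """
    now = time.time() if now is None else now
    state_file = Path(state_file)

    if TRANSIENT_MARKER not in output:
        # A different message was returned: reset the clock.
        state_file.unlink(missing_ok=True)  # requires Python 3.8+
        return None, output

    if state_file.exists():
        first_seen = json.loads(state_file.read_text())["first_seen"]
    else:
        first_seen = now
        state_file.write_text(json.dumps({"first_seen": first_seen}))

    elapsed = int(now - first_seen)
    if elapsed <= grace_seconds:
        return OK, f"OK: transient IPMI error ignored for {elapsed} seconds"
    return UNKNOWN, f"UNKNOWN: Repeated for {elapsed} seconds. {output}"
```

With `grace_seconds=7200` this would stay quiet for the 2 hour window mentioned above. Whether OK is the right status inside the window is exactly the open question in this comment.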

Changed in charm-hw-health:
status:	New → Confirmed
importance:	Undecided → Medium

Drew Freiberger (afreiberger) wrote on 2020-05-06:

I think if we're going to have an "alarm after X hours" setting, we should just try to set that as the alert threshold within nagios when we add the check rather than coding it into the check script itself.

Drew Freiberger (afreiberger) wrote on 2020-05-06:

max_check_attempts and retry_interval should be tuned for this check. For example, a 15 minute retry_interval with max_check_attempts=8 would give roughly a 2 hour window before a notification is sent.
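
The arithmetic behind that window can be sketched as below. This is an estimate under assumptions: retry_interval is taken to be in minutes (the Nagios default interval_length of 60 seconds), and since the first failed check counts as attempt 1, the delay measured from that first failure is likely closer to the lower figure. The function name is hypothetical.

```python
def notification_window_minutes(max_check_attempts, retry_interval):
    """Estimate how long a service stays in a soft state before Nagios
    can notify.

    Returns (lower, upper) in minutes:
      lower: retries only, counting the first failure as attempt 1
      upper: attempts multiplied by the retry interval
    """
    upper = max_check_attempts * retry_interval
    lower = (max_check_attempts - 1) * retry_interval
    return lower, upper


# The values suggested in this comment: 8 attempts, 15 minutes apart.
lower, upper = notification_window_minutes(8, 15)
print(f"between {lower} and {upper} minutes")
```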

Drew Freiberger (afreiberger) wrote on 2020-05-06:

https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/objectdefinitions.html

Drew Freiberger (afreiberger) wrote on 2020-05-06:

Here is the part of charmhelpers that sets up the service template. This would need to be updated to support adding in these additional variables with a call to charmhelpers.contrib.charmsupport.nrpe.add_check. Maybe add some additional template_kwargs to charmhelpers for future expandability, for use by various charms that might need to set up such tuning?

https://github.com/juju/charm-helpers/blob/master/charmhelpers/contrib/charmsupport/nrpe.py#L131-L143
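
As a rough illustration of the template_kwargs idea, the sketch below renders a Nagios service definition and lets callers pass extra directives through. It is not the charmhelpers implementation: `render_service`, its parameters, and the `generic-service` template name are hypothetical placeholders. Only `max_check_attempts` and `retry_interval` are real Nagios service directives taken from this thread.

```python
def render_service(host_name, description, check_command, **template_kwargs):
    """Render a Nagios service definition as text.

    Any extra keyword arguments are written out as additional
    directives, so callers can tune a check without changes to the
    renderer itself.
    """
    directives = {
        "use": "generic-service",  # placeholder template name (assumption)
        "host_name": host_name,
        "service_description": description,
        "check_command": check_command,
    }
    # Caller-supplied directives are added last and may override defaults.
    directives.update(template_kwargs)

    width = max(len(name) for name in directives) + 2
    lines = ["define service {"]
    lines += [f"    {name:<{width}}{value}" for name, value in directives.items()]
    lines.append("}")
    return "\n".join(lines)


print(render_service(
    "host1", "ipmi", "check_nrpe!check_ipmi",
    max_check_attempts=8, retry_interval=15,
))
```

Passing the directives through as keyword arguments keeps the renderer generic, which is the "future expandability" point made above.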

Peter Sabaini (peter-sabaini) wrote on 2020-05-13:

+1 for adding template kwargs to charmhelpers, this would be useful for other checks as well. Going to add "affects" for charmhelpers here

Peter Sabaini (peter-sabaini) wrote on 2020-05-13:

For the record, I also saw this on a Cisco system (lasting just a few minutes):

Manufacturer: Cisco Systems Inc
Product Name: UCSC-C240-M5S

Drew Freiberger (afreiberger) wrote on 2020-05-13:

This will require charm-nagios to grow a new feature. See bug https://bugs.launchpad.net/charm-nrpe/+bug/1877400 which will need to be worked and closed before this specific check can get the options added.

Eric Chen (eric-chen) wrote on 2023-09-22:

This charm is no longer being actively maintained. Please consider using the new hardware-observer-operator instead (https://github.com/canonical/hardware-observer-operator).
Therefore, I am marking this issue as Won't Fix.

Changed in charm-hw-health:
status:	Confirmed → Won't Fix
