charm-openstack-service-checks

check_octavia.py should provide more information on nagios status line, or should log errors to a log file

Bug #1955592 reported by Paul Goins on 2021-12-22

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	charm-openstack-service-checks	New	Undecided	Unassigned

Bug Description

I'm frequently observing Octavia alerts that something is amiss; however, by the time I can take a look, the issue has sometimes self-resolved and I can't run the associated check by hand to determine what went wrong. Alternatively, when reviewing events that occurred previously, the events raised in Nagios lack enough information to allow for meaningful action.

I haven't looked deeply enough, but this may be especially the case when something is being ignored. I get a Nagios message which looks like this:

CRITICAL: total_alarms[1], total_crit[1], total_ignored[0], ignoring r'(?:<IGNORED_UUID>)

...Unfortunately, this doesn't give me anything meaningful in event history in Nagios to look at. I don't even know what load balancer or pool had the critical error; I just know that *something* was wrong.

I see in the script that we construct a message object by joining multiple strings together with newlines. We may want to consider a different method which results in longer but more useful strings, or we may want to consider having this script also write to a log file so as to allow for longer responses in a way which would be captured by Graylog, or at the very least have something on disk that we can look at after the fact.
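As a rough illustration of both ideas, here is a minimal sketch, not the actual check_octavia.py code. It assumes each alarm can be represented as a (severity, resource_id, detail) tuple, which is a hypothetical shape chosen for illustration. The names `summarize`, `report`, `build_logger`, and `LOG_PATH` are likewise made up. The sketch puts the affected resource into a single status line and also writes every alarm to a log file on disk.

```python
import logging
import os
import tempfile

# Hypothetical log location; a real deployment would choose its own path.
LOG_PATH = os.path.join(tempfile.gettempdir(), "check_octavia_sketch.log")


def build_logger(path=LOG_PATH):
    """Return a logger that appends timestamped lines to `path`."""
    logger = logging.getLogger("check_octavia_sketch")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking duplicate handlers on repeat calls
        handler = logging.FileHandler(path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def summarize(alarms, max_len=200):
    """Build a one-line status that names the failing resources.

    `alarms` is a list of (severity, resource_id, detail) tuples. This shape
    is an assumption for illustration, not the script's real data model.
    """
    crit = [a for a in alarms if a[0] == "CRITICAL"]
    status = "CRITICAL" if crit else ("WARNING" if alarms else "OK")
    line = "{}: total_alarms[{}], total_crit[{}]".format(
        status, len(alarms), len(crit)
    )
    parts = ["{} {}: {}".format(sev, rid, detail) for sev, rid, detail in alarms]
    if parts:
        line += "; " + "; ".join(parts)
    # Cap the length so the status line stays readable in event history.
    if len(line) > max_len:
        line = line[: max_len - 3] + "..."
    return status, line


def report(alarms, logger=None):
    """Log every alarm in full, then return the single status line."""
    logger = logger or build_logger()
    for sev, rid, detail in alarms:
        log = logger.error if sev == "CRITICAL" else logger.warning
        log("%s %s: %s", sev, rid, detail)
    return summarize(alarms)[1]
```

With a made-up alarm such as `("CRITICAL", "lb-1234", "provisioning_status=ERROR")`, `report` returns `CRITICAL: total_alarms[1], total_crit[1]; CRITICAL lb-1234: provisioning_status=ERROR`, and the same detail is appended to the log file, where it could later be picked up by Graylog or inspected by hand.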
