check_octavia.py should provide more information on nagios status line, or should log errors to a log file

Bug #1955592 reported by Paul Goins
This bug affects 1 person
Affects                         Status  Importance  Assigned to  Milestone
charm-openstack-service-checks  New     Undecided   Unassigned

Bug Description

I frequently see Octavia alerts saying that something is amiss, but by the time I can take a look, the issue has sometimes self-resolved, so I can't run the associated check by hand to determine what went wrong. Likewise, when reviewing events which occurred previously, the events raised in Nagios lack enough information to allow for meaningful action.

I haven't looked deeply enough, but this may be especially the case when something is being ignored. I get a Nagios message which looks like this:

  CRITICAL: total_alarms[1], total_crit[1], total_ignored[0], ignoring r'(?:<IGNORED_UUID>)

...Unfortunately, this doesn't give me anything meaningful in event history in Nagios to look at. I don't even know what load balancer or pool had the critical error; I just know that *something* was wrong.

I see in the script that we construct a message object by joining multiple strings together with newlines. We may want to consider a different method which results in longer but more useful strings, or we may want to consider having this script also write to a log file so as to allow for longer responses in a way which would be captured by Graylog, or at the very least have something on disk that we can look at after the fact.
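To make the two suggestions concrete, here is a rough sketch of what I have in mind. It is not taken from check_octavia.py: the `Alarm` record, the function names, and the logger name are all hypothetical, and the real script may hold its results in a different shape. The idea is to keep the existing counters but append the affected resources to the same first line, since as far as I know that first line is what Nagios keeps as the status text, and to write the same details to a log file as well.

```python
import logging
from typing import List, NamedTuple


class Alarm(NamedTuple):
    """Hypothetical record for one failed check; not the script's real type."""
    severity: str        # e.g. "CRITICAL" or "WARNING"
    resource_type: str   # e.g. "loadbalancer" or "pool"
    resource_id: str
    detail: str


def build_status_line(alarms: List[Alarm], ignored: int = 0) -> str:
    """Return a single-line status that names every alarming resource."""
    crit = [a for a in alarms if a.severity == "CRITICAL"]
    level = "CRITICAL" if crit else ("WARNING" if alarms else "OK")
    summary = (
        f"{level}: total_alarms[{len(alarms)}], "
        f"total_crit[{len(crit)}], total_ignored[{ignored}]"
    )
    # Join with "; " rather than newlines so the details stay on the
    # first line of the plugin output.
    details = "; ".join(
        f"{a.resource_type} {a.resource_id}: {a.detail}" for a in alarms
    )
    return f"{summary} - {details}" if details else summary


def make_logger(log_path: str) -> logging.Logger:
    """Create a file logger; the path is chosen by the caller."""
    logger = logging.getLogger("check_octavia_sketch")  # hypothetical name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers
        handler = logging.FileHandler(log_path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def log_alarms(logger: logging.Logger, alarms: List[Alarm]) -> None:
    """Write one log line per alarm so the details survive on disk."""
    for a in alarms:
        logger.error(
            "%s %s %s: %s",
            a.severity, a.resource_type, a.resource_id, a.detail,
        )
```

With something like this, the check would print `build_status_line(alarms, ignored)` as its output and call `log_alarms(make_logger(path), alarms)` before exiting. The log path would need to be somewhere the nagios user can write to and that Graylog already collects.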
