NRPE check for ceph-osd fails with: File '/var/lib/nagios/ceph-osd-checks' doesn't exist

Bug #2019251 reported by Pedro Castillo
This bug affects 3 people
Affects: Ceph OSD Charm
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

The NRPE check removes the temporary output file created by the collector cron job, which introduces a race condition: if the file is not present when the check runs, the check fails. The commit that added the check (faefe90ce6beb5d2b3721cfb637dd7d661cbc21f) notes that the file is removed so the check always sees fresh data, but in practice the check can fail, and in one environment it has been failing for 10+ hours straight. I think having minute-old stale data is better than having no data at all and a failing check, so removing the section of the check that deletes the output file [0] should fix the issue: the collector cron job already handles a pre-existing output file by opening it in write (truncate) mode [1].

[0] https://opendev.org/openstack/charm-ceph-osd/src/branch/master/files/nagios/check_ceph_osd_services.py#L41-L45
[1] https://opendev.org/openstack/charm-ceph-osd/src/branch/master/files/nagios/collect_ceph_osd_services.py#L72
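The proposed change can be sketched as follows. This is a hypothetical illustration, not the charm's actual code: the path comes from the bug title, and the function name is invented. The point is that the check reads the file and leaves it in place, relying on the collector's write-mode truncation for freshness, and reports UNKNOWN (exit 3) rather than CRITICAL when the collector has not yet produced the file.

```python
#!/usr/bin/env python3
"""Sketch of a delete-free NRPE check read path (illustrative only)."""
import sys

STATE_FILE = "/var/lib/nagios/ceph-osd-checks"  # path from the bug title


def read_check_output(path=STATE_FILE):
    """Return the collector's output lines without deleting the file.

    The file can stay in place because the collector reopens it in
    write ('w') mode, which truncates any previous contents.
    """
    try:
        with open(path) as f:
            return f.read().splitlines()
    except FileNotFoundError:
        # Collector has not run yet: UNKNOWN (3), not CRITICAL (2),
        # so a transient startup gap does not page anyone.
        print(f"UNKNOWN: {path} does not exist yet")
        sys.exit(3)
```

Because the file is never unlinked, any number of consumers (nagios, prometheus) can read it within the same minute without racing each other.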

Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

Agreed: the race this creates can lead to false positives, which cause alert fatigue and can ultimately render a check useless. (As a side note, I can't help but wonder whether there's an underlying environment issue if it's failing in an env for hours on end.)

Changed in charm-ceph-osd:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Danny Cocks (dannycocks) wrote :

Adding to this, we have now found a common root cause for this bug: we are deploying the COS stack while maintaining an old LMA stack. This means there are two upstream consumers of the NRPE check, which can both trigger within the same minute. In our case:

- collect_ceph_osd_services.py runs every minute at 01s
- nagios causes check_ceph_osd_services.py to run every 05:10s
- prometheus causes check_ceph_osd_services.py to run every 05:18s.

This means prometheus never sees the file: the nagios check deletes it at 05:10s, and the collector does not recreate it until the next minute. And it is not possible for us to guarantee that these checks run at least one minute apart.

A simple "fix" is to not delete the file. But to preserve the implicit staleness check, could we check the file's modification time instead? Or use a second file to record the timestamp of the last check?
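The modification-time idea above could look roughly like this. This is a sketch under assumptions: the function name and the 120-second threshold are invented (chosen as twice the collector's one-minute cadence), not anything from the charm.

```python
"""Sketch of an mtime-based staleness check (illustrative only)."""
import os
import time

STALE_AFTER = 120  # seconds; hypothetical threshold, 2x the collector's 1-minute cadence


def is_stale(path, now=None, max_age=STALE_AFTER):
    """Report staleness from the file's mtime instead of delete-on-read.

    The collector rewrites the file every minute, bumping its mtime;
    if the mtime falls too far behind, the collector has stopped running.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except FileNotFoundError:
        return True  # never written: treat as stale
    return (now - mtime) > max_age
```

The check would alert when `is_stale()` returns True, keeping the freshness guarantee the original delete was meant to provide while letting multiple consumers read the same file.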

