NRPE check for ceph-osd fails with: File '/var/lib/nagios/ceph-osd-checks' doesn't exist

Bug #2019251 reported by Pedro Castillo
This bug affects 3 people
Affects: Ceph OSD Charm
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

The NRPE check removes the temporary output file created by the collector cron job, which introduces a race condition: if the file is not present when the check runs, the check fails. The commit that added the check (faefe90ce6beb5d2b3721cfb637dd7d661cbc21f) notes that the file is removed so the check always sees fresh data, but in practice the check can fail, and in one environment it has been failing for 10+ hours straight. I think having minute-old stale data is better than having no data at all and a failing check, so removing the section of the check that deletes the output file [0] should fix the issue: the collector cron job already handles a pre-existing output file by opening it in write (truncate) mode [1].

[0] https://opendev.org/openstack/charm-ceph-osd/src/branch/master/files/nagios/check_ceph_osd_services.py#L41-L45
[1] https://opendev.org/openstack/charm-ceph-osd/src/branch/master/files/nagios/collect_ceph_osd_services.py#L72
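The proposed change can be sketched as follows. This is a hypothetical illustration, not the charm's actual code: the path comes from the bug title, and the function name is invented. The point is that the check reads the file and leaves it in place, relying on the collector's write-mode truncation for freshness, and reports UNKNOWN (exit 3) rather than CRITICAL when the collector has not yet produced the file.

```python
#!/usr/bin/env python3
"""Sketch of a delete-free NRPE check read path (illustrative only)."""
import sys

STATE_FILE = "/var/lib/nagios/ceph-osd-checks"  # path from the bug title


def read_check_output(path=STATE_FILE):
    """Return the collector's output lines without deleting the file.

    The file can stay in place because the collector reopens it in
    write ('w') mode, which truncates any previous contents.
    """
    try:
        with open(path) as f:
            return f.read().splitlines()
    except FileNotFoundError:
        # Collector has not run yet: UNKNOWN (3), not CRITICAL (2),
        # so a transient startup gap does not page anyone.
        print(f"UNKNOWN: {path} does not exist yet")
        sys.exit(3)
```

Because the file is never unlinked, any number of consumers (nagios, prometheus) can read it within the same minute without racing each other.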

Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

Agreed: the race this creates can lead to false positives, which cause alert fatigue and can ultimately render a check useless. (As a side note, I can't help but wonder whether there's an underlying environment issue if it's failing in an env for hours on end.)

Changed in charm-ceph-osd:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Danny Cocks (dannycocks) wrote :

Adding to this, we have now found a common root cause for this bug: we are deploying the COS stack while maintaining an old LMA stack. This means there are two upstream consumers of the NRPE check, which can both trigger within the same minute. In our case:

- collect_ceph_osd_services.py runs every minute at 01s
- nagios causes check_ceph_osd_services.py to run every 05:10s
- prometheus causes check_ceph_osd_services.py to run every 05:18s.

This means prometheus never sees the file: the nagios check deletes it at 05:10s, and the collector does not recreate it until the next minute. And it is not possible for us to guarantee that these checks run at least one minute apart.

A simple "fix" is to not delete the file. But to preserve the implicit staleness check, could we check the file's modification time instead? Or use a second file to record the timestamp of the last check?
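The modification-time idea above could look roughly like this. This is a sketch under assumptions: the function name and the 120-second threshold are invented (chosen as twice the collector's one-minute cadence), not anything from the charm.

```python
"""Sketch of an mtime-based staleness check (illustrative only)."""
import os
import time

STALE_AFTER = 120  # seconds; hypothetical threshold, 2x the collector's 1-minute cadence


def is_stale(path, now=None, max_age=STALE_AFTER):
    """Report staleness from the file's mtime instead of delete-on-read.

    The collector rewrites the file every minute, bumping its mtime;
    if the mtime falls too far behind, the collector has stopped running.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except FileNotFoundError:
        return True  # never written: treat as stale
    return (now - mtime) > max_age
```

The check would alert when `is_stale()` returns True, keeping the freshness guarantee the original delete was meant to provide while letting multiple consumers read the same file.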

