Nagios reports "NRPE: Unable to read output" for ntp check
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
NRPE Charm | New | Undecided | Unassigned |
lxd | New | Undecided | Unassigned |
Bug Description
I'm seeing the status "UNKNOWN: NRPE: Unable to read output" on an NTP check in Nagios.
I have dozens of checks from this same charm, and only one of them reports this.
When entering the unit and manually running the NRPE check, I get:
root@brtlvlty05
Traceback (most recent call last):
File "/opt/ntpmon-
main()
File "/opt/ntpmon-
checkobjs = ntpchecks(
File "/opt/ntpmon-
(output, elapsed) = execute('peers', debug=debug, implementation=
File "/opt/ntpmon-
output = execute_
File "/opt/ntpmon-
output = subprocess.
File "/usr/lib/
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/usr/lib/
with Popen(*popenargs, **kwargs) as process:
File "/usr/lib/
self.
File "/usr/lib/
raise child_exception
FileNotFoundError: [Errno 2] No such file or directory: 'ntpq'
On another (good) machine I get:
root@brtlvlty05
OK: offset is -0.000331 | frequency=-6.007000 offset=-0.000331 peers=8 reach=100.000000 result=0 rootdelay=0.009868 rootdisp=0.001637 runtime=157263 stratum=3 sync=1.000000 sysjitter= sysoffset=
So, looking at the code, it tries to detect whether the server is running ntpd or chrony, and uses different external commands to get information according to the daemon it detects. This code is located at /opt/ntpmon-
The bad one:
root@brtlvlty05
Running ntpd
{'runtime': 165948.77594351768}
Traceback (most recent call last):
File "./process.py", line 220, in <module>
main()
File "./process.py", line 213, in main
checkobjs = ntpchecks(checks, debug=True, implementation=
File "./process.py", line 141, in ntpchecks
(output, elapsed) = execute('peers', debug=debug, implementation=
File "./process.py", line 105, in execute
output = execute_
File "./process.py", line 46, in execute_subprocess
output = subprocess.
File "/usr/lib/
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/usr/lib/
with Popen(*popenargs, **kwargs) as process:
File "/usr/lib/
self.
File "/usr/lib/
raise child_exception
FileNotFoundError: [Errno 2] No such file or directory: 'ntpq'
The good one:
root@brtlvlty05
Running chronyd
{'runtime': 157434.05309963226}
#,x,PHC0,
^,+,10.
^,*,10.
^,+,10.
^,+,10.
^,+,10.
^,+,10.
^,+,10.
elapsed time: 0.003 seconds
0A81B216,
elapsed time: 0.004 seconds
ntpmon frequency=
ntpmon_
ntpmon_
ntpmon_
ntpmon_
ntpmon_
ntpmon_
ntpmon_
ntpmon_
So it clearly thinks one of them is using chrony and the other is using ntpd. But in both cases the ntpd package is not even installed (chrony is):
root@brtlvlty05
ii chrony 3.5-6ubuntu6.2 amd64 Versatile implementation of the Network Time Protocol
ii ntpdate 1:4.2.8p12+
rc systemd-timesyncd 245.4-4ubuntu3.5 amd64 minimalistic service to synchronize local time with NTP servers
root@brtlvlty05
ii chrony 3.5-6ubuntu6.2 amd64 Versatile implementation of the Network Time Protocol
ii ntpdate 1:4.2.8p12+
rc systemd-timesyncd 245.4-4ubuntu3.5 amd64 minimalistic service to synchronize local time with NTP servers
Checking the code again, it does not look at which package is installed but at which daemon is *running* in the process list. And this is the problem: we can find an "ntpd" running on one of them:
root@brtlvlty05
1000111 245835 0.0 0.0 107772 4444 ? Ssl Mar23 0:11 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 111:116
And looking closely, it is running inside an LXD container (its parent is an "lxc monitor" process started from a snap):
root 30547 0.0 0.0 1898184 17428 ? Ss Mar23 0:00 [lxc monitor] /var/snap/
1000000 30568 0.0 0.0 225560 9544 ? Ss Mar23 0:20 \_ /sbin/init
1000111 245835 0.0 0.0 107772 4444 ? Ssl Mar23 0:11 \_ /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 111:116
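The ntpd process in the tree above sits under an lxc monitor, so it most likely lives in a different PID namespace than the host's own daemons. A minimal sketch of how a check could tell the two apart, assuming Linux and enough privileges to read /proc/<pid>/ns/pid; the function names and the proc_root parameter are hypothetical and are not part of ntpmon:

```python
import os


def pid_namespace(pid, proc_root="/proc"):
    """Return the PID-namespace identifier of a process, or None if unreadable."""
    try:
        # On Linux this symlink reads like 'pid:[4026531836]'.
        return os.readlink(os.path.join(proc_root, str(pid), "ns", "pid"))
    except OSError:
        # Process is gone, or we lack permission to inspect it.
        return None


def in_host_namespace(pid, proc_root="/proc"):
    """True only if pid shares the PID namespace of the calling process."""
    own = pid_namespace("self", proc_root)
    return own is not None and pid_namespace(pid, proc_root) == own
```

With a filter like this, a process such as PID 245835 above would presumably be reported as being outside the host namespace and could be ignored when guessing which time daemon the host runs. An unreadable namespace link is treated as "not on the host", which is a conservative choice.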
It makes sense to have lxd initialized, because these are bare-metal machines with many LXD containers inside (both examples). But the question is why ntpd was started on one of them and not on the other.
Also, the ntpmon check should not rely on the mere presence of a running ntpd to "guess" that it should use ntpd commands to get status.
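As an illustration of a more defensive guess, the sketch below only accepts a detected daemon if its client command is actually available on PATH. This is not ntpmon's code: the CLIENTS mapping and pick_implementation are hypothetical names, and the only assumptions are that ntpq queries ntpd and chronyc queries chronyd.

```python
import shutil

# Hypothetical mapping from daemon name to the client command that queries it.
CLIENTS = {"chronyd": "chronyc", "ntpd": "ntpq"}


def pick_implementation(running, which=shutil.which):
    """Return the first running daemon whose client command is on PATH, else None."""
    for daemon in running:
        client = CLIENTS.get(daemon)
        if client and which(client):
            return daemon
    return None
```

On the bad unit, where an ntpd shows up in the process list but ntpq is not installed, a check along these lines would skip ntpd and fall through to chronyd instead of failing with FileNotFoundError.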
Workaround:
Edit process.py and, on line 166, change 'ntpd' to something else such as 'ntpdxxx'. The process name will no longer match, and the chrony client commands will be used instead.
I added lxd to this bug because it seems wrong to start ntpd inside the charm when the host is managing time using chrony. And if this behavior is intended, it is unclear why it did not happen on the other units.