Ubuntu
pcp package

Comment 29 for bug 1850281

In Red Hat Bugzilla #1721223, mgoodwin (mgoodwin-redhat-bugs) wrote on 2019-11-07:

#29

Re-opening this one, despite it being DUP'd elsewhere, since people seem to be using this BZ rather than any of the others.

After a fair bit of poking around with difficult-to-debug systemd configs and PCP log management scripts, it looks like we basically have what's known as a "readiness protocol mismatch" with what systemd is expecting. Given that the PCP rc scripts pre-date systemd by about two decades, that's not surprising.

Basically, systemd runs the PCP pmlogger.service rc script (/usr/share/pcp/lib/pmlogger), which then runs pmlogger_check to start the pmlogger service (to start at least the primary logger, but also any other loggers configured in the control file or control.d directory in a logging farm configuration).

pmlogger_check then forks off a background shell function which then forks off each pmlogger process with appropriate options (as per the control file), and then busy waits in a loop checking with pmlc for the new pmlogger process. Once each pmlogger is started, pmlogger_check writes the pid files and does various other things and then exits - and then the rc script exits. With a Type=forking config, this is supposed to signal to systemd that the service has started.
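The wait-for-startup step described above can be sketched roughly as follows. This is only an illustration of the polling pattern, not the actual pmlogger_check code (which is a shell script that queries pmlc); the function name, the `probe` callback and the timeout values are all hypothetical.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool],
                     timeout: float = 120.0,
                     interval: float = 1.0) -> bool:
    """Poll probe() until it reports success or the timeout expires.

    probe stands in for "ask pmlc whether the new pmlogger is up";
    it is a hypothetical callback, not a real PCP API.
    """
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True   # the new process answered
        if time.monotonic() >= deadline:
            return False  # gave up waiting
        time.sleep(interval)
```

Whatever time is spent in a loop like this is time during which the rc script has not yet exited, so with Type=forking it counts against systemd's start timeout.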

Unfortunately, systemd is impatient and doesn't cope very well with the double forking; it usually ends up killing the entire process tree (witness the signal 15 messages in pmlogger.log) and reporting the failure as a service start timeout. Since we have Restart=always in the systemd config, yet another rc script is then re-launched, which often succeeds because the initial pmlogconf work was already completed for a new installation, and we end up with one or more pmlogger daemons active. To complicate things, some PCP QA tests that have been interrupted may have left the pmlogger systemd config with Restart=no, but this only affects systems running QA - i.e. mostly PCP developers.

The fix will involve converting to the modern Type=notify readiness protocol, so that each pmlogger sends an sd_notify message to systemd once it has started and completed initialization. This should function regardless of how much forking goes on in the rc and log management scripts. We'll also need to split out each pmlogger in a farm configuration so they're individually managed by systemd as distinct service units (there is a templating facility for this). Similar changes will be needed for other PCP services.
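For anyone unfamiliar with the notify protocol, here is a minimal sketch of what "sending an sd_notify message" amounts to on the wire: a datagram containing READY=1 sent to the Unix socket named in the NOTIFY_SOCKET environment variable. pmlogger itself is written in C and would presumably call libsystemd's sd_notify(); the Python function below is just a hypothetical stand-in to show the mechanism, not the planned implementation.

```python
import os
import socket


def sd_notify(state: str = "READY=1") -> bool:
    """Send a state datagram to systemd's notify socket.

    Returns False when NOTIFY_SOCKET is unset, i.e. the process was
    not started by systemd with a notify socket available.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket,
        # which is addressed with a leading NUL byte instead.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode("utf-8"), path)
    return True
```

The daemon would call this once initialization is complete. Because the message goes straight to systemd rather than being inferred from a parent process exiting, it doesn't matter how many forks sit between the rc script and the daemon. Note that by default systemd only accepts notifications from the unit's main process, so a deeply forked daemon may need NotifyAccess= adjusted in its unit file.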

References:
https://unix.stackexchange.com/questions/401590/systemd-timeout-because-it-doesnt-detect-daemon-forking
https://unix.stackexchange.com/questions/200280/systemd-kills-service-immediately-after-start/200365#200365
https://unix.stackexchange.com/questions/336031/systemd-service-restarts-every-90-seconds/336067#336067
.. and many others - this is not an uncommon problem with legacy service daemons being adapted to the systemd ecosystem.
