xenial systemd reports 'inactive' instead of 'failed' for service units that repeatedly failed to restart / failed permanently
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
systemd (Ubuntu) | Invalid | Undecided | Unassigned |
Xenial | Fix Released | Medium | Dimitri John Ledkov |
Bug Description
[Impact]
* In case a service unit has repeatedly failed to restart, it should be
reported as 'failed' permanently, but currently it's instead reported
as 'inactive'.
* System monitoring tools that evaluate the status of systemd service units
and act upon it (for example: restart service, report permanent failure)
are currently misled by information in 'systemctl status <unit>.service'.
* System management tools based on such information may take wrong and/or
sub-optimal actions in the managed systems regarding such service units.
* This systemd patch [1] directly addresses this issue (see systemd github
PR #3166 [2]), and its code is still effective in upstream systemd today,
without further fixes/changes (the only changes were in doc text and the
busname files that were removed, but still without further fixes to this).
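To illustrate why the distinction matters to monitoring tools, here is a minimal sketch of a check that maps a unit's `ActiveState` property (as printed by `systemctl show -p ActiveState <unit>`) to a verdict. The function name `unit_verdict` and the verdict strings are hypothetical, not taken from any real monitoring tool. With the bug present, a unit that hit its start limit reports `inactive`, so a check like this would treat it as merely stopped rather than permanently failed.

```shell
#!/bin/sh
# Hypothetical monitoring helper: classify a unit from the KEY=VALUE text
# printed by `systemctl show -p ActiveState <unit>`.
unit_verdict() {
    # $1: output of `systemctl show`, e.g. "ActiveState=failed"
    state=$(printf '%s\n' "$1" | sed -n 's/^ActiveState=//p')
    case "$state" in
        active|activating|reloading) echo "ok" ;;
        failed)                      echo "alert: permanent failure" ;;
        inactive|deactivating)       echo "stopped" ;;
        *)                           echo "unknown" ;;
    esac
}

# Example use on a live system (unit name taken from the test case):
#   unit_verdict "$(systemctl show -p ActiveState fail-on-restart)"
```

A tool that only alerts on the `failed` branch would stay silent for the affected unit until the fix is applied.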
[Test Case]
* This is copied from systemd PR #3166 [2].
* This has been tested by a customer as well, and with its system monitoring
and management solution, for interoperability verification.
$ cat <<EOF | sudo tee /etc/systemd/
[Service]
ExecStart=
Restart=always
EOF
$ sudo systemctl daemon-reload
$ sudo systemctl start fail-on-restart
Before) "Active: inactive (dead)"
$ systemctl status -n0 fail-on-restart
fail-
Loaded: loaded (/etc/systemd/
Active: inactive (dead)
After) "Active: failed (Result: start-limit-hit)"
$ systemctl status -n0 fail-on-restart
fail-
Loaded: loaded (/etc/systemd/
Active: failed (Result: start-limit-hit) since Sat 2018-09-29 11:01:34 UTC; 4s ago
Process: 7066 ExecStart=
Main PID: 7066 (code=exited, status=1/FAILURE)
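The before/after comparison above could be automated roughly as follows. This is only a sketch: the helper names `active_state` and `is_fixed` are made up for illustration, and the parsing assumes the "Active:" line has the shape shown in the outputs above.

```shell
#!/bin/sh
# Extract the state word from the "Active:" line of `systemctl status` text.
active_state() {
    # $1: full text printed by `systemctl status -n0 <unit>`
    printf '%s\n' "$1" | awk '$1 == "Active:" { print $2; exit }'
}

# Succeeds only when the unit is reported as 'failed' (the expected state
# once the start limit has been hit and the fix is applied).
is_fixed() {
    [ "$(active_state "$1")" = "failed" ]
}

# Example use on a live system:
#   is_fixed "$(systemctl status -n0 fail-on-restart)" && echo "fix verified"
```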
[Regression Potential]
* This code changes the point at which the check for the number of (re)start
attempts is made, so regressions in (re)starting units are theoretically
possible.
* However, this code actually reverts a change that caused a regression,
so it goes back to the code that was known to work correctly before ..
* .. and it is still in this form in upstream systemd nowadays,
without further fixes/changes (see comment in the Impact section).
[Other Info]
* Test package was built on Launchpad PPA for all architectures,
with dependencies from Proposed enabled (more up-to-date for SRU).
* The testsuite (in package build time; blocks the package build result)
has identical results to that in buildlog of current xenial-updates.
===
Testsuite summary for systemd 229
===
# TOTAL: 128
# PASS: 109
# SKIP: 19
# XFAIL: 0
# FAIL: 0
# XPASS: 0
# ERROR: 0
===
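One possible way to compare the testsuite summary of two build logs is sketched below. The helper names `summary_counts` and `same_results` are hypothetical, and the sketch assumes the summary lines appear in the logs exactly in the `# NAME: count` form shown above.

```shell
#!/bin/sh
# Print the testsuite summary counters found in a build log as NAME=count lines.
summary_counts() {
    # $1: build log text
    printf '%s\n' "$1" |
        sed -nE 's/^# (TOTAL|PASS|SKIP|XFAIL|FAIL|XPASS|ERROR): *([0-9]+)$/\1=\2/p'
}

# Succeeds when both logs carry identical summary counters.
same_results() {
    [ "$(summary_counts "$1")" = "$(summary_counts "$2")" ]
}

# Example use, with two previously downloaded build logs (file names are
# placeholders):
#   same_results "$(cat test-package.log)" "$(cat xenial-updates.log)" \
#       && echo "identical testsuite results"
```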
[Links]
[1] https:/
[2] https:/
[3] https:/
Changed in systemd (Ubuntu):
status: In Progress → Invalid
assignee: Mauricio Faria de Oliveira (mfo) → nobody
Changed in systemd (Ubuntu Xenial):
status: Triaged → In Progress
More details on the verification of the test package from the Launchpad PPA:
---
Test-case)
$ cat <<EOF | sudo tee /etc/systemd/system/fail-on-restart.service
[Service]
ExecStart=/bin/false
Restart=always
EOF
Before) "Active: inactive (dead)"
$ dpkg -s systemd | grep Version
Version: 229-4ubuntu21.4
$ sudo systemctl daemon-reload
$ sudo systemctl start fail-on-restart
$ systemctl status -n0 fail-on-restart
● fail-on-restart.service
Loaded: loaded (/etc/systemd/system/fail-on-restart.service; static; vendor preset: enabled)
Active: inactive (dead)
$ journalctl --no-pager -u fail-on-restart
<...>
Sep 29 10:59:00 havers systemd[1]: Started fail-on-restart.service.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Main process exited, code=exited, status=1/FAILURE
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Unit entered failed state.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Failed with result 'exit-code'.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Service hold-off time over, scheduling restart.
Sep 29 10:59:00 havers systemd[1]: Stopped fail-on-restart.service.
Sep 29 10:59:00 havers systemd[1]: Started fail-on-restart.service.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Main process exited, code=exited, status=1/FAILURE
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Unit entered failed state.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Failed with result 'exit-code'.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Service hold-off time over, scheduling restart.
Sep 29 10:59:00 havers systemd[1]: Stopped fail-on-restart.service.
Sep 29 10:59:00 havers systemd[1]: Started fail-on-restart.service.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Main process exited, code=exited, status=1/FAILURE
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Unit entered failed state.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Failed with result 'exit-code'.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Service hold-off time over, scheduling restart.
Sep 29 10:59:00 havers systemd[1]: Stopped fail-on-restart.service.
Sep 29 10:59:00 havers systemd[1]: Started fail-on-restart.service.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Main process exited, code=exited, status=1/FAILURE
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Unit entered failed state.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Failed with result 'exit-code'.
Sep 29 10:59:00 havers systemd[1]: fail-on-restart.service: Service hold-off time over, scheduling restart.
Sep 29 10:59:00 havers systemd[1]: Stopped fail-on-restart.service.
Sep 29 10:59:01 havers systemd[1]: Started fail-on-restart.service.
Sep 29 10:59:01 havers systemd[1]: fail-on-restart.service: Main process exited, code=exited, status=1/FAILURE
Sep 29 10:59:01 havers systemd[1]: fail-on-restart.service: Unit entered failed state.
Sep 29 10:59:01 havers systemd[1]: fail-on-restart.service: Failed with result 'exit-code'.
Sep 29 10:59:01 havers systemd[1]: fail-on-restart.service: Service hold-off time over, scheduli...