init: restart command fails to restart main process when pre-stop stanza exists
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
upstart |
Triaged
|
Medium
|
Unassigned | ||
upstart (Ubuntu) |
Triaged
|
Medium
|
James Hunt |
Bug Description
While testing the portmap daemon's recent changes in Ubuntu which cause statd to properly follow it on stop/start, I tried to restart portmap, only to find that it was not in fact restarted.
This happens with any job that has a pre-stop. The reason is that in job_restart(), the code does this:
job_change_goal (job, JOB_STOP);
job_change_goal (job, JOB_START);
job_change_goal will return as soon as the *first* state change has been completed. If there is no pre-stop, that is the change from JOB_RUNNING to JOB_KILLED, which does dutifully kill the main process.
However, if there is a pre-stop, the pre-stop is run, but then job_change_state returns to job_change_goal, which then returns to job_restart, which then changes the goal back to start, which makes the new job_next_state() one that will bypass the change to JOB_KILLED.
This was found on upstart v0.6.7-3 in Ubuntu, but also exists in the code in the current trunk.
TEST CASE
1. create job file /etc/init/
# test-restart-
pre-stop exec /bin/true
exec /bin/sleep 3600
2. run "start test-restart-
3. immediately run "restart test-restart-
tags: | added: glucid lucid |
summary: |
- restart command fails to restart main process when pre-stop stanza + init: restart command fails to restart main process when pre-stop stanza exists |
Changed in upstart: | |
status: | Confirmed → Triaged |
importance: | Undecided → Medium |
tags: | added: testcase |
Changed in upstart (Ubuntu): | |
status: | Triaged → Incomplete |
status: | Incomplete → Confirmed |
Changed in upstart (Ubuntu): | |
status: | Confirmed → Triaged |
Changed in upstart (Ubuntu): | |
assignee: | nobody → James Hunt (jamesodhunt) |
tags: |
added: trusty removed: glucid lucid |
On further investigation, this does *not* affect jobs that block the 'stopping' event. So I believe this may be related more to the way job_process_run treates pre-stop. According to comments in init/job_process.c:
/* We don't change the state if we're in post-start and there's >process[ PROCESS_ POST_START] PROCESS_ POST_START] > 0)) { >process[ PROCESS_ PRE_STOP] PROCESS_ PRE_STOP] > 0)) {
* a post-start process running, or if we're in pre-stop and
* there's a pre-stop process running; we wait for those to
* finish instead.
*/
if ((job->state == JOB_POST_START)
&& job->class-
&& (job->pid[
state = FALSE;
} else if ((job->state == JOB_PRE_STOP)
&& job->class-
&& (job->pid[
state = FALSE;
}
Later on, if (!state) calls job_change_ goal(job, JOB_RESPAWN) ..
Digging into job_next_state a bit, I found that a goal of JOB_RESPAWN causes job_next_state to change the goal only if state is JOB_POST_START or JOB_PRE_STOP. JOB_POST_START changes the goal to JOB_START, which makes sense. But JOB_PRE_STOP changes the goal to JOB_START as well! This doesn't make much sense to me, and I thought the obvious fix would be to change it to be JOB_STOP, however this caused test_job to fail here:
...with post-start job and a goal of respawn job.c:4129 (test_next_state).
BAD: wrong value for job->goal, expected 1 got 0
at tests/test_
/bin/bash: line 5: 29784 Aborted ${dir}$tst
FAIL: test_job
This was confusing until I realized that the test code is wrong:
/* Check that the next state if we're respawning after a pre-stop
* job is stopping with the goal changed.
*/
TEST_FEATURE ("with post-start job and a goal of respawn");
job->goal = JOB_RESPAWN;
job->state = JOB_PRE_STOP;
TEST_EQ (job_next_state (job), JOB_STOPPING);
TEST_EQ (job->goal, JOB_START);
That should read 'with pre-stop job and a goal of respawn'.
So, this brings to the edge case that I think is causing this problem, and my perceived fix.
If you want a running job to be restarted, it should move from wherever it was, through stopping- >pre-stop- >killed- >post-stop- >starting- > ...
However, this doesn't happen. Instead the goal is changed to STOP, but job_change_goal() does not wait for that to happen, which its docblock would suggest.
Instead, it returns with the state set to JOB_PRE_STOP ..
This is my proposed fix.. though I'm not entirely confident this is the right way to handle it, and I haven't had a chance to test this yet.
=== modified file 'init/job.c'
--- init/job.c 2010-12-14 15:32:41 +0000
+++ init/job.c 2011-01-17 07:56:24 +0000
@@ -1265,6 +1265,12 @@
nih_list_add (&job->blocking, &blocked->entry);
job_change_goal (job, JOB_STOP);
+ if (job->state == JOB_PRE_STOP)
+ {
+ /* hack: should not happen but does */
+ job->goal = JOB_RESPAWN;
+ job_change_goal (job, JOB_STOP);
+ }
job_change_goal (job, JOB_START);
if (! wait)