RabbitMQ OCF timeout should be used without 'su' childs
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | Fix Committed | Critical | Alexander Nevenchannyy |
5.1.x | Fix Committed | Critical | Bogdan Dobrelya |
6.0.x | Fix Committed | Critical | Bogdan Dobrelya |
Bug Description
This issue was discovered at the scale lab, when rabbit nodes were running under load.
The OCF script wraps rabbitmqctl stop, start and wait in timeout, and rabbitmqctl in turn uses 'su': sh -x /usr/sbin/
Here is an example flow (from atop binary logs):
http://
Here is how to test it:
Case a) The 'sleep' should detach to init and run orphaned:
# timeout -s TERM 60 sh -c 'su rabbitmq sh -c "whoami; sleep 1000"' &
# ps auxf
root 32066 0.0 0.0 100932 708 pts/0 S 14:47 0:00 \_ timeout -s TERM 60 sh -c su rabbitmq sh -c "whoami; sleep 1000"
root 32067 0.0 0.0 141316 1564 pts/0 S 14:47 0:00 | \_ su rabbitmq sh -c whoami; sleep 1000
rabbitmq 32068 0.0 0.0 106060 1304 ? Ss 14:47 0:00 | \_ bash -c whoami; sleep 1000 sh
rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 | \_ sleep 1000
(killed)
# ps aux
rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 sleep 1000
Case b) The 'sleep' should terminate as well:
# timeout -s TERM 60 sh -c 'sh -c "whoami; sleep 1000"' &
# ps auxf
root 13586 0.0 0.0 100932 708 pts/0 S 14:51 0:00 \_ timeout -s TERM 60 sh -c sh -c "whoami; sleep 1000"
root 13587 0.0 0.0 106056 1292 pts/0 S 14:51 0:00 | \_ sh -c whoami; sleep 1000
root 13589 0.0 0.0 100904 596 pts/0 S 14:51 0:00 | \_ sleep 1000
(killed)
# ps aux
(now is OK!)
The solution is to issue all timeout-wrapped rabbitmqctl commands as the
rabbitmq user, so that rabbitmqctl does not have to use 'su'.
This issue appears only when the timeout specified for the stop or wait commands is exceeded. That is the usual case under load, hence the critical impact.
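The idea above can be sketched as a small wrapper that puts 'su' outside of timeout instead of inside it. This is only an illustration, not the actual OCF script change: the function name run_as_user_with_timeout is hypothetical, and the 'su' branch assumes a 'su' that accepts -s and passes trailing arguments on to the shell.

```shell
# Hypothetical helper: run a command string under timeout as the given user.
# 'su' (if needed at all) is the outermost process, so timeout and the
# command it supervises both run as that user.
run_as_user_with_timeout() {
    user=$1
    secs=$2
    cmd=$3
    if [ "$(id -un)" = "$user" ]; then
        # Already the target user: no 'su' involved at all.
        timeout -s TERM "$secs" sh -c "$cmd"
    else
        # Assumption: this 'su' hands "sh", $secs and $cmd to the shell
        # as $0, $1 and $2 of the -c script.
        su -s /bin/sh -c 'exec timeout -s TERM "$1" sh -c "$2"' \
            "$user" sh "$secs" "$cmd"
    fi
}
```

For example, run_as_user_with_timeout rabbitmq 60 'rabbitmqctl stop' would start timeout as the rabbitmq user. timeout exits with status 124 when the limit is hit.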
Changed in fuel: | |
milestone: | none → 6.1 |
importance: | Undecided → Critical |
assignee: | nobody → Fuel Library Team (fuel-library) |
status: | New → Confirmed |
description: | updated |
Changed in fuel: | |
assignee: | MOS Linux (mos-linux) → Bogdan Dobrelya (bogdando) |
status: | New → In Progress |
description: | updated |
description: | updated |
tags: | added: scale |
description: | updated |
summary: |
- RabbitMQ OCF timeout does not kill child processes
+ RabbitMQ OCF timeout should be used without 'su' childs |
Changed in fuel: | |
assignee: | Bogdan Dobrelya (bogdando) → Sergii Golovatiuk (sgolovatiuk) |
Changed in fuel: | |
assignee: | Sergii Golovatiuk (sgolovatiuk) → Bogdan Dobrelya (bogdando) |
Changed in fuel: | |
assignee: | Bogdan Dobrelya (bogdando) → Alexander Nevenchannyy (anevenchannyy) |
I believe the proper fix would be to file a bug against timeout, since it should be able to kill the whole process tree, and to patch the timeout package internally for Fuel so that we do not have to wait for the upstream fix.
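To illustrate what "kill the whole process tree" could mean, here is a minimal sketch that walks parent/child links through /proc on Linux. It is not a patch to timeout; the helper names children_of and kill_tree are hypothetical. Because it follows parent PIDs, it does not depend on the children staying in the same process group. It is racy if the tree keeps forking while it is being walked.

```shell
# Hypothetical helper: print the direct children of PID $1 (Linux /proc only).
children_of() {
    for status in /proc/[0-9]*/status; do
        # A process may exit while we scan, so skip unreadable entries.
        ppid=$(sed -n 's/^PPid:[[:space:]]*//p' "$status" 2>/dev/null) || continue
        if [ "$ppid" = "$1" ]; then
            pid=${status#/proc/}
            echo "${pid%/status}"
        fi
    done
}

# Hypothetical helper: send signal $2 (default TERM) to PID $1 and to all of
# its descendants, children first so they cannot be reparented unnoticed.
kill_tree() {
    for child in $(children_of "$1"); do
        kill_tree "$child" "${2:-TERM}"
    done
    kill -s "${2:-TERM}" "$1" 2>/dev/null || true
}
```

A caller would record the PID of the wrapped command and run kill_tree "$pid" TERM once the deadline passes.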