Rabbitmq OCF RA: Pacemaker reports a slave running and does nothing to the resource, but lrmd logs contain a periodic error from the 2nd monitor
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | Fix Committed | High | Bogdan Dobrelya |
Mitaka | Fix Released | High | Bogdan Dobrelya |
Newton | Fix Committed | High | Bogdan Dobrelya |
Bug Description
Note that this is an intermittent issue; there are no steps that reproduce it 100% of the time.
Pacemaker sometimes reports a slave as running and takes no action on the resource, even though the lrmd logs contain a periodic error from the 2nd monitor and the rabbitmq app stays down without any recovery for a long time. Here is an example I caught by running a Jepsen test against a rabbit cluster:
lrmd.log:
Apr 5 15:18:12 n1 lrmd: INFO: p_rabbitmq-
Apr 5 15:18:12 n1 lrmd: ERROR: p_rabbitmq-
Apr 5 15:18:12 n1 lrmd: INFO: p_rabbitmq-
Apr 5 15:18:12 n1 lrmd: INFO: p_rabbitmq-
... snip ...
Apr 5 15:18:49 n1 lrmd: INFO: p_rabbitmq-
... snip (reoccurs every 35 sec as expected) ...
Apr 5 15:27:30 n1 lrmd: INFO: p_rabbitmq-
Apr 5 15:28:08 n1 lrmd: INFO: p_rabbitmq-
Apr 5 15:28:33 n1 lrmd: INFO: p_rabbitmq-
pacemaker.log:
Apr 05 15:13:17 [30970] n1 crmd: notice: process_lrm_event: Operation p_rabbitmq-
... snip (no more logs about error exit code!) ...
Apr 05 15:28:34 [30970] n1 crmd: notice: process_lrm_event: Operation p_rabbitmq-
So Pacemaker neither restarts the resource nor "notices" the errors.
Yet it recovers on its own ~15 min later:
Apr 05 15:28:36 [30970] n1 crmd: info: do_lrm_rsc_op: Performing key=3:107:
The solution is to stop the rabbitmq server process from the resource agent itself instead of relying on Pacemaker to restart it.
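The idea can be sketched roughly as below. This is only an illustration, not the actual patch: the function names (`beam_is_running`, `rabbit_app_is_running`, `stop_server_process`, `monitor`) and the `FAKE_*` variables are hypothetical stand-ins, and returning "not running" after the stop is an assumption about how the monitor would then report the state. Only the OCF exit code values are standard.

```shell
#!/bin/sh
# Sketch only: all function names here are hypothetical, not the real RA's.

# Standard OCF exit codes.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# Hypothetical probes. A real agent would check the Erlang VM and the
# rabbit application; here they read FAKE_* variables so the sketch runs.
beam_is_running() { [ "${FAKE_BEAM:-up}" = "up" ]; }
rabbit_app_is_running() { [ "${FAKE_APP:-up}" = "up" ]; }

# Hypothetical stop helper. A real agent would stop the server process.
stop_server_process() { FAKE_BEAM=down; }

monitor() {
    if ! beam_is_running; then
        return $OCF_NOT_RUNNING
    fi
    if ! rabbit_app_is_running; then
        # The server process is alive but the app is down. Rather than
        # reporting an error on every monitor and waiting for Pacemaker
        # to react, stop the server process right away (assumption: the
        # monitor then reports the resource as not running).
        stop_server_process
        return $OCF_NOT_RUNNING
    fi
    return $OCF_SUCCESS
}
```

With the server process stopped, the node's state no longer depends on whether a repeated monitor error is acted upon.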
Changed in fuel:
importance: Undecided → High
milestone: none → 9.0
tags: added: pacemaker rabbitmq
Changed in fuel:
assignee: nobody → Bogdan Dobrelya (bogdando)
Fix proposed to branch: master
Review: https://review.openstack.org/302669