Fuel for OpenStack

Rabbitmq OCF RA: Pacemaker reports a slave running and does nothing to the resource, but lrmd logs contain a periodic error from the 2nd monitor

Bug #1567355 reported by Bogdan Dobrelya on 2016-04-07

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Fix Committed	High	Bogdan Dobrelya	Fuel for OpenStack 10.0
Mitaka	Fix Released	High	Bogdan Dobrelya	Fuel for OpenStack 9.0
Newton	Fix Committed	High	Bogdan Dobrelya	Fuel for OpenStack 10.0

Bug Description

Note: this is an intermittent issue; there are no reliable reproduction steps.

Pacemaker sometimes reports the resource as a running slave and takes no action on it, even though the lrmd logs show a periodic error from the 2nd monitor and the rabbitmq app stays down without any recovery for a long time. Here is an example I caught while running a Jepsen test against a RabbitMQ cluster:

lrmd.log:
Apr 5 15:18:12 n1 lrmd: INFO: p_rabbitmq-server[11463]: get_monitor(): master exists and rabbit app is not running. Exiting to be restarted by pacemaker
Apr 5 15:18:12 n1 lrmd: ERROR: p_rabbitmq-server[11463]: get_monitor(): get_status() returns generic error 1
Apr 5 15:18:12 n1 lrmd: INFO: p_rabbitmq-server[11463]: get_monitor(): ensuring this slave does not get promoted.
Apr 5 15:18:12 n1 lrmd: INFO: p_rabbitmq-server[11463]: master_score(): Updating master score attribute with 0
... snip ...
Apr 5 15:18:49 n1 lrmd: INFO: p_rabbitmq-server[12636]: get_monitor(): master exists and rabbit app is not running. Exiting to be restarted by pacemaker
... snip (reoccurs every 35 sec as expected) ...
Apr 5 15:27:30 n1 lrmd: INFO: p_rabbitmq-server[27111]: get_monitor(): master exists and rabbit app is not running. Exiting to be restarted by pacemaker
Apr 5 15:28:08 n1 lrmd: INFO: p_rabbitmq-server[28063]: get_monitor(): master exists and rabbit app is not running. Exiting to be restarted by pacemaker
Apr 5 15:28:33 n1 lrmd: INFO: p_rabbitmq-server[29010]: get_monitor(): master exists and rabbit app is not running. Exiting to be restarted by pacemaker

pacemaker.log:
Apr 05 15:13:17 [30970] n1 crmd: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_35000: unknown error (node=n1, call=171, rc=1, cib-update=27, confirmed=false)
... snip (no more logs about error exit code!) ...
Apr 05 15:28:34 [30970] n1 crmd: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_35000: unknown error (node=n1, call=174, rc=1, cib-update=30, confirmed=false)

So Pacemaker does not restart the resource and does not appear to "notice" the repeated monitor errors.
Yet it recovers on its own about 15 minutes later:
Apr 05 15:28:36 [30970] n1 crmd: info: do_lrm_rsc_op: Performing key=3:107:0:fd9993ea-2897-4c53-ae4c-bc30faf66315 op=p_rabbitmq-server_stop_0
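
The "about 15 minutes" figure can be checked against the pacemaker.log timestamps quoted above. The following is only a small sketch of that arithmetic: the helper names are made up for illustration, and the year 2016 is an assumption taken from the report date, since the log lines do not carry a year.

```python
from datetime import datetime

# Timestamps copied from the pacemaker.log excerpt in this report.
FIRST_MONITOR_ERROR = "Apr 05 15:13:17"  # first "unknown error" from monitor_35000
STOP_OPERATION = "Apr 05 15:28:36"       # p_rabbitmq-server_stop_0 is performed


def parse_ts(stamp: str, year: int = 2016) -> datetime:
    """Parse a 'Mon DD HH:MM:SS' timestamp; the year is an assumption."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def unrecovered_seconds(first_error: str, stop_op: str) -> int:
    """Seconds between the first monitor error and the stop operation."""
    return int((parse_ts(stop_op) - parse_ts(first_error)).total_seconds())


gap = unrecovered_seconds(FIRST_MONITOR_ERROR, STOP_OPERATION)
print(f"{gap} s (~{gap / 60:.1f} min) without recovery")
```

With the two timestamps above this gives 919 seconds, which matches the roughly 15 minutes during which the rabbit app stayed down.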

The solution is for the resource agent to stop the rabbitmq server process itself instead of relying on Pacemaker to restart it.
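
To make the idea concrete, here is a minimal sketch of the decision in Python. It is not the actual OCF resource agent code (the agent is a shell script). The function name, its parameters, and the choice of return codes after stopping are assumptions made for illustration; only the numeric OCF exit codes are standard.

```python
from typing import Callable

# Standard OCF resource agent exit codes.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def monitor_slave(
    master_exists: bool,
    app_running: bool,
    stop_server_process: Callable[[], None],
) -> int:
    """Hypothetical monitor logic for a slave node.

    Before the fix, the 'master exists, app not running' case only
    returned an error and waited for Pacemaker to restart the resource.
    In this sketch the server process is stopped right away.
    """
    if app_running:
        return OCF_SUCCESS
    if master_exists:
        # Do not wait for Pacemaker: stop the process ourselves.
        stop_server_process()
        # Assumption: the error is still reported, as rc=1 in the logs.
        return OCF_ERR_GENERIC
    # Assumption: this branch is a placeholder, not covered by the report.
    return OCF_NOT_RUNNING


# Usage: record whether the stop hook was invoked.
calls = []
rc = monitor_slave(True, False, lambda: calls.append("stop"))
print(rc, calls)
```

Passing the stop action in as a callable keeps the sketch runnable without a real RabbitMQ node; in the real agent this would be whatever routine stops the server process.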

Bogdan Dobrelya (bogdando) on 2016-04-07

Changed in fuel:
importance:	Undecided → High
milestone:	none → 9.0
tags:	added: pacemaker rabbitmq
Changed in fuel:
assignee:	nobody → Bogdan Dobrelya (bogdando)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-04-07: Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/302669

Changed in fuel:
status:	New → In Progress

Bug Checker Bot (bug-checker) wrote on 2016-04-07: Autochecker

(This check performed automatically)
Please, make sure that bug description contains the following sections filled in with the appropriate data related to the bug you are describing:

actual result

version

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags:	added: need-info

OpenStack Infra (hudson-openstack) wrote on 2016-04-08: Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/302669
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=fc1e8aa4c79a3181fa880ff18f883c237de2dd06
Submitter: Jenkins
Branch: master

commit fc1e8aa4c79a3181fa880ff18f883c237de2dd06
Author: Bogdan Dobrelya <email address hidden>
Date: Thu Apr 7 12:58:59 2016 +0200

Stop a rabbitmq pacemaker resource when monitor fails

Upstream PR https://github.com/rabbitmq/rabbitmq-server/pull/731
Closes-bug: #1567355

Change-Id: I83415e0e2a40f0e99e7baa26e35b6f7463c52928
Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status:	In Progress → Fix Committed

Bogdan Dobrelya (bogdando) wrote on 2016-04-19:

related patch https://review.openstack.org/#/c/307623/

OpenStack Infra (hudson-openstack) wrote on 2016-04-19: Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/307635

OpenStack Infra (hudson-openstack) wrote on 2016-04-20: Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/307635
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=b684391018bb3d8cac20083d9217ba821cf02384
Submitter: Jenkins
Branch: stable/mitaka

commit b684391018bb3d8cac20083d9217ba821cf02384
Author: Bogdan Dobrelya <email address hidden>
Date: Thu Apr 7 12:58:59 2016 +0200

Stop a rabbitmq pacemaker resource when monitor fails

Upstream PR https://github.com/rabbitmq/rabbitmq-server/pull/731
Closes-bug: #1567355

Change-Id: I83415e0e2a40f0e99e7baa26e35b6f7463c52928
Signed-off-by: Bogdan Dobrelya <email address hidden>
(cherry picked from commit fc1e8aa4c79a3181fa880ff18f883c237de2dd06)

Alexey Galkin (agalkin) wrote on 2016-04-26:

The fix is missing from 9.0-242. Waiting for a new ISO.

Alexey Galkin (agalkin) wrote on 2016-04-29:

Verified as fixed in 9.0-254.
