Pacemaker refused to recover RabbitMQ cluster
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | Incomplete | High | Performance QA |
Mitaka | Invalid | High | Performance QA |
Newton | Incomplete | High | Performance QA |
Bug Description
Version: 9.0
Steps to reproduce:
1. Run some Rally tests at scale
Results:
1. While tests were running, RabbitMQ nodes started to die one by one. Below is the timeline:
node-202 - 2016-06-07T13:01:14
node-203 - 2016-06-07T16:17:47
node-201 - 2016-06-08T11:47:13
Each time, Pacemaker did not bring the node back, and we ended up with the RabbitMQ cluster completely down. The situation was resolved only after manual intervention. In pacemaker.log from node-1, one can see the following lines:
Jun 08 11:47:13 [39916] node-201.domain.tld pengine: info: native_color: Resource p_rabbitmq-server:1 cannot run anywhere
Jun 08 11:47:13 [39916] node-201.domain.tld pengine: info: native_color: Resource p_rabbitmq-server:2 cannot run anywhere
Jun 08 11:47:13 [39916] node-201.domain.tld pengine: info: native_color: Resource p_rabbitmq-server:0 cannot run anywhere
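For anyone triaging a similar failure, the log lines above can be picked out of pacemaker.log with a short script. This is only a sketch: the function name `find_unplaceable` and the regular expression are assumptions based on the three lines quoted in this report, and the log format may differ between Pacemaker versions.

```python
import re

# Assumed pattern, derived only from the pengine lines quoted in this report.
# Other Pacemaker versions may format these messages differently.
UNPLACEABLE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\[\d+\]\s+"
    r"(?P<host>\S+)\s+pengine:\s+info:\s+native_color:\s+"
    r"Resource\s+(?P<resource>\S+)\s+cannot run anywhere"
)


def find_unplaceable(lines):
    """Return (timestamp, host, resource) for each 'cannot run anywhere' line."""
    hits = []
    for line in lines:
        match = UNPLACEABLE.match(line.strip())
        if match:
            hits.append(
                (match.group("ts"), match.group("host"), match.group("resource"))
            )
    return hits


# The three lines from this report, used as sample input.
SAMPLE = [
    "Jun 08 11:47:13 [39916] node-201.domain.tld pengine: info: native_color: "
    "Resource p_rabbitmq-server:1 cannot run anywhere",
    "Jun 08 11:47:13 [39916] node-201.domain.tld pengine: info: native_color: "
    "Resource p_rabbitmq-server:2 cannot run anywhere",
    "Jun 08 11:47:13 [39916] node-201.domain.tld pengine: info: native_color: "
    "Resource p_rabbitmq-server:0 cannot run anywhere",
]

if __name__ == "__main__":
    for ts, host, resource in find_unplaceable(SAMPLE):
        print(ts, host, resource)
```

To scan a real log, pass an open file object instead of `SAMPLE`, for example `find_unplaceable(open(path))`, where `path` is wherever pacemaker.log lives on the controller.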
(This check performed automatically)
Please make sure that the bug description contains the following sections, filled in with the appropriate data related to the bug you are describing:
actual result
expected result
For more detailed information on the contents of each of the listed sections, see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug