The root cause appears to be that the RabbitMQ restart took too long on node-3: it went down at 00:17 and came back up only at 00:26, as can be seen in lrmd.log from node-3. The restart itself was triggered by an update of the host_ip OCF parameter.
The long restart appears to have been caused by a failed stop action:
2016-09-23T00:17:33.532795+00:00 err: ERROR: RMQ-runtime (beam) couldn't be stopped and will likely became unmanaged. Take care of it manually!
2016-09-23T00:17:33.538996+00:00 info: INFO: p_rabbitmq-server[10049]: stop: action end.
This led Pacemaker to consider the resource failed and to set its fail-count to INFINITY:
Sep 23 00:17:33 [9134] node-1.test.domain.local attrd: info: attrd_cib_callback: Update 151 for fail-count-p_rabbitmq-server[node-3.test.domain.local]=INFINITY: OK (0)
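For anyone scanning logs for the same symptom, here is a minimal sketch that pulls the resource, node and fail-count value out of an attrd line like the one above. The function name and the regular expression are illustrative assumptions based only on this one log line, not a documented Pacemaker log format.

```python
import re

# Assumed pattern, derived from the single attrd line quoted above:
#   fail-count-<resource>[<node>]=<value>:
FAIL_COUNT_RE = re.compile(
    r"fail-count-(?P<resource>[^\[\s]+)\[(?P<node>[^\]]+)\]=(?P<value>\S+?):"
)


def parse_fail_count(line):
    """Return (resource, node, value) for a fail-count update, or None."""
    match = FAIL_COUNT_RE.search(line)
    if match is None:
        return None
    return match.group("resource"), match.group("node"), match.group("value")


# The log line from this report.
ATTRD_LINE = (
    "Sep 23 00:17:33 [9134] node-1.test.domain.local attrd: info: "
    "attrd_cib_callback: Update 151 for "
    "fail-count-p_rabbitmq-server[node-3.test.domain.local]=INFINITY: OK (0)"
)
```

Calling parse_fail_count(ATTRD_LINE) yields the resource p_rabbitmq-server, the node node-3.test.domain.local and the value INFINITY; lines without a fail-count update return None.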
To sum up: we need to fix the stop action of the OCF script so that it does not fail sporadically. The fix will benefit the Mitaka code as well.
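As a rough illustration of what a more robust stop could look like, the sketch below asks a process to exit, waits for a bounded time, and only then kills it. This is an assumption about one possible approach, not the actual fix: the real OCF script is written in shell, the reason the stop failed has not been established here, and the function name and timeouts are hypothetical.

```python
import subprocess


def stop_with_escalation(proc, term_timeout=30.0, kill_timeout=10.0):
    """Stop a child process, escalating from terminate() to kill().

    Returns "terminated" if the process exited after the polite request,
    or "killed" if it had to be killed. Raises subprocess.TimeoutExpired
    if the process is still alive after the kill.
    """
    # Polite request first (SIGTERM on POSIX).
    proc.terminate()
    try:
        proc.wait(timeout=term_timeout)
        return "terminated"
    except subprocess.TimeoutExpired:
        pass  # Still running: fall through to the forced stop.

    # Forced stop (SIGKILL on POSIX), again with a bounded wait so the
    # caller gets a clear error instead of hanging.
    proc.kill()
    proc.wait(timeout=kill_timeout)
    return "killed"
```

The point of the bounded waits is that the stop action either succeeds or reports a definite failure within a known time, rather than leaving the process in an unknown state.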