Comment 4 for bug 1716577

kamlesh parmar (kparmar) wrote:

Do we need to change anything in the configuration so that the RabbitMQ cluster returns to an operational state automatically after an unplanned shutdown of the cluster nodes?

rabbitmq-server fails to start on the 5a10s31 node, with the following in the logs:

BOOT FAILED
===========

Timeout contacting cluster nodes: [rabbit@5a10s29ctrl,rabbit@5a10s30ctrl].

BACKGROUND
==========

This cluster node was shut down while other nodes were still running.
To avoid losing data, you should start the other nodes first, then
start this one. To force this node to start, first invoke
"rabbitmqctl force_boot". If you do so, any changes made on other
cluster nodes after this one was shut down may be lost.

DIAGNOSTICS
===========

attempted to contact: [rabbit@5a10s29ctrl,rabbit@5a10s30ctrl]

rabbit@5a10s29ctrl:
  * unable to connect to epmd (port 4369) on 5a10s29ctrl: address (cannot connect to host/port)

rabbit@5a10s30ctrl:
  * unable to connect to epmd (port 4369) on 5a10s30ctrl: address (cannot connect to host/port)

current node details:
- node name: rabbit@5a10s31ctrl
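The DIAGNOSTICS section shows that epmd (port 4369) could not be reached on either peer, which suggests those hosts, or at least their epmd daemons, were down or unreachable when 5a10s31 tried to boot. Below is a minimal Python sketch for checking this from 5a10s31 before deciding on a recovery step. It only tests whether a TCP connection to port 4369 succeeds; the helper names and the 2-second timeout are assumptions, not anything RabbitMQ itself defines.

```python
import socket
from typing import Dict, List

EPMD_PORT = 4369  # Erlang port mapper daemon port, as reported in the log


def epmd_reachable(host: str, port: int = EPMD_PORT, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unresolvable hostnames and timeouts.
        return False


def check_peers(hosts: List[str], port: int = EPMD_PORT) -> Dict[str, bool]:
    """Map each peer hostname to whether its epmd port accepted a connection."""
    return {host: epmd_reachable(host, port) for host in hosts}
```

For example, `check_peers(["5a10s29ctrl", "5a10s30ctrl"])` run on 5a10s31 should report `False` for both peers while they are down. If both peers are reachable but the node still times out, the problem is likely elsewhere (for example, the Erlang cookie or distribution ports).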

According to the RabbitMQ documentation (the rabbitmqctl man page), some administrative action appears to be required after an unplanned shutdown:
https://www.rabbitmq.com/man/rabbitmqctl.1.man.html

force_boot

Ensure that the node will start next time, even if it was not the last to shut down.
Normally when you shut down a RabbitMQ cluster altogether, the first node you restart should be the last one to go down, since it may have seen things happen that other nodes did not. But sometimes that's not possible: for instance if the entire cluster loses power then all nodes may think they were not the last to shut down.
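The restart-order rule above can be sketched in Python as follows. This assumes the operator has somehow recorded when each node went down (for example from each node's log); the function name and the input format are hypothetical. It returns None when no single last node can be identified, which is the full power-loss situation the man page describes.

```python
from datetime import datetime
from typing import Dict, Optional


def first_node_to_start(shutdown_times: Dict[str, Optional[datetime]]) -> Optional[str]:
    """Return the node that shut down last, or None if that cannot be determined.

    A value of None for a node means its shutdown time is unknown
    (for example after a power loss).
    """
    if not shutdown_times or any(t is None for t in shutdown_times.values()):
        return None
    latest = max(shutdown_times.values())
    last_nodes = [node for node, t in shutdown_times.items() if t == latest]
    # A tie means there is no single "last node", so there is no safe choice either.
    return last_nodes[0] if len(last_nodes) == 1 else None
```

When this returns a node name, that node can be started first and the others joined afterwards; when it returns None, one node has to be told explicitly to boot, which is what force_boot is for.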
In such a case you can invoke rabbitmqctl force_boot while the node is down. This will tell the node to unconditionally start next time you ask it to. If any changes happened to the cluster after this node shut down, they will be lost.
If the last node to go down is permanently lost then you should use rabbitmqctl forget_cluster_node --offline in preference to this command, as it will ensure that mirrored queues which were mastered on the lost node get promoted.
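The choice described in the last two paragraphs can be sketched in Python as a small helper that builds the rabbitmqctl invocation to run on a stopped node. The helper name is hypothetical. The two command forms are the `force_boot` and `forget_cluster_node --offline <node>` commands documented in the man page quoted here.

```python
from typing import List, Optional


def recovery_command(lost_last_node: Optional[str] = None) -> List[str]:
    """Build the rabbitmqctl argv to run on a stopped node.

    If the node that went down last is permanently lost, the man page
    recommends `forget_cluster_node --offline` for that node instead of
    force_boot; otherwise force_boot is used.
    """
    if lost_last_node is not None:
        return ["rabbitmqctl", "forget_cluster_node", "--offline", lost_last_node]
    return ["rabbitmqctl", "force_boot"]
```

Returning an argv list rather than a shell string keeps the node name from being interpreted by a shell if the list is later passed to `subprocess.run`.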
For example:
    rabbitmqctl force_boot
This will force the node not to wait for other nodes next time it is started.
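Applied to the failing 5a10s31 node, the sequence would be to run force_boot while the node is down and then start the service. Below is a minimal Python sketch of that sequence. It assumes a systemd-managed host where the unit is named `rabbitmq-server` (as on Ubuntu); the helper name and the dry-run default are hypothetical. As the man page warns, any changes made on the other nodes after this one went down may be lost.

```python
import subprocess
from typing import List


def force_boot_and_start(service: str = "rabbitmq-server", dry_run: bool = True) -> List[List[str]]:
    """Run force_boot on the stopped local node, then start the service.

    Returns the commands in execution order; with dry_run=True nothing is executed.
    """
    commands = [
        ["rabbitmqctl", "force_boot"],    # must be run while the node is down
        ["systemctl", "start", service],  # assumption: systemd-managed host
    ]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError at the first failing command.
            subprocess.run(cmd, check=True)
    return commands
```

Defaulting to a dry run makes it possible to review the commands before anything touches the node.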