RabbitMQ cluster node removal operation may hang for ever as rabbitmqctl may hang
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | Fix Committed | High | Bogdan Dobrelya |
5.1.x | Won't Fix | High | Denis Meltsaykin |
6.0.x | Won't Fix | High | Denis Meltsaykin |
Bug Description
This bug is not easy to reproduce. I managed to reproduce it only after ~300 consecutive node failovers. The repro steps can be found here: https:/
The issue is that the following command may not work as expected (we expect that disconnecting a node helps to kick it from the cluster, but the disconnect may sometimes fail and return false):
# rabbitmqctl eval "disconnect_
and hangs forever, ending up in a situation where none of the RabbitMQ nodes can rejoin the cluster on failover, because they cannot be forgotten and join_cluster reports that they are already clustered.
Note that in this scenario the AMQP cluster remains completely down, because the nodes cannot join the mnesia master and the latter is running in a
broken state: rabbitmqctl list_channels hangs as well. Perhaps the only solution is to detect in the monitor whether list_channels hangs and to restart the
affected nodes. This will introduce full cluster downtime until a new mnesia master is elected, but it will at least ensure that the cluster is reassembled.
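The hang detection proposed above could be sketched roughly as follows. This is only an illustration, not the code of the actual fix: the function names `is_hang_status`, `run_with_limit` and `probe_hung` are hypothetical, and it assumes the coreutils `timeout` utility is available on the node. The probe command is passed in as arguments so that the check can be tried without a running RabbitMQ.

```shell
#!/bin/sh
# Sketch only: all function names below are hypothetical, not taken from the OCF script.

# Assumption: the probe is wrapped in coreutils `timeout`, which reports
# 124 when the limit expires and 137 (128 + SIGKILL) when KILL had to be sent.
is_hang_status() {
    [ "$1" -eq 124 ] || [ "$1" -eq 137 ]
}

# Run a command under a hard time limit; SIGKILL is sent once the limit expires.
# $1 = limit in seconds, remaining arguments = the command to run.
run_with_limit() {
    limit="$1"
    shift
    timeout -s KILL "$limit" "$@"
}

# Return 0 if the probe command looks hung (it was killed by the time limit),
# and 1 if it finished on its own, whatever its own exit status was.
probe_hung() {
    limit="$1"
    shift
    rc=0
    run_with_limit "$limit" "$@" >/dev/null 2>&1 || rc=$?
    is_hang_status "$rc"
}

# Possible use inside a monitor action (restart logic intentionally left out):
#   if probe_hung 60 rabbitmqctl list_channels; then
#       echo "list_channels hangs, the node should be restarted"
#   fi
```

A probe that merely fails (for example because the node is stopped) is deliberately not reported as a hang, since only the hang case calls for the forced restart described above.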
ISO info:
build_id: 2015-05-20_08-41-33
build_number: '441'
but the manifests were synced with the current master.
Changed in fuel: | |
status: | New → Confirmed |
importance: | Undecided → High |
assignee: | nobody → Bogdan Dobrelya (bogdando) |
milestone: | none → 6.1 |
summary: |
- RabbitMQ may hang on the cluster node removal
+ RabbitMQ cluster node removal operation may hang for ever |
description: | updated |
description: | updated |
description: | updated |
description: | updated |
description: | updated |
summary: |
- RabbitMQ cluster node removal operation may hang for ever
+ RabbitMQ cluster node removal operation may hang for ever as rabbitmqctl may hang |
Changed in fuel: | |
status: | Confirmed → In Progress |
tags: | added: ha rabbitmq |
Example of the lrmd.log:
2015-05-27T09:00:45.267143+00:00 info: INFO: p_rabbitmq-server: unjoin_nodes_from_cluster(): node 'rabbit@node-1' disconnected succesfully.
2015-05-27T09:01:36.174799+00:00 info: INFO: p_rabbitmq-server: unjoin_nodes_from_cluster(): Execute forget_cluster_node with timeout: 60
2015-05-27T09:02:36.220831+00:00 info: INFO: p_rabbitmq-server: su_rabbit_cmd(): the invoked command exited 137: /usr/sbin/rabbitmqctl forget_cluster_node rabbit@node-1
2015-05-27T09:02:36.224676+00:00 warning: WARNING: p_rabbitmq-server: unjoin_nodes_from_cluster(): unjoining node 'rabbit@node-1' failed.
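The exit status 137 in the log above is consistent with the command being killed by SIGKILL once the 60 second limit expired, since shells report death by signal as 128 plus the signal number. A small sketch of decoding such a status is shown below; the helper name `describe_exit` is hypothetical, and treating 124 as a timeout assumes the command was wrapped in coreutils `timeout`.

```shell
#!/bin/sh
# Sketch only: `describe_exit` is a hypothetical helper, not part of the OCF script.

# Translate an exit status into a short human-readable description.
describe_exit() {
    status="$1"
    if [ "$status" -eq 0 ]; then
        echo "success"
    elif [ "$status" -eq 124 ]; then
        # Assumption: the command was run under coreutils `timeout`.
        echo "timed out"
    elif [ "$status" -gt 128 ]; then
        # 128 + N means the process was terminated by signal number N.
        echo "terminated by signal $(kill -l "$((status - 128))")"
    else
        echo "failed with status $status"
    fi
}

describe_exit 137
```

For the status seen in the log, 137 - 128 = 9, which is SIGKILL.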