Set ha-promote-on-shutdown=always policy by default
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack RabbitMQ Server Charm | In Progress | Undecided | Trent Lloyd |
Bug Description
Rolling restarts of the RabbitMQ nodes can leave some queues without a master, hanging and not responding to client requests (clients get a timeout). Setting the ha-promote-on-shutdown=always policy avoids this in most cases.
This is one of a number of resolutions identified in Bug #1943937 "RabbitMQ queues often hang after charm config changes or rabbitmq restarts due to multiple causes (overview/
By default, a mirror is only promoted to queue master if it was synchronised with the old master when that master shut down. In cases where that is not so, such as a rolling restart of multiple nodes, it is possible that no node is synchronised and the queue gets stuck offline with no synchronised nodes. In this case a manual trigger is required to synchronise it.
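As a rough sketch of the behaviour being proposed (the vhost, policy name, and queue pattern below are illustrative assumptions, not necessarily what the charm actually sets), the policy can be applied manually with rabbitmqctl:

```shell
# Illustrative only: mirror all non-amq queues in the 'openstack' vhost and
# allow promotion of an unsynchronised mirror on master shutdown. The policy
# name "HA" and the vhost/pattern are assumptions for this example.
rabbitmqctl set_policy -p openstack HA '^(?!amq\.).*' \
    '{"ha-mode": "all", "ha-promote-on-shutdown": "always"}'
```

`rabbitmqctl list_policies -p openstack` can then be used to confirm the policy took effect.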
Having tested this change to the charm, the situation is still sometimes possible to reproduce but much more difficult. In one set of tests I could reproduce the issue with the defaults 3 out of 3 times. With this policy applied, 2 out of 3 runs worked, and in the third run all queues were still available but 285 of the 500 did not have all 3 mirrors running.
I believe that last case (some mirrors not running) is caused by an as-yet-unidentified RabbitMQ bug which I will try to address separately. But even in that case, a further restart of only 1 node recovered those queues back to a full set of mirrors. That is a much better situation than the current only workaround, which requires all nodes to be stopped at the same time and then started at the same time (a rolling restart will not fix the issue; the entire cluster must stop at the same time).
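When a queue is left with unsynchronised or missing mirrors, synchronisation can also be triggered manually per queue. A hedged sketch, reusing the 'openstack' vhost and the 'q-bgp-plugin' queue name that appear in the logs later in this report:

```shell
# Example only: manually trigger (re)synchronisation of one affected queue.
# The vhost and queue name are taken from the log excerpts in this report.
rabbitmqctl sync_queue -p openstack q-bgp-plugin
```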
With ha-promote-on-shutdown=always, an unsynchronised mirror can be promoted to master instead, so the queue stays available at the cost of any messages that had not yet been synchronised to that mirror.
This option is documented or recommended in the following resources:
(1) Upstream Documentation
https:/
(2) Recommended configuration in OpenStack Wiki
https:/
(3) Recommended solutions to queues being stuck down https:/
(4) Default used in puppet-tripleo:
https:/
https:/
(5) Upstream bug
https:/
[Test Case]
This issue is best reproduced on Bionic 18.04. It is harder to reproduce on Focal 20.04 due to a number of bug fixes but still possible particularly if you also have network partitions.
Generally speaking, restarting all 3 servers at approximately the same time is likely to trigger the issue. In some cases, especially where a cluster partition had previously occurred (even days ago), restarting only 1 or 2 of the servers may also trigger the situation.
I found the following scenario to reliably reproduce it most of the time I attempt it:
(in parallel at the same time)
rabbitmq-server/0: sudo systemctl restart rabbitmq-server
rabbitmq-server/1: sudo systemctl restart rabbitmq-server
(as soon as one of the above restarts returns to the command prompt)
rabbitmq-server/2: sudo systemctl restart rabbitmq-server
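The three steps above could be scripted; this is only a sketch, assuming a juju deployment with the unit names used in this report (`wait -n` needs bash 4.3+):

```shell
#!/bin/bash
# Sketch of the reproduction steps above (unit names assumed from the report).
juju ssh rabbitmq-server/0 'sudo systemctl restart rabbitmq-server' &
juju ssh rabbitmq-server/1 'sudo systemctl restart rabbitmq-server' &
wait -n   # as soon as one of the two restarts returns to the prompt...
juju ssh rabbitmq-server/2 'sudo systemctl restart rabbitmq-server'
wait      # let the remaining restart finish
```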
Depending on the speed of the underlying server and number of queues created (a basic openstack install seems to have around 500 queues), you may need to experiment a little with the exact timing. It can be reproduced with all cluster-
Changing one of the charm config options causes the charm to do such a rolling restart and is also likely to reproduce the issue. The default 30-second known-wait between restarts makes it slightly less reliable to reproduce than the above, but it still happens; it can depend a little on the speed and size of your environment. It’s a bit racy.
[Symptoms]
Some or all of the following occur:
(a) A random subset of the queues will hang. The queues still exist but disappear from the "rabbitmqctl list_queues -p openstack" output. They can only be seen if you enable the management plugin and query with rabbitmqadmin or the web interface, where they are listed but with no statistics or mirrors; essentially only the name is shown.
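One way to spot such queues from the command line, as a sketch (the exact columns available can vary by RabbitMQ version):

```shell
# Enable the management plugin, then list queues in the 'openstack' vhost via
# the management API. Hung queues show up here with empty statistics, while
# they are missing from "rabbitmqctl list_queues" output entirely.
sudo rabbitmq-plugins enable rabbitmq_management
rabbitmqadmin --vhost=openstack list queues name state
```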
(b) Clients fail to use or declare the queue. This action times out after 30 seconds and logs the following error on the server side:
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'q-bgp-plugin' in vhost 'openstack' due to timeout
(c) Errors like the following are logged to <email address hidden>
=ERROR REPORT==== 17-Sep-
Discarding message {'$gen_
(d) The queue has no active master due to the default ha-promote-on-shutdown=when-synced behaviour:
=WARNING REPORT==== 16-Sep-
Mirrored queue 'test_71' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
(e) In some cases the queue is alive but fails to synchronise 1 or more of its mirrors, and it now runs with reduced redundancy on only 2 or 1 nodes.
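Reduced redundancy can be checked per queue; a sketch using the classic mirrored-queue columns of list_queues:

```shell
# List each queue's mirror PIDs and which of those mirrors are synchronised.
# Fewer entries in synchronised_slave_pids than in slave_pids indicates a
# queue running with reduced redundancy.
rabbitmqctl list_queues -p openstack name slave_pids synchronised_slave_pids
```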
[Resolution]
We should add ha-promote-on-shutdown=always to the policy the charm sets by default.
Note that upstream documentation includes caveats about ha-promote-on-shutdown=always, since promoting an unsynchronised mirror can lose messages that had not yet been replicated to it.
I will follow-up with a changeset implementing this request and a nagios check which reliably reports these failures that can be used for testing the change.
Changed in charm-rabbitmq-server: | |
assignee: | nobody → Trent Lloyd (lathiat) |
tags: | added: sts |
description: | updated |
Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/813146