Comment 0 for bug 1943929

Trent Lloyd (lathiat) wrote:

Rolling restarts of the RabbitMQ nodes can leave some queues without a master, hanging and not responding to client requests (clients get a timeout). Setting the ha-promote-on-shutdown=always policy by default can make this much less likely to occur, and the setting is recommended by or used in various other OpenStack implementations and documents. We should consider making it the default.

By default (ha-promote-on-shutdown=when-synced), a mirror is only promoted to queue master if it was synchronised with the old master when that master shut down. Where that is not the case, such as during a rolling restart of multiple nodes, it is possible that none of the mirrors are synchronised and the queue gets stuck offline with no master. In that case a manual trigger is required to synchronise and recover it.
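For reference, the synchronisation state of the mirrors can be inspected, and a queue that still has a running master but unsynchronised mirrors can be manually synchronised, with commands along these lines (the openstack vhost and queue name are examples):

rabbitmqctl list_queues -p openstack name slave_pids synchronised_slave_pids
rabbitmqctl sync_queue -p openstack q-bgp-plugin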

Having tested this change to the charm, I found the situation is still sometimes possible to reproduce, but much more difficult. In one set of tests I could reproduce the issue with the defaults 3 out of 3 times. With this policy applied, 2 out of 3 times everything worked, and the third time all queues were still available but 285/500 of them did not have all 3 mirrors running.

I believe that last case (some mirrors not running) is caused by an as-yet-unidentified RabbitMQ bug which I will try to address separately. But even in that case, a further restart of only 1 node recovered those queues back to a full set of mirrors. That is a much better situation than the current only workaround of stopping all nodes at the same time and then starting them at the same time (a rolling restart will not fix the issue; the entire cluster must stop at the same time).

With ha-promote-on-shutdown=always a mirror will always be promoted even if it is not synchronised; in this way we value availability over avoiding message loss. Messages in OpenStack are generally transient and not useful a short time after they are sent (30-300 seconds), as they are typically either retransmitted as a retry or no longer valid. In contrast, once the message queues get stuck for a similar period of time, many useful messages such as VM boots, security group updates, etc. are lost, leaving neutron and other services in an even worse state than if the original messages the policy was trying to preserve had simply been dropped.
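For reference, applied manually rather than via the charm, the policy would look roughly like the following (the policy name and queue pattern are only illustrative; the charm manages its own HA policy):

rabbitmqctl set_policy -p openstack ha-all '^(?!amq\.).*' '{"ha-mode":"all","ha-promote-on-shutdown":"always"}'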

This option is documented or recommended in the following resources:
(1) Upstream Documentation
https://www.rabbitmq.com/ha.html "Stopping Nodes Hosting Queue Leader with Only Unsynchronised Mirrors"

(2) Recommended configuration in the OpenStack wiki
https://wiki.openstack.org/wiki/Large_Scale_Configuration_Rabbit (see the "Recommended configuration" section)

(3) Recommended solutions to queues being stuck down
https://github.com/rabbitmq/rabbitmq-server/issues/1607

(4) Default used in puppet-tripleo:
https://opendev.org/openstack/puppet-tripleo/commit/610c8d8d41cd4b6bfd228ce1012416e424db625d
https://bugs.launchpad.net/tripleo/+bug/1823305

(5) Upstream bug
https://github.com/rabbitmq/rabbitmq-server/issues/1607

[Test Case]
This issue is best reproduced on Bionic 18.04. It is harder to reproduce on Focal 20.04 due to a number of bug fixes, but still possible, particularly if you also have network partitions.

Generally speaking, restarting all 3 servers at approximately the same time is likely to trigger the issue. In some cases, especially where a cluster partition had previously occurred (even days ago), restarting only 1 or 2 of the servers may also trigger the situation.

I found that the following scenario reliably reproduces it most of the time I attempt it:

(in parallel at the same time)
rabbitmq-server/0: sudo systemctl restart rabbitmq-server
rabbitmq-server/1: sudo systemctl restart rabbitmq-server

(as soon as one of the above restarts returns to the command prompt)
rabbitmq-server/2: sudo systemctl restart rabbitmq-server

Depending on the speed of the underlying server and the number of queues created (a basic OpenStack install seems to have around 500 queues), you may need to experiment a little with the exact timing. It can be reproduced with all cluster-partition-handling settings, though the setting in use changes exactly how reproducible it is and with which timing.

Changing one of the charm config options causes the charm to perform such a rolling restart itself and is also likely to reproduce the issue. The default 30 second known-wait between restarts makes it slightly less reliable to reproduce than the above, but it still happens and can depend a little on the speed and size of your environment. It's a bit racy.
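For example, changing cluster-partition-handling should trigger such a rolling restart (the value below is only an example; adjust to suit your test environment):

juju config rabbitmq-server cluster-partition-handling=pause_minority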

[Symptoms]

Some or all of the following occur:

(a) A random subset of the queues will hang. The queues still exist but disappear from the "rabbitmqctl list_queues -p openstack" output. They can only be seen if you enable the management plugin and query with rabbitmqadmin or use the web interface; there you will see the queues listed but with no statistics or mirrors, essentially just the name (example commands for inspecting this are shown after this list).

(b) Clients fail to use or declare the queue. This action times out after 30 seconds and logs the following error on the server side:
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'q-bgp-plugin' in vhost 'openstack' due to timeout

(c) Errors like the following are logged to the RabbitMQ log file (rabbit@<hostname>.log):
=ERROR REPORT==== 17-Sep-2021::06:27:34 ===
Discarding message {'$gen_call',{<0.12580.0>,#Ref<0.2898216055.2000945153.142860>},stat} from <0.12580.0> to <0.2157.0> in an old incarnation (2) of this node (3)

(d) The queue has no active master due to the default ha-promote-on-shutdown=when-synced policy

=WARNING REPORT==== 16-Sep-2016::10:32:57 ===
Mirrored queue 'test_71' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

(e) In some cases the queue is alive but fails to synchronise 1 or more of the slaves, leaving it with reduced redundancy, running on only 2 or 1 nodes.
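
The hung queues from (a) and the reduced mirror counts from (e) can be inspected with commands along these lines (the vhost and credentials are examples, and the management plugin must be enabled for rabbitmqadmin):

sudo rabbitmq-plugins enable rabbitmq_management
rabbitmqadmin --username=<user> --password=<pass> --vhost=openstack list queues name node state
rabbitmqctl list_queues -p openstack name state slave_pids synchronised_slave_pids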

[Resolution]
We should add ha-promote-on-shutdown=always as a configuration option, with a default of true, as it will eliminate many of these outages that require manual intervention.

Note that upstream documentation advises against using ha-promote-on-shutdown=always with ha-mode=exactly. To my knowledge this is not commonly used by charm users, but we should exclude the setting in that case anyway.
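For reference, whether ha-mode=exactly is already in use can be checked against the existing policies before applying the setting, e.g.:

rabbitmqctl list_policies -p openstack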

I will follow up with a changeset implementing this request, and a Nagios check that reliably reports these failures which can be used for testing the change.