RabbitMQ queues often hang after charm config changes or rabbitmq restarts due to multiple causes (overview/co-ordination bug)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack RabbitMQ Server Charm | In Progress | High | Trent Lloyd |
Bug Description
[Issue]
RabbitMQ restarts, particularly rolling restarts of multiple nodes, can leave the queues in a bad state that is difficult to diagnose and difficult to recover from, requiring all nodes to be stopped simultaneously and then started again (a rolling restart does not resolve it).
This is often triggered by the charm itself as part of a config-changed event: all the servers get restarted 30 seconds apart (due to the default known-wait=30), combined with the charm currently re-applying some of the queue configuration (enabling mirroring, HA policies, etc.) at the same time.
This can be reliably and easily reproduced with any cluster-partition-handling setting.
This is happening frequently in production deployments (on a weekly basis), causing high severity cases and cloud downtime with a high impact to users. These issues have persisted for a long period of time and caused much confusion. I have attempted to comprehensively research and document them, and this bug is the result of that work. As you'll see from the data below, there are a large number of items related to this that require attention. This bug is intended as a 'covering bug' to document the various causes and spin off smaller bugs to fix relevant pieces. There is some overlap between the fixes, though, and it's possible not all will be required depending on which are accepted.
Please note that I appreciate this bug description is VERY long; however, the issue truly appears to be that complex. I will split each individual fix into a separate bug to handle its resolution, but wanted to track the overarching and inter-related situation somewhere.
[Test Case]
This issue is best reproduced on Bionic 18.04. It is harder to reproduce on Focal 20.04 due to a number of bug fixes but still possible particularly if you also have network partitions.
Generally speaking, restarting all 3 servers at approximately the same time is likely to trigger the issue. In some cases, especially where a cluster partition had previously occurred (even days ago), restarting only 1 or 2 of the servers may also trigger the situation.
I found the following scenario to reliably reproduce it most of the time I attempt it when used in an openstack-
(in parallel at the same time)
rabbitmq-server/0: sudo systemctl restart rabbitmq-server
rabbitmq-server/1: sudo systemctl restart rabbitmq-server
(as soon as one of the above restarts returns to the command prompt)
rabbitmq-server/2: sudo systemctl restart rabbitmq-server
Depending on the speed of the underlying server and the number of queues created (a basic openstack install seems to have around 500 queues), you may need to experiment a little with the exact timing. It can be reproduced with all cluster-partition-handling settings.
Changing one of the charm config options causes the charm to also do such a rolling restart and is also likely to reproduce the issue. The default 30 second known-wait between restarts makes it slightly less reliable to reproduce than the above, but it still happens, and can depend a little on the speed and size of your environment. It's a bit racy.
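For example, the following should kick off such a charm-driven rolling restart (known-wait is used here only because it is already discussed above; any option that causes config-changed to fire should behave the same way):
juju config rabbitmq-server known-wait=31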
[Symptoms]
A random subset of the queues will then hang. Some or all of the following symptoms are observed:
(a) The queues disappear from the output of "rabbitmqctl list_queues -p openstack" entirely even though they exist. The only way to notice their existence and broken state is via the Management Plugin REST API (consumed directly, via the web interface or via rabbitmqadmin). In that case the queues are listed but with no statistics or mirrors; basically only the name is listed.
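As a rough sketch of spotting these broken queues via the management API (assuming the management plugin is enabled, which it is not by default - see item (8) below - and substituting real admin credentials), the stuck queues are the ones returned with no state at all:
curl -s -u <admin-user>:<password> http://localhost:15672/api/queues/openstack | jq -r '.[] | select(.state == null) | .name'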
(b) Clients fail to use or declare the queue. This action times out after 30 seconds and logs the following error on the server side:
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'q-bgp-plugin' in vhost 'openstack' due to timeout
(c) “Old incarnation” errors like the following are persistently logged to /var/log/rabbitmq/rabbit@<hostname>.log:
=ERROR REPORT==== 17-Sep-
Discarding message {'$gen_
(d) The queue has no active master due to the default ha-promote-on-shutdown=when-synced behaviour:
=WARNING REPORT==== 16-Sep-
Mirrored queue 'test_71' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
(e) In theory, when the original master comes back, the queue should manage to go alive again; however, it seems that when the queue hangs as part of items (a) and (b) this original master gets stuck and can never recover, particularly on Bionic's 3.6.10. This seems less common on Focal's 3.8.2 but still happens.
(f) You cannot delete the queue in order to recreate it. Known bug fixed in 3.6.16. https:/
(g) In some cases the queue is alive but fails to synchronise 1 or more of the slaves, and it now has reduced redundancy, running on only 1 or 2 nodes. This also happens consistently on Focal's 3.8 as well as Bionic's 3.6.
[Recovery]
When this happens, a rolling restart of the cluster (systemctl restart rabbitmq-server) does not repair the situation, whether you restart the nodes 1 at a time or all 3 at the same time (the action most people take). If anything it makes things worse, as such a restart is generally the trigger of the problem in the first place.
The only way to recover from the situation is to stop all 3 nodes, wait until they are all stopped, then start all 3 again. This reliably recovers all of the queues.
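A sketch of that full-stop recovery using juju (the application name rabbitmq-server is assumed; adjust application/unit names to your model):
juju run --application rabbitmq-server 'systemctl stop rabbitmq-server'
(verify on each unit that the service has actually stopped before continuing)
juju run --application rabbitmq-server 'systemctl start rabbitmq-server'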
[Possible Solutions]
I found the following solutions/related bugs. I spent a significant amount of time reproducing and researching the situation and it seems this is caused by an aggregate of a large number of different bugs and possible configuration/charm changes. I will use this as a tracking bug for implementing related fixes in additional bugs focussed on each item.
(1) Move to Quorum Queues long term
In general the RabbitMQ project has documented multiple times that classic HA queues have a number of these problems that may not always be solved, and that we should move to "Quorum Queues", which use a proper consensus algorithm; Classic HA Queues are being deprecated. While there are some fixes to make the classic queues work better, we should look to add Quorum Queue support, particularly for newer releases.
See for example this bug about the 'old incarnation' messages, which in many ways states that basically they won't fix it and you should move to Quorum Queues instead:
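For illustration only (a hypothetical sketch, not something the charm or the OpenStack services do today), a client opts into a quorum queue at declaration time; the queue must be durable and carry the x-queue-type argument, and the fact that the type is fixed at declaration is part of why this needs a spec rather than a simple policy tweak:

import pika

# Hypothetical sketch: declare a quorum queue instead of a classic mirrored queue.
connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = connection.channel()
channel.queue_declare(
    queue='q-bgp-plugin',                    # illustrative name only, reused from the symptoms above
    durable=True,                            # quorum queues must be durable
    arguments={'x-queue-type': 'quorum'},    # switches the queue type away from classic
)
connection.close()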
https:/
Action Required: Propose a spec for Quorum Queue support and prioritise implementation. This will need to include a review of the quality of the feature in Focal's 3.8.2; it seems there are many bug fixes for this feature in later point releases, as it was new in this version.
(2) Set policy ha-promote-on-shutdown=always
By default, a node is only promoted to master if it is synchronised with the old master. In cases where that is not the case, such as a rolling restart of multiple nodes, it's possible none of the nodes are synchronised. In this case a manual trigger is required to synchronise it. This default favours consistency (not losing messages) over availability.
Using this option is recommended in numerous resources including puppet-tripleo, the OpenStack wiki and upstream bugs and documentation. A full list of those resources is included in the created bug linked below.
Having tested this change to the charm, it is possible but much more difficult to reproduce the situation with this fix applied. In one set of tests I could reproduce the issue with the default 3 out of 3 times. With this setting applied, 2 out of 3 times it worked, and the third time all queues were still available but 285/500 of them did not have all 3 mirrors running. Restarting only 1 of the nodes got those queues to all re-synchronise.
Thus I think we should go ahead and add ha-promote-on-shutdown=always to the policy applied by the charm by default.
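For reference, a minimal sketch of what the resulting policy looks like at the rabbitmqctl level (the policy name 'HA' and the pattern here are illustrative, not necessarily the charm's exact values):
rabbitmqctl set_policy -p openstack HA '^(?!amq\.).*' '{"ha-mode": "all", "ha-promote-on-shutdown": "always"}'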
Action Required: Implement this policy by default in the charm, work tracked in https:/
(3) RabbitMQ 3.8.2 in Focal generally handles this situation better than RabbitMQ 3.6.10 in Bionic.
It's still possible to reproduce, but much harder, and it usually affects fewer queues.
It seems to have a number of related bug fixes; however, I have had a lot of trouble nailing down exactly WHICH fix resolves it despite a few hours of research on the topic. We also need to evaluate the upstream stable 3.6.16 release (and 3.8.x point releases) to see if they contain any relevant fixes, and either backport those exact fixes or get a micro release exception for RabbitMQ 3.6.16. I am concerned that we don't have the Erlang expertise to properly backport the various fixes - some are simple but some had substantial code changes.
Note that RabbitMQ 3.6.16 upstream technically requires a newer Erlang version than that shipped in Bionic, although 3.6.15 still supports the Bionic Erlang. It seems this requirement was introduced under a newer policy of only supporting and testing against the last 2 years of Erlang releases, and I cannot see any indication that they actively believe an Erlang incompatibility exists.
Additionally, 3.6.16 notes 2 backwards-incompatible changes that seem minor/uncommon in practice but are a regression risk. At the same time, 3.6.16 also seems to contain a number of the related fixes.
Action Required: Test 3.6.16 to see if it works better in these scenarios, consider any relevant bugs for backport or get a micro release exception for RabbitMQ. If not possible, we may need to consider shipping a newer Erlang+RabbitMQ in the cloud-archive.
(4) Cannot delete queues without promotable master
Known issue fixed in RabbitMQ 3.6.16
https:/
https:/
Action Required: Open a bug to backport this change if possible, unless a 3.6.16 micro-release exception is granted.
(5) Upstream recommends avoiding rapid queue and mirror policy changes at the same time
From the comments in https:/
Currently the charm makes policy changes each time config-changed runs, which happens at the same time all the nodes are restarted and thus slaves are added and removed. In particular it is often still making these changes after its own restart, by which time 30 seconds have passed and the 2nd and 3rd nodes are doing their restarts.
This code should be improved to check and compare whether the policy actually changed and only apply it if that is true (set_ha_mode, set_policy).
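A minimal sketch of such a check (the helper names are hypothetical, not the charm's actual set_ha_mode/set_policy code; the parsing of list_policies output is deliberately loose because its column layout varies between RabbitMQ releases):

import json
import subprocess

def policy_matches(vhost, name, pattern, definition):
    """Return True if the named policy already carries the desired pattern and definition."""
    out = subprocess.check_output(
        ['rabbitmqctl', '-q', 'list_policies', '-p', vhost]).decode()
    for line in out.splitlines():
        fields = line.split('\t')
        if name not in fields:
            continue
        if pattern not in fields:
            return False
        for field in fields:
            try:
                if json.loads(field) == definition:
                    return True  # definition column matches, wherever it sits
            except ValueError:
                continue  # not the JSON definition column
        return False
    return False  # policy does not exist yet

def ensure_policy(vhost, name, pattern, definition):
    """Only call set_policy when the policy is missing or actually different."""
    if policy_matches(vhost, name, pattern, definition):
        return  # unchanged: avoid churning queue mirrors during a rolling restart
    subprocess.check_call(
        ['rabbitmqctl', 'set_policy', '-p', vhost, name, pattern,
         json.dumps(definition)])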
We could also consider having the restarts on secondary nodes take note of whether the cluster was stable when the hook started, and then, if it is no longer stable once the delay has passed, wait a bit longer (maybe 2-3 minutes) for the cluster to stabilise before doing its own local restart, as the 30 seconds often seem not to be enough. Or otherwise implement some other kind of restart synchronisation mechanism, or increase the default 30 second known-wait.
Action Required: Existing bug https:/
(6) Queue crashes on startup
Another user documented a queue crash on startup and proposed a fix. The fix was not accepted, mostly because they did not propose a revised version. However, this issue seems like it is possibly solved in Focal's 3.8.2 (and broken in Bionic's 3.6.10), but I could not locate the bug or commit that seemed to affect the same code area. More research is required.
https:/
Action Required: Determine which commit fixes this and consider backport
(7) There are some possible Erlang-related bugs
https:/
16.04 has Erlang OTP 18 (Xenial)
18.04 has Erlang OTP 20 (Bionic)
20.04 has Erlang OTP 22 (Focal)
21.04 has Erlang OTP 23 (Hirsute)
There are also 2 other known bugs in Erlang 21 that we may be able to fix:
<A TCP related bug, link to be added later>
<Another bug, will find link later>
<Consider reviewing all bugs in the OTP stable point releases for relevant bugs>
Action Required: Research related Erlang bugs further and consider backporting the fixes.
(8) There is no nagios check for this failure
Because the problems here are due to some of the queues being "stuck", the existing checks for cluster partitions do not detect the failure. Additionally, the checks that create and send messages on a test queue are not sufficient: a random subset of the queues is affected by this issue in my testing, so the nagios test queue may or may not work.
It is possible to detect this situation reliably using the Management API; unfortunately that is not enabled by default. The existing RabbitMQ Partitions nagios check also depends on management_plugin being enabled.
I have drafted a new nagios check that uses the management API to check that all of the queues are (a) alive and not stuck, (b) actually synchronised to all 3 nodes, and (c) have an active master. This reliably detects the issue in all cases I was able to reproduce, so at worst manual intervention can rescue the cluster.
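A minimal sketch of that style of check (not the actual drafted check; the URL, the credentials and the expectation of 3 mirrors are all assumptions to adjust per deployment):

#!/usr/bin/env python3
# Nagios-style sketch: flag queues that are stuck, missing a master, or under-mirrored.
import sys
import requests

API = 'http://localhost:15672/api/queues/openstack'   # management API, 'openstack' vhost
EXPECTED_MIRRORS = 3                                   # master + 2 slaves in a 3 node cluster

def main():
    queues = requests.get(API, auth=('nagios', 'secret'), timeout=30).json()
    problems = []
    for q in queues:
        mirrors = 1 + len(q.get('synchronised_slave_nodes', []))
        if q.get('state') != 'running':
            problems.append('%s: not running (stuck?)' % q['name'])
        elif not q.get('node'):
            problems.append('%s: no active master' % q['name'])
        elif mirrors < EXPECTED_MIRRORS:
            problems.append('%s: only %d synchronised mirrors' % (q['name'], mirrors))
    if problems:
        print('CRITICAL: %d broken queues: %s' % (len(problems), '; '.join(problems[:5])))
        return 2
    print('OK: %d queues running and fully mirrored' % len(queues))
    return 0

if __name__ == '__main__':
    sys.exit(main())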
Action Required: Enable management_plugin by default, implement new nagios check.
Tracked in the following 2 bugs
“Nagios does not detect queues which are not running or have no master”
https:/
"check_
https:/
Note: The management plugin appears to cause RabbitMQ to hang on Xenial-Queens specifically. It may not be possible to implement this fix for Xenial-Queens unless that is fixed. More information in the above bugs.
(9) queue_master_locator
The charm recently added support for setting queue_master_locator, which can hit a known upstream bug in the 3.6.x series shipped in Bionic.
Fixed in 3.7.5 upstream. Not backported to 3.6.x upstream.
https:/
https:/
Action Required: Backport the fix
(10) Revise the cluster-partition-handling default
There has been debate and multiple changes over time to the default cluster-partition-handling setting in the charm.
For example, the switch to autoheal was done in this bug, describing the exact same symptoms I describe here, which I have now shown are not related to the cluster-partition-handling setting:
https:/
This has led to a silly situation where there has been a refusal to change the default again, while the primary user of the charm (new OpenStack deployments) overrides the default anyway.
Now that we have a thorough understanding of the related issues, we may have enough data and justification to actually revise the default again, backed by some further data and testing. That default should most likely be pause_minority.
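If the default is revised, deployments can also opt in today via the charm's existing cluster-partition-handling option (shown with the value argued for above):
juju config rabbitmq-server cluster-partition-handling=pause_minority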
Action Required: Raise a bug to reconsider the cluster-partition-handling default.
Thank you for taking the time to do such a thorough analysis of these RabbitMQ issues.