Nagios does not detect queues which are not running or have no master

Bug #1943936 reported by Trent Lloyd
Affects: OpenStack RabbitMQ Server Charm
Status: New
Importance: Undecided
Assigned to: Trent Lloyd
Milestone: (none)

Bug Description

There are a number of scenarios that lead to queues either having no promoted master or being hung or stuck due to a variety of factors. These queues are not usable by clients (they will get a timeout), but the existing nagios checks do not detect this failure.

This is one of a number of resolutions identified in Bug #1943937 "RabbitMQ queues often hang after charm config changes or rabbitmq restarts due to multiple causes (overview/co-ordination bug)"

No cluster partition is reported, and only a random subset of the queues is affected by this issue, so checks that create and send messages on a test queue are not sufficient to monitor the health of all queues.

The "rabbitmqctl list_queues" command unfortunately totally excludes these stuck queues from it's list entirely. The only way I have found to detect this scenario is via the Management API which is exposed as a REST API and can be consumed directly (such as the existing check_rabbitmq_cluster does) or by the rabbitmqadmin command. Unfortunately the management API is not enabled by default. The existing check_rabbitmq_cluster nagios check also depends on management_plugin=true and is not enabled by default so we also have no nagios reporting of partitions by default either.

[Related Bugs]
"check_rabbitmq_cluster partition check is not enabled by default (due to management_plugin=false)”
https://bugs.launchpad.net/charm-rabbitmq-server/+bug/1930547

"Upgrade to RabbitMQ 3.6.10 causes beam lockup in clustered deployment"
https://bugs.launchpad.net/ubuntu/+source/rabbitmq-server/+bug/1783203

It seems the management plugin may be broken on Xenial specifically, hanging RabbitMQ if it is enabled. The xenial-queens cloud archive has a backported rabbitmq-server 3.6.10 from Bionic, but it uses Erlang OTP 18 from Xenial rather than Erlang OTP 20 from Bionic. This needs to be resolved, or the default enablement may need to be conditional on Bionic.

[Test Case]
This issue is best reproduced on Bionic 18.04. It is harder to reproduce on Focal 20.04 due to a number of bug fixes, but it is still possible, particularly if you also have network partitions.

Generally speaking, restarting all 3 servers at approximately the same time is likely to trigger the issue. In some cases, especially where a cluster partition had previously occurred (even days ago), restarting only 1 or 2 of the servers may also trigger the situation.

I found that the following scenario reliably reproduces it most of the time:

(in parallel at the same time)
rabbitmq-server/0: sudo systemctl restart rabbitmq-server
rabbitmq-server/1: sudo systemctl restart rabbitmq-server

(as soon as one of the above restarts returns to the command prompt)
rabbitmq-server/2: sudo systemctl restart rabbitmq-server

Depending on the speed of the underlying server and the number of queues created (a basic OpenStack install seems to have around 500 queues), you may need to experiment a little with the exact timing. It can be reproduced with all cluster-partition-handling settings, though the setting changes exactly which timing reproduces it and how reliably.

Changing one of the charm config options causes the charm to do a similar rolling restart and is also likely to reproduce the issue. The default 30-second known-wait between restarts makes it slightly less reliable to reproduce than the manual restarts above, but it still happens; it depends a little on the speed and size of your environment. It is a bit racy.
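
To confirm the reproduction took hold, a rough Python sketch comparing rabbitmqctl with the management API is below. The endpoint and the "nagios"/"secret" credentials are placeholder assumptions: queues that the API knows about but rabbitmqctl no longer lists are the stuck ones described in this bug.

#!/usr/bin/env python3
# Rough verification sketch.  Assumptions: management plugin enabled on
# localhost:15672, hypothetical "nagios"/"secret" user, vhost "openstack".
import subprocess
import requests

# What rabbitmqctl still reports (-q suppresses the informational header).
ctl = subprocess.run(
    ["rabbitmqctl", "-q", "list_queues", "-p", "openstack", "name"],
    stdout=subprocess.PIPE, universal_newlines=True, check=True)
ctl_queues = {line.strip() for line in ctl.stdout.splitlines() if line.strip()}

# What the management API reports.
resp = requests.get("http://localhost:15672/api/queues/openstack",
                    auth=("nagios", "secret"), timeout=30)
resp.raise_for_status()
api_queues = {q["name"] for q in resp.json()}

missing = sorted(api_queues - ctl_queues)
print("%d of %d queues missing from rabbitmqctl output"
      % (len(missing), len(api_queues)))
for name in missing:
    print("stuck:", name)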

[Symptoms]

Some or all of the following occur:

(a) A random subset of the queues will hang. The queues still exist but disappear from the "rabbitmqctl list_queues -p openstack" output. They can only be seen if you enable the management plugin and query with rabbitmqadmin or the web interface; the queues are then listed but with no statistics or mirrors, essentially only the name.

(b) Clients fail to use or declare the queue. This action times out after 30 seconds and logs the following error on the server side:
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'q-bgp-plugin' in vhost 'openstack' due to timeout

(c) Errors like the following are logged to the RabbitMQ server log:
=ERROR REPORT==== 17-Sep-2021::06:27:34 ===
Discarding message {'$gen_call',{<0.12580.0>,#Ref<0.2898216055.2000945153.142860>},stat} from <0.12580.0> to <0.2157.0> in an old incarnation (2) of this node (3)

(d) The queue has no active master due to the default ha-promote-on-shutdown=when-synced policy

=WARNING REPORT==== 16-Sep-2016::10:32:57 ===
Mirrored queue 'test_71' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

(e) In some cases the queue is alive but fails to synchronise one or more of the slaves, so it runs with reduced redundancy on only one or two nodes.

[Resolution]

I have drafted a new nagios check that uses the management API to check that all of the queues are (a) alive and not stuck, (b) have an active master, and (c) are actually synchronised to all 3 nodes. This reliably detects the issue in all cases I was able to reproduce.

Since it consumes the same information as the partition and node health checks in check_rabbitmq_cluster, I have expanded that check to implement this functionality.
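
To illustrate the idea (this is a sketch, not the actual charm change), the following Python outline performs the three checks against the management API with nagios-style exit codes. The endpoint, vhost and "nagios"/"secret" credentials are placeholder assumptions:

#!/usr/bin/env python3
# Sketch of the queue checks described above, in the spirit of the existing
# check_rabbitmq_cluster plugin.  Assumptions: management plugin enabled on
# localhost:15672, hypothetical "nagios"/"secret" user, vhost "openstack".
import sys
import requests

BASE = "http://localhost:15672/api"
AUTH = ("nagios", "secret")

def get(path):
    resp = requests.get(BASE + path, auth=AUTH, timeout=30)
    resp.raise_for_status()
    return resp.json()

nodes = {n["name"] for n in get("/nodes")}
problems = []

for q in get("/queues/openstack"):
    name = q["name"]
    # (a) alive and not stuck: stuck queues report no state at all
    if q.get("state") != "running":
        problems.append("%s not running (state=%s)" % (name, q.get("state")))
        continue
    # (b) active master: the "node" field names the queue's current master
    if not q.get("node"):
        problems.append("%s has no master" % name)
        continue
    # (c) synchronised everywhere: master + synchronised slaves == cluster size
    synced = 1 + len(q.get("synchronised_slave_nodes", []))
    if synced < len(nodes):
        problems.append("%s synchronised on %d/%d nodes"
                        % (name, synced, len(nodes)))

if problems:
    print("CRITICAL: " + "; ".join(problems[:5]))
    sys.exit(2)
print("OK: all queues running, mastered and synchronised")
sys.exit(0)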

I will follow up with a changeset implementing this new check. We also need to look at enabling the management_plugin by default, tracked in Bug #1930547.

Tags: sts
Trent Lloyd (lathiat)
Changed in charm-rabbitmq-server:
assignee: nobody → Trent Lloyd (lathiat)
Trent Lloyd (lathiat)
tags: added: sts
Trent Lloyd (lathiat)
description: updated