Nagios does not detect queues which are not running or have no master
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack RabbitMQ Server Charm | New | Undecided | Trent Lloyd |
Bug Description
There are a number of scenarios that lead to queues either having no promoted master or being hung or stuck in some way due to a variety of factors. These queues are not usable by clients (they will get a timeout), but the existing nagios checks do not detect this failure.
This is one of a number of resolutions identified in Bug #1943937 "RabbitMQ queues often hang after charm config changes or rabbitmq restarts due to multiple causes (overview/
No cluster partition is reported, and only a random subset of the queues is affected by this issue, so checks that create and send messages on a test queue are not sufficient to monitor the health of all queues.
The "rabbitmqctl list_queues" command unfortunately totally excludes these stuck queues from it's list entirely. The only way I have found to detect this scenario is via the Management API which is exposed as a REST API and can be consumed directly (such as the existing check_rabbitmq_
[Related Bugs]
"check_
https:/
"Upgrade to RabbitMQ 3.6.10 causes beam lockup in clustered deployment"
https:/
It seems the management plugin may be broken on Xenial specifically, hanging RabbitMQ if it is enabled. The xenial-queens cloud-archive has a backported rabbitmq-server 3.6.10 from Bionic, but it uses Erlang OTP 18 from Xenial rather than Erlang OTP 20 from Bionic. This needs to be resolved, or the default enablement may need to be conditional upon Bionic.
[Test Case]
This issue is best reproduced on Bionic 18.04. It is harder to reproduce on Focal 20.04 due to a number of bug fixes but still possible particularly if you also have network partitions.
Generally speaking, restarting all 3 servers at approximately the same time is likely to trigger the issue. In some cases, especially where a cluster partition had previously occurred (even days ago), restarting only 1 or 2 of the servers may also trigger the situation.
I found the following scenario to reliably reproduce it most of the time I attempt it:
(in parallel at the same time)
rabbitmq-server/0: sudo systemctl restart rabbitmq-server
rabbitmq-server/1: sudo systemctl restart rabbitmq-server
(as soon as one of the above restarts returns to the command prompt)
rabbitmq-server/2: sudo systemctl restart rabbitmq-server
Depending on the speed of the underlying server and the number of queues created (a basic OpenStack install seems to have around 500 queues), you may need to experiment a little with the exact timing. It can be reproduced with all cluster-
Changing one of the charm config options causes the charm to perform a similar rolling restart and is also likely to reproduce the issue. The default 30 second known-wait between restarts makes it slightly less reliable to reproduce than the above, but it still happens, depending a little on the speed and size of your environment. It's a bit racy.
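For convenience, the restart sequence above can also be driven from a client machine with juju. A rough sketch, assuming a standard 3-unit deployment with unit names rabbitmq-server/0 through rabbitmq-server/2; this is only one way to approximate the timing described:

# Sketch: restart rabbitmq-server on units 0 and 1 in parallel, then on
# unit 2 as soon as one of the first two returns. Unit names are assumed.
import subprocess
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def restart(unit):
    subprocess.run(
        ["juju", "ssh", unit, "sudo systemctl restart rabbitmq-server"],
        check=True,
    )

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(restart, u)
               for u in ("rabbitmq-server/0", "rabbitmq-server/1")]
    # As soon as the first restart finishes, restart the third unit.
    wait(futures, return_when=FIRST_COMPLETED)
    restart("rabbitmq-server/2")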
[Symptoms]
Some or all of the following occur:
(a) A random subset of the queues will hang. The queues still exist but disappear from the "rabbitmqctl list_queues -p openstack" output. They can only be seen if you enable the management plugin and query with rabbitmqadmin or use the web interface; the queues are then listed, but with no statistics or mirrors, so essentially only the name is shown.
(b) Clients fail to use or declare the queue (see the client sketch after this list). This action times out after 30 seconds and logs the following error on the server side:
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'q-bgp-plugin' in vhost 'openstack' due to timeout
(c) Errors like the following are logged to the RabbitMQ node log:
=ERROR REPORT==== 17-Sep-
Discarding message {'$gen_
(d) The queue has no active master due to the default ha-promote-
=WARNING REPORT==== 16-Sep-
Mirrored queue 'test_71' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
(e) In some cases the queue is alive but fails to synchronise one or more of the slaves, leaving it with reduced redundancy, running on only 2 or 1 nodes.
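Symptom (b) can be exercised directly from a client. A quick sketch using pika (an assumption; any AMQP client will do), with the broker address and credentials as placeholders and the queue name taken from the error above:

# Sketch: passively declare a queue to check it is usable. Against a stuck
# queue this is expected to block for around 30 seconds and then fail with
# a channel error. Broker address and credentials are placeholders.
import pika

params = pika.ConnectionParameters(
    host="10.0.0.10",
    virtual_host="openstack",
    credentials=pika.PlainCredentials("monitoring", "password"),
)
connection = pika.BlockingConnection(params)
channel = connection.channel()
channel.queue_declare(queue="q-bgp-plugin", passive=True)
print("queue is declarable and responding")
connection.close()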
[Resolution]
I have drafted a new nagios check that uses the management API to check that all of the queues (a) are alive and not stuck, (b) have an active master and (c) are actually synchronised to all 3 nodes. This reliably detects the issue in all cases I was able to reproduce.
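A minimal sketch of the shape such a check could take, assuming the management API on localhost:15672, placeholder credentials, the 'openstack' vhost and an expected replica count of 3 (none of these details are taken from the actual changeset):

#!/usr/bin/env python3
# Sketch of a nagios-style check of queue health via the management API.
# Endpoint, credentials, vhost and expected node count are assumptions.
import sys
import requests

EXPECTED_NODES = 3

def main():
    resp = requests.get(
        "http://localhost:15672/api/queues/openstack",
        auth=("monitoring", "password"),
        timeout=30,
    )
    resp.raise_for_status()
    problems = []
    for q in resp.json():
        name = q.get("name", "<unknown>")
        if q.get("state") != "running":
            # (a) queue is stuck: no running state reported
            problems.append(f"{name} is not running")
        elif not q.get("node"):
            # (b) queue has no active master node
            problems.append(f"{name} has no master")
        elif 1 + len(q.get("synchronised_slave_nodes", [])) < EXPECTED_NODES:
            # (c) master plus synchronised slaves is fewer than expected
            problems.append(f"{name} is not fully synchronised")
    if problems:
        print("CRITICAL: " + "; ".join(problems))
        return 2
    print("OK: all queues running, mastered and synchronised")
    return 0

if __name__ == "__main__":
    sys.exit(main())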
Since it consumes the same information as the partition and node health checks in check_rabbitmq_
I will follow up with a changeset implementing this new check. We also need to look at enabling the management_plugin by default in Bug #1930547.
Changed in charm-rabbitmq-server:
assignee: nobody → Trent Lloyd (lathiat)
tags: added: sts
description: updated