check_rabbitmq_cluster partition check is not enabled by default (due to management_plugin=false)

Bug #1930547 reported by Trent Lloyd
Affects: OpenStack RabbitMQ Server Charm
Status: In Progress
Importance: Undecided
Assigned to: Trent Lloyd
Milestone: (none)

Bug Description

The check_rabbitmq_cluster NRPE check, which checks for cluster partitions, is not enabled by default because it depends on management_plugin=true, which is also disabled by default. I propose that we resolve this by enabling management_plugin by default.

= Justification/Review of that change =

Partitions are a frequent source of problems in deployments, especially since the default is cluster_partition_handling=ignore. They do not self-resolve, are not otherwise visible (including in juju status), and in the most common deployment (OpenStack) they result in weird and hard-to-diagnose service failures, such as VMs or networks only partly working rather than failing outright.

I looked into why this check depends on the management plugin: the nagios checks run as the 'nrpe' user, which does not have access to run 'rabbitmqctl cluster_status'. The management plugin provides an HTTP API through which an unprivileged user can request the same information. Someone did contribute an alternative that runs cluster_status from cron, writing its output to a file that the nrpe check then reads, but it was abandoned and never reviewed (Bug #1548679, https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/538319). A minimal sketch of the management-API approach follows.
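
For illustration only (not the actual check_rabbitmq_cluster source), here is a minimal sketch of an NRPE-style partition check over the management API. The host, port, and credentials are placeholders; in practice the charm creates a monitoring-only user with a random password.

```python
#!/usr/bin/env python3
# Sketch of a partition check against the RabbitMQ management API.
# Host, port, and credentials below are placeholder assumptions.
import sys

import requests

NAGIOS_OK, NAGIOS_CRITICAL, NAGIOS_UNKNOWN = 0, 2, 3

def check_partitions(host="localhost", port=15672,
                     user="monitor-user", password="example-password"):
    try:
        resp = requests.get("http://{}:{}/api/nodes".format(host, port),
                            auth=(user, password), timeout=10)
        resp.raise_for_status()
    except requests.RequestException as exc:
        print("UNKNOWN: cannot reach management API: {}".format(exc))
        return NAGIOS_UNKNOWN
    # Each node reports a 'partitions' list naming peers it cannot see;
    # in a healthy cluster every list is empty.
    partitioned = {n["name"]: n["partitions"]
                   for n in resp.json() if n.get("partitions")}
    if partitioned:
        print("CRITICAL: cluster partitions detected: {}".format(partitioned))
        return NAGIOS_CRITICAL
    print("OK: no cluster partitions")
    return NAGIOS_OK

if __name__ == "__main__":
    sys.exit(check_partitions())
```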

The management API is also generally useful to have enabled: it exposes statistics and information that help with support cases, such as which queues and users are busy, and we have sometimes wanted it during a support case. The API is exposed to the network over HTTP and currently has no (at least charmed) SSL support; however, the charm sets up an authenticated user with a random password, that user only has 'monitoring' access, and no users with the administrator tag are created by default.
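
As a sketch of that kind of support triage (again with placeholder host and credentials), the same API can list the busiest queues:

```python
# Sketch: list the deepest queues via the management API's /api/queues
# endpoint. Host, port, and credentials are placeholder assumptions;
# the charm's generated monitoring user would be used in practice.
import requests

def busiest_queues(host="localhost", port=15672,
                   user="monitor-user", password="example-password",
                   top=10):
    resp = requests.get("http://{}:{}/api/queues".format(host, port),
                        auth=(user, password), timeout=10)
    resp.raise_for_status()
    queues = resp.json()
    # 'messages' is the total of ready + unacknowledged messages.
    queues.sort(key=lambda q: q.get("messages", 0), reverse=True)
    for q in queues[:top]:
        print("{}/{}: {} messages, {} consumers".format(
            q["vhost"], q["name"],
            q.get("messages", 0), q.get("consumers", 0)))

if __name__ == "__main__":
    busiest_queues()
```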

So I think it is safe and sensible to enable by default. It will result in an extra network service appearing after a charm upgrade, which should be considered, but overall I think it would be a positive change, especially as the API is otherwise useful and the check is genuinely critical.

Tags: seg sts
Revision history for this message
Trent Lloyd (lathiat) wrote :

I'm happy to propose the merge request to toggle the change, but wanted input on whether there are any objections to doing so.

tags: added: seg sts
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Is this a dup of https://bugs.launchpad.net/charm-rabbitmq-server/+bug/1930417 ? Perhaps we could merge them?

Revision history for this message
Liam Young (gnuoy) wrote :

It's not a dupe in my opinion. I think the charm should check for partitions, but I don't think the charm's check should depend on whether the management plugin is enabled or whether the check_rabbitmq_cluster nrpe check is working. I think the charm's check should make a direct call to `rabbitmqctl cluster_status`.
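
As a hedged sketch of that direct approach (not the charm's actual code): since the charm's own hooks run as root, they can call rabbitmqctl directly. This assumes a rabbitmq-server new enough (3.8+) that rabbitmqctl supports `--formatter json`; older releases emit Erlang terms that would need separate parsing.

```python
# Sketch only: query cluster status directly via rabbitmqctl, as a
# root-run charm check could. Assumes rabbitmqctl supports the JSON
# formatter (rabbitmq-server 3.8+).
import json
import subprocess

def cluster_partitions():
    out = subprocess.check_output(
        ["rabbitmqctl", "cluster_status", "--formatter", "json"])
    status = json.loads(out)
    # 'partitions' maps each node to the peers it cannot reach;
    # an empty value means the cluster is whole.
    return status.get("partitions") or {}

if __name__ == "__main__":
    partitions = cluster_partitions()
    if partitions:
        raise SystemExit("cluster partitioned: {}".format(partitions))
    print("no partitions detected")
```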

Revision history for this message
Drew Freiberger (afreiberger) wrote (last edit):

As a reminder, the management plugin crashes Queens+ versions of rabbitmq when running in a cluster.

It would be good to test the management plugin in a large environment with the new Focal version.

https://bugs.launchpad.net/charm-rabbitmq-server/+bug/1783203

Revision history for this message
Trent Lloyd (lathiat) wrote :

I suspect the above might only be true on Xenial; it has seemed to work OK on Bionic.

It seems that erlang is not backported into the xenial-queens repo (though rabbitmq-server is), so that may be a notable factor.

But at the very least we may need to conditionalise this default on Bionic or later, if testing confirms that. Thanks for the note, as I was not aware of that bug.

Revision history for this message
Drew Freiberger (afreiberger) wrote :

Very good point. I know the environment where we were able to reproduce the RMQ crash by enabling the mgmt plugin; it was on Xenial and has since been upgraded to Bionic. I'm sure we could test bionic + management plugin to confirm the issue is not present on modern operating systems. If that is the case, we could update the charm to disable management_plugin on Xenial/Trusty but allow it on Bionic, which would then allow re-implementation of the original monitoring solution. A hypothetical sketch of that conditional default follows.
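
For illustration, a hypothetical sketch of conditionalising the default per Ubuntu series, using charmhelpers (which the charm already depends on); the helper name and constant here are illustrative, not the eventual implementation:

```python
# Illustrative only: gate the management_plugin default on the Ubuntu
# series, keeping it off where the crash was reproduced (xenial and
# earlier) and on from bionic onward.
from charmhelpers.core.host import lsb_release

# Series where bug #1783203 was reproduced with the plugin enabled
# (assumption for this sketch; to be confirmed by testing).
SERIES_WITHOUT_MGMT_PLUGIN = ("trusty", "xenial")

def management_plugin_default():
    series = lsb_release()["DISTRIB_CODENAME"]
    return series not in SERIES_WITHOUT_MGMT_PLUGIN
```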

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (master)
Changed in charm-rabbitmq-server:
status: New → In Progress
Trent Lloyd (lathiat)
Changed in charm-rabbitmq-server:
assignee: nobody → Trent Lloyd (lathiat)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (master)

Reviewed: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/819134
Committed: https://opendev.org/openstack/charm-rabbitmq-server/commit/df711c6717fffdd6d4285b5b2b318ead90fa51fa
Submitter: "Zuul (22348)"
Branch: master

commit df711c6717fffdd6d4285b5b2b318ead90fa51fa
Author: Liam Young <email address hidden>
Date: Wed Nov 24 15:46:36 2021 +0000

    Switch to enabling the management plugin by default

    Over time the management plugin has become a core part of managing
    a rabbit deployment. This includes allowing tools such as nrpe to
    be able to query the api and alert for situations such as orphaned
    queues.

    Change-Id: Icbf760610ce83b9d95f48e99f6607ddf23963c97
    Partial-Bug: 1930547
