check_rabbitmq_cluster partition check is not enabled by default (due to management_plugin=false)

Bug #1930547 reported by Trent Lloyd
Affects: OpenStack RabbitMQ Server Charm
Status: In Progress
Importance: Undecided
Assigned to: Trent Lloyd
Milestone: (none)

Bug Description

The check_rabbitmq_cluster NRPE check, which checks for cluster partitions, is not enabled by default because it depends on management_plugin=true, which is also disabled by default. I propose that we resolve this by enabling management_plugin by default.

= Justification/Review of that change =

Partitions are a frequent source of problems in deployments, especially since the default is cluster_partition_handling=ignore. They do not self-resolve, are not otherwise visible (including in juju status), and in the most common deployment (OpenStack) they result in weird and hard-to-diagnose service failures, such as VMs or networks only partly working rather than failing outright.

I looked into why this check depends on the management plugin: the nagios checks run as the 'nrpe' user, which does not have access to run 'rabbitmqctl cluster_status'. The management plugin provides an HTTP API through which an unprivileged user can request the same information. Someone did contribute an alternative that runs cluster_status from cron, writing its output to a file that the nrpe check then reads, but it was abandoned and never reviewed (Bug #1548679, https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/538319). A minimal sketch of the management-API approach follows.
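
For illustration only (not the actual check_rabbitmq_cluster source), here is a minimal sketch of an NRPE-style partition check over the management API. The host, port, and credentials are placeholders; in practice the charm creates a monitoring-only user with a random password.

```python
#!/usr/bin/env python3
# Sketch of a partition check against the RabbitMQ management API.
# Host, port, and credentials below are placeholder assumptions.
import sys

import requests

NAGIOS_OK, NAGIOS_CRITICAL, NAGIOS_UNKNOWN = 0, 2, 3

def check_partitions(host="localhost", port=15672,
                     user="monitor-user", password="example-password"):
    try:
        resp = requests.get("http://{}:{}/api/nodes".format(host, port),
                            auth=(user, password), timeout=10)
        resp.raise_for_status()
    except requests.RequestException as exc:
        print("UNKNOWN: cannot reach management API: {}".format(exc))
        return NAGIOS_UNKNOWN
    # Each node reports a 'partitions' list naming peers it cannot see;
    # in a healthy cluster every list is empty.
    partitioned = {n["name"]: n["partitions"]
                   for n in resp.json() if n.get("partitions")}
    if partitioned:
        print("CRITICAL: cluster partitions detected: {}".format(partitioned))
        return NAGIOS_CRITICAL
    print("OK: no cluster partitions")
    return NAGIOS_OK

if __name__ == "__main__":
    sys.exit(check_partitions())
```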

The management API is also generally useful to have enabled: it exposes statistics and information that help with support cases, such as which queues and users are busy, and we have sometimes wanted it during a support case. The API is exposed to the network over HTTP and currently has no (at least charmed) SSL support; however, the charm sets up an authenticated user with a random password, that user only has 'monitoring' access, and no users with the administrator tag are created by default.
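
As a sketch of that kind of support triage (again with placeholder host and credentials), the same API can list the busiest queues:

```python
# Sketch: list the deepest queues via the management API's /api/queues
# endpoint. Host, port, and credentials are placeholder assumptions;
# the charm's generated monitoring user would be used in practice.
import requests

def busiest_queues(host="localhost", port=15672,
                   user="monitor-user", password="example-password",
                   top=10):
    resp = requests.get("http://{}:{}/api/queues".format(host, port),
                        auth=(user, password), timeout=10)
    resp.raise_for_status()
    queues = resp.json()
    # 'messages' is the total of ready + unacknowledged messages.
    queues.sort(key=lambda q: q.get("messages", 0), reverse=True)
    for q in queues[:top]:
        print("{}/{}: {} messages, {} consumers".format(
            q["vhost"], q["name"],
            q.get("messages", 0), q.get("consumers", 0)))

if __name__ == "__main__":
    busiest_queues()
```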

So I think it is safe and sensible to enable by default. It will result in an extra network service appearing after a charm upgrade, which should be considered, but overall I think it would be a positive change, especially as the API is otherwise useful and the check is genuinely critical.

Tags: seg sts
Revision history for this message
Trent Lloyd (lathiat) wrote :

I'm happy to propose the merge request to toggle the change, but wanted input on whether there are any objections to doing so.

tags: added: seg sts
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Is this a dup of https://bugs.launchpad.net/charm-rabbitmq-server/+bug/1930417 ? Perhaps we could merge them?

Revision history for this message
Liam Young (gnuoy) wrote :

It's not a dupe in my opinion. I think the charm should check for partitions, but I don't think the charm's check should depend on whether the management plugin is enabled or whether the check_rabbitmq_cluster nrpe check is working. I think the charm's check should make a direct call to `rabbitmqctl cluster_status`.
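
As a hedged sketch of that direct approach (not the charm's actual code): since the charm's own hooks run as root, they can call rabbitmqctl directly. This assumes a rabbitmq-server new enough (3.8+) that rabbitmqctl supports `--formatter json`; older releases emit Erlang terms that would need separate parsing.

```python
# Sketch only: query cluster status directly via rabbitmqctl, as a
# root-run charm check could. Assumes rabbitmqctl supports the JSON
# formatter (rabbitmq-server 3.8+).
import json
import subprocess

def cluster_partitions():
    out = subprocess.check_output(
        ["rabbitmqctl", "cluster_status", "--formatter", "json"])
    status = json.loads(out)
    # 'partitions' maps each node to the peers it cannot reach;
    # an empty value means the cluster is whole.
    return status.get("partitions") or {}

if __name__ == "__main__":
    partitions = cluster_partitions()
    if partitions:
        raise SystemExit("cluster partitioned: {}".format(partitions))
    print("no partitions detected")
```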

Revision history for this message
Drew Freiberger (afreiberger) wrote (last edit):

As a reminder, the management plugin crashes Queens+ versions of rabbitmq when running in a cluster.

It would be good to test the management plugin in a large environment with the new Focal version.

https://bugs.launchpad.net/charm-rabbitmq-server/+bug/1783203

Revision history for this message
Trent Lloyd (lathiat) wrote :

I suspect the above might only be true on Xenial; it has seemed to work OK on Bionic.

It seems that erlang is not backported into the xenial-queens repo (though rabbitmq-server is), so that may be a notable factor.

But at the very least we may need to conditionalise this default on Bionic or later, if testing confirms that. Thanks for the note, as I was not aware of that bug.

Revision history for this message
Drew Freiberger (afreiberger) wrote :

Very good point. I know the environment where we were able to reproduce the RMQ crash by enabling the mgmt plugin; it was on Xenial and has since been upgraded to Bionic. I'm sure we could test bionic + management plugin to confirm the issue is not present on modern operating systems. If that is the case, we could update the charm to disable management_plugin on Xenial/Trusty but allow it on Bionic, which would then allow re-implementation of the original monitoring solution. A hypothetical sketch of that conditional default follows.
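
For illustration, a hypothetical sketch of conditionalising the default per Ubuntu series, using charmhelpers (which the charm already depends on); the helper name and constant here are illustrative, not the eventual implementation:

```python
# Illustrative only: gate the management_plugin default on the Ubuntu
# series, keeping it off where the crash was reproduced (xenial and
# earlier) and on from bionic onward.
from charmhelpers.core.host import lsb_release

# Series where bug #1783203 was reproduced with the plugin enabled
# (assumption for this sketch; to be confirmed by testing).
SERIES_WITHOUT_MGMT_PLUGIN = ("trusty", "xenial")

def management_plugin_default():
    series = lsb_release()["DISTRIB_CODENAME"]
    return series not in SERIES_WITHOUT_MGMT_PLUGIN
```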

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (master)
Changed in charm-rabbitmq-server:
status: New → In Progress
Trent Lloyd (lathiat)
Changed in charm-rabbitmq-server:
assignee: nobody → Trent Lloyd (lathiat)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (master)

Reviewed: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/819134
Committed: https://opendev.org/openstack/charm-rabbitmq-server/commit/df711c6717fffdd6d4285b5b2b318ead90fa51fa
Submitter: "Zuul (22348)"
Branch: master

commit df711c6717fffdd6d4285b5b2b318ead90fa51fa
Author: Liam Young <email address hidden>
Date: Wed Nov 24 15:46:36 2021 +0000

    Switch to enabling the management plugin by default

    Over time the management plugin has become a core part of managing
    a rabbit deployment. This includes allowing tools such as nrpe to
    be able to query the api and alert for situations such as orphaned
    queues.

    Change-Id: Icbf760610ce83b9d95f48e99f6607ddf23963c97
    Partial-Bug: 1930547
