OpenStack RabbitMQ Server Charm

rabbitmq-server died abruptly

Bug #1747347 reported by Tejeev Patel on 2018-02-05

This bug affects 4 people

| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| OpenStack RabbitMQ Server Charm | Triaged | High | Unassigned | |

Bug Description

We saw a rabbitmq-server unit, the leader of a cluster with 2 other units, die abruptly. Cluster status on the other units remained OK, apart from their inability to communicate with the downed node. We were unable to determine a root cause. We have seen this happen on 3 other occasions; on one of them we observed that the jujud agent was in an error state before we started the rabbit service and in an executing state afterwards. I do not know whether that was the case this time.

== LOGS ETC. ==

Service status after failure:

https://pastebin.canonical.com/p/nbcGBmr2b5/

================================

From the juju unit log, I see repetitions similar to the first section (02:05:42-02:05:44) going back for days. At the time of the failure, 02:05:45, it abruptly switches to reporting an inability to connect. This section of entries (02:05:45-02:10:49) basically repeats until the service is manually restarted around 02:46:18, after which it goes back to the WARNING and entries resembling the first section:

https://pastebin.canonical.com/p/ZGjtM647JZ/

================================

In the rabbit logs (/<email address hidden>) we see logging stop abruptly at the time of death and then resume when the unit starts up at recovery:

https://pastebin.canonical.com/p/vd5kj3CvDf/

================================

syslog from time of failure to time of recovery:

https://pastebin.canonical.com/p/pVpFTVHXC3/

Tejeev Patel (tejeevpatel) wrote on 2018-02-05:

This cloud is running juju 2.2.4-xenial-amd64

Alvaro Uria (aluria) on 2018-05-30

tags:	added: canonical-bootstack

Paul Gear (paulgear) wrote on 2018-08-07:

Moved logs to pastebins for easier perusal.

@openstack-charmers, any chance of triage and suggestions for next steps?

description:	updated

Paul Gear (paulgear) wrote on 2018-08-07:

Further to the above, this problem is still occurring and causing rabbitmq crashes. Juju unit logs show this on the leader: https://pastebin.canonical.com/p/m4WcRrnhDr/ and this on the non-leaders: https://pastebin.canonical.com/p/hs9NKfZ59z/

Chris MacNaughton (chris.macnaughton) wrote on 2019-05-13:

Is it possible to get a reproducer bundle? Alternatively, would a bundle like:

```
applications:
  rabbit:
    charm: cs:openstack-charmers-next/rabbitmq-server
    num_units: 3
```

be enough of a reproducer? Without a reproducer, it'll be difficult to track down what this issue could be, and whether it is a potential charm or upstream issue.

Looking briefly through the Juju unit logs, I notice the line:

min-cluster-size is not defined, race conditions may occur if this is not a single unit deployment.

In a clustered deployment, the min-cluster-size configuration option is fairly important, as it tells the charm what to expect with regard to cluster sizing. I don't think it would cause random failures like those seen here, but I'm not 100% sure of the implications of having it set incorrectly post-bootstrap.
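
For reference, a minimal sketch of how the three-unit reproducer bundle above might set this option. It assumes the option is supplied through the bundle's `options` key and that the appropriate value equals the number of units; neither has been confirmed as relevant to this bug.

```yaml
applications:
  rabbit:
    charm: cs:openstack-charmers-next/rabbitmq-server
    num_units: 3
    options:
      # Assumption: the value matches num_units, so the charm knows
      # how many units to expect before treating the cluster as complete.
      min-cluster-size: 3
```

On an already deployed cluster, the equivalent would presumably be something like `juju config rabbit min-cluster-size=3`, where `rabbit` is the application name used in the bundle above and may differ in the affected cloud.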

Changed in charm-rabbitmq-server:
importance:	Undecided → High
status:	New → Triaged
