rabbitmq server fails to start after cluster reboot

Bug #1828988 reported by Jason Hobbs
This bug affects 4 people
Affects: OpenStack RabbitMQ Server Charm
Status: Fix Released
Importance: High
Assigned to: Nicolas Bock
Milestone: 21.10

Bug Description

After rebooting an entire fcb cluster (shutdown -r on all nodes), my rabbitmq cluster failed to come back up.

rabbitmqctl cluster_status:

http://paste.ubuntu.com/p/hh4GV2BJ8R/

juju status for rabbitmq-server:
http://paste.ubuntu.com/p/ptrJSrHGkG/

bundle:
http://paste.ubuntu.com/p/k35TTVp3Ps/

Reproducer 1 (tested on charm rev 102):
Results in:
Unit Workload Agent Machine Public address Ports Message
rabbitmq-server/2 waiting idle 2 10.5.0.13 5672/tcp Unit has peers, but RabbitMQ not clustered
rabbitmq-server/3 error idle 3 10.5.0.4 5672/tcp hook failed: "cluster-relation-changed"
rabbitmq-server/4* error idle 4 10.5.0.20 5672/tcp hook failed: "update-status"

Howto:
juju deploy -n 3 --config min-cluster-size=3 rabbitmq-server
juju wait (may need snap install juju-wait first)
openstack server stop juju-98eb54-default-4 juju-98eb54-default-3 juju-98eb54-default-2
openstack server start juju-98eb54-default-4; sleep 150; openstack server start juju-98eb54-default-3; sleep 150; openstack server start juju-98eb54-default-2
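
To confirm the failure state after the staggered start, something like the following can be run (a sketch; the --application form of juju run is an assumption, adjust to the deployment):

juju status rabbitmq-server
juju run --application rabbitmq-server 'sudo rabbitmqctl cluster_status'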

As mentioned in the comments, there may be multiple timings that can trigger this failure.

description: updated
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

I cannot reproduce this. I have documented the steps I took to try to reproduce it in the attached log.

Changed in charm-rabbitmq-server:
status: New → Incomplete
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

Can you provide a minimal reproducing bundle and steps to reproduce?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Here is the tarball.

http://people.canonical.com/~jhobbs/juju-crashdump-openstack-2019-05-14-09.44.41.tar.gz

I linked to a bundle in the bug description.

I deployed the bundle, waited for everything to settle with juju wait.

Then, I ran juju run --all -m foundations-maas:openstack "sudo shutdown -r 1", and waited for all the machines to come back up.

Changed in charm-rabbitmq-server:
status: Incomplete → New
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

I have partially reproduced this with a much smaller bundle:

series: bionic
applications:
  rabbitmq-server:
    charm: cs:rabbitmq-server-89
    num_units: 3
    options:
      min-cluster-size: 3

I only saw this error on one of the three units, making me think that this may be an issue where the units coming up take too long to see each other, so the first one to come up gives up and dies.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

- MachineId: "0"
  ReturnCode: 69
  Stderr: |+
    Error: unable to connect to node 'rabbit@juju-dad1a1-rabbit1-0': nodedown

    DIAGNOSTICS
    ===========

    attempted to contact: ['rabbit@juju-dad1a1-rabbit1-0']

    rabbit@juju-dad1a1-rabbit1-0:
      * connected to epmd (port 4369) on juju-dad1a1-rabbit1-0
      * epmd reports: node 'rabbit' not running at all
                      no other nodes on juju-dad1a1-rabbit1-0
      * suggestion: start the node

    current node details:
    - node name: 'rabbitmq-cli-24@juju-dad1a1-rabbit1-0'
    - home dir: /var/lib/rabbitmq
    - cookie hash: 5OSqxVkVzK1bM9XxGU2uPw==

  Stdout: |
    Cluster status of node 'rabbit@juju-dad1a1-rabbit1-0'
- MachineId: "2"
  Stdout: |
    Cluster status of node 'rabbit@juju-dad1a1-rabbit1-2'
    [{nodes,[{disc,['rabbit@juju-dad1a1-rabbit1-0','rabbit@juju-dad1a1-rabbit1-1',
                    'rabbit@juju-dad1a1-rabbit1-2']}]},
     {running_nodes,['rabbit@juju-dad1a1-rabbit1-1',
                     'rabbit@juju-dad1a1-rabbit1-2']},
     {cluster_name,<<"rabbit@juju-dad1a1-rabbit1-0">>},
     {partitions,[]},
     {alarms,[{'rabbit@juju-dad1a1-rabbit1-1',[]},
              {'rabbit@juju-dad1a1-rabbit1-2',[]}]}]
- MachineId: "1"
  Stdout: |
    Cluster status of node 'rabbit@juju-dad1a1-rabbit1-1'
    [{nodes,[{disc,['rabbit@juju-dad1a1-rabbit1-0','rabbit@juju-dad1a1-rabbit1-1',
                    'rabbit@juju-dad1a1-rabbit1-2']}]},
     {running_nodes,['rabbit@juju-dad1a1-rabbit1-2',
                     'rabbit@juju-dad1a1-rabbit1-1']},
     {cluster_name,<<"rabbit@juju-dad1a1-rabbit1-0">>},
     {partitions,[]},
     {alarms,[{'rabbit@juju-dad1a1-rabbit1-2',[]},
              {'rabbit@juju-dad1a1-rabbit1-1',[]}]}]
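
(For reference: per-machine YAML like the above is what juju run emits; a sketch of the sort of invocation that collects it across all units, with the exact flags assumed rather than copied from my session:)

juju run --application rabbitmq-server --format yaml 'sudo rabbitmqctl cluster_status'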

Changed in charm-rabbitmq-server:
status: New → Confirmed
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

Model Controller Cloud/Region Version SLA Timestamp
rabbit1 icey-serverstack serverstack/serverstack 2.5.4 unsupported 11:31:16Z

App Version Status Scale Charm Store Rev OS Notes
rabbitmq-server 3.6.10 error 3 rabbitmq-server jujucharms 89 ubuntu

Unit Workload Agent Machine Public address Ports Message
rabbitmq-server/0* error idle 0 10.5.0.4 5672/tcp hook failed: "update-status"
rabbitmq-server/1 active idle 1 10.5.0.18 5672/tcp Unit is ready and clustered
rabbitmq-server/2 error idle 2 10.5.0.15 5672/tcp hook failed: "leader-settings-changed"

Machine State DNS Inst id Series AZ Message
0 started 10.5.0.4 0ce1627c-8950-4ee7-b8a1-ee9f289e55f4 bionic nova ACTIVE
1 started 10.5.0.18 b3b6edc9-3910-49f2-907e-3722c5a4d7c5 bionic nova ACTIVE
2 started 10.5.0.15 f9e34a91-7bb8-40c0-b937-710ab1af74f0 bionic nova ACTIVE

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

It's interesting to see that Juju had a hook failure (unit /2) even though the rabbit server ended up happily clustered.
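
If the broker itself is healthy and only the hook is stuck in error, the failed hook can be retried by hand, e.g. (a sketch, using the unit from the status above):

juju resolved rabbitmq-server/2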

tags: added: reboot-fail
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Might be related to the fact that Juju provisions machines with manage_etc_hosts set to True, so any changes made by the charm to /etc/hosts will be overridden on reboot.

grep -RiP manage_etc_hosts
cloudconfig/cloudinit/cloudinit.go: cfg.SetAttr("manage_etc_hosts", true)
cloudconfig/cloudinit/cloudinit.go: cfg.UnsetAttr("manage_etc_hosts")
cloudconfig/cloudinit/cloudinit_test.go: map[string]interface{}{"manage_etc_hosts": true},
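
A quick way to check this on a deployed unit (a sketch; the cloud-init paths are the usual ones, not verified here):

juju run --unit rabbitmq-server/0 'sudo grep -Ri manage_etc_hosts /etc/cloud /var/lib/cloud/instance/'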

Revision history for this message
Bryan Quigley (bryanquigley) wrote :

Seeing something very similar in a current deployment. Why is this related to management of hosts files?

Revision history for this message
Bryan Quigley (bryanquigley) wrote :

Indeed, once we manually added the hostnames/IPs of the other nodes to /etc/hosts on all the rabbitmq nodes, the cluster came back up.

One node had the hosts file correct beforehand. So clearly the info is there; why isn't Juju making it work for the other nodes?
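
For concreteness, each node needs /etc/hosts entries for its peers, roughly of this shape (hostnames and addresses below are made up for illustration):

10.0.0.11 rabbitmq-node-1
10.0.0.12 rabbitmq-node-2
10.0.0.13 rabbitmq-node-3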

Revision history for this message
Bryan Quigley (bryanquigley) wrote :

1. Deploy using bundle to an openstack cloud (I used stsstack)

1828988.yaml
series: bionic
applications:
  rabbitmq-server:
    charm: cs:rabbitmq-server-89
    num_units: 3
    options:
      min-cluster-size: 3

juju deploy ./1828988.yaml

2. Wait until everything is set up: watch --color -n 5 juju status --color

3. openstack server stop juju-bae233-default-0 juju-bae233-default-1 juju-bae233-default-2
4. Wait for all to be SHUTOFF: watch -n 5 openstack server list
5. Do something like: openstack server start juju-bae233-default-0; sleep 150; openstack server start juju-bae233-default-1; sleep 150; openstack server start juju-bae233-default-2

It usually fails with 2 nodes in an update-status hook failure. I've had failures with sleeps from 80 to 150 seconds, but one time it came back fine at 120 seconds, and one time at 112 seconds it failed with a leader hook error instead.

How to recover:
openstack server stop juju-bae233-default-0 juju-bae233-default-1 juju-bae233-default-2
Wait for all to be SHUTOFF: watch -n 5 openstack server list
openstack server start juju-bae233-default-0 juju-bae233-default-1 juju-bae233-default-2

tags: added: cold-start
tags: added: sts
Changed in charm-rabbitmq-server:
assignee: nobody → Nicolas Bock (nicolasbock)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (master)

Fix proposed to branch: master
Review: https://review.opendev.org/716619

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/716776

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Removed rabbitmq-server package as the bug is related to the charm itself.

no longer affects: rabbitmq-server (Ubuntu)
no longer affects: rabbitmq-server (Ubuntu Xenial)
no longer affects: rabbitmq-server (Ubuntu Bionic)
no longer affects: rabbitmq-server (Ubuntu Eoan)
no longer affects: rabbitmq-server (Ubuntu Focal)
no longer affects: rabbitmq-server (Ubuntu)
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (master)

Reviewed: https://review.opendev.org/716619
Committed: https://git.openstack.org/cgit/openstack/charm-rabbitmq-server/commit/?id=ac1bc43ba901d9170137eeaee1e142e6bbd36cbb
Submitter: Zuul
Branch: master

commit ac1bc43ba901d9170137eeaee1e142e6bbd36cbb
Author: Nicolas Bock <email address hidden>
Date: Wed Apr 1 07:30:16 2020 -0600

    Add `force-boot` action

    This change adds a `force-boot` action which sets the `force_boot`
    flag and restarts the RabbitMQ broker. This action can be used if a
    broker refuses to start because the master of a queue is not
    available.

    Also add appropriate unit tests.

    Change-Id: I8b01d1d668e18116c7f8b1fc56f197620a10c91f
    Partial-Bug: #1828988
    Signed-off-by: Nicolas Bock <email address hidden>
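
A sketch of invoking the new action on a unit whose broker refuses to start (the action name comes from the change above; the unit name and Juju 2.x action syntax are assumptions):

juju run-action rabbitmq-server/0 force-boot --wait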

Revision history for this message
James Page (james-page) wrote :

Other than the action already committed, is there any further action required on this bug report?

Changed in charm-rabbitmq-server:
importance: Undecided → High
status: In Progress → Incomplete
Revision history for this message
James Page (james-page) wrote :

The charm deployment guide has a specific section on rabbitmq restarts - maybe we need to detail something about the action there:

  https://docs.openstack.org/project-deploy-guide/charm-deployment-guide/latest/app-managing-power-events.html#rabbitmq-server
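
For reference, the action sets RabbitMQ's force_boot flag and restarts the broker; done by hand on a node that refuses to start, that would look roughly like this (a sketch, not taken from the guide):

sudo rabbitmqctl force_boot
sudo systemctl start rabbitmq-server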

Changed in charm-rabbitmq-server:
milestone: none → 20.08
James Page (james-page)
Changed in charm-rabbitmq-server:
milestone: 20.08 → none
Changed in charm-rabbitmq-server:
status: Incomplete → In Progress
Revision history for this message
Nicolas Bock (nicolasbock) wrote :

I have added an update to the deployment guide [1]. I believe that this is sufficient to close the bug.

[1] https://review.opendev.org/c/openstack/charm-deployment-guide/+/785727

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/716776
Committed: https://opendev.org/openstack/charm-rabbitmq-server/commit/8015d9a365410efa5ef6d952ed33364b1679c0ef
Submitter: "Zuul (22348)"
Branch: master

commit 8015d9a365410efa5ef6d952ed33364b1679c0ef
Author: Nicolas Bock <email address hidden>
Date: Wed Apr 1 14:56:16 2020 -0600

    Add config parameters to tune mnesia settings

    When a RabbitMQ cluster is restarted, the mnesia settings determine
    how long and how often each broker will try to connect to the cluster
    before giving up. It might be useful for an operator to be able to
    tune these parameters. This change adds two settings,
    `mnesia-table-loading-retry-timeout` and
    `mnesia-table-loading-retry-limit`, which set these parameters in the
    rabbitmq.config file [1].

    [1] https://www.rabbitmq.com/configure.html#config-items

    Change-Id: I96aa8c4061aed47eb2e844d1bec44fafd379ac25
    Partial-Bug: #1828988
    Related-Bug: #1874075
    Co-authored-by: Nicolas Bock <email address hidden>
    Co-authored-by: Aurelien Lourot <email address hidden>
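
A sketch of tuning these via the charm (option names come from the change above; the values shown are RabbitMQ's documented defaults of 30000 ms and 10 retries, used here only as illustration):

juju config rabbitmq-server mnesia-table-loading-retry-timeout=30000 mnesia-table-loading-retry-limit=10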

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

It seems like all the mentioned reviews together may be enough to solve this issue. Closing for now. Feel free to re-open.

Changed in charm-rabbitmq-server:
status: In Progress → Fix Committed
milestone: none → 21.10
Changed in charm-rabbitmq-server:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-rabbitmq-server (master)

Change abandoned by "James Page <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/732452
Reason: This review is > 12 weeks without comment, and failed testing the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.
