rabbitmq server fails to start after cluster reboot

Bug #1828988 reported by Jason Hobbs
This bug affects 4 people
Affects: OpenStack RabbitMQ Server Charm
Status: Fix Released
Importance: High
Assigned to: Nicolas Bock
Milestone: 21.10

Bug Description

After rebooting an entire fcb cluster (shutdown -r on all nodes), my rabbitmq cluster failed to come back up.

rabbitmqctl cluster_status:

http://paste.ubuntu.com/p/hh4GV2BJ8R/

juju status for rabbitmq-server:
http://paste.ubuntu.com/p/ptrJSrHGkG/

bundle:
http://paste.ubuntu.com/p/k35TTVp3Ps/

Reproducer 1 (tested on charm rev 102):
Results in:
Unit Workload Agent Machine Public address Ports Message
rabbitmq-server/2 waiting idle 2 10.5.0.13 5672/tcp Unit has peers, but RabbitMQ not clustered
rabbitmq-server/3 error idle 3 10.5.0.4 5672/tcp hook failed: "cluster-relation-changed"
rabbitmq-server/4* error idle 4 10.5.0.20 5672/tcp hook failed: "update-status"

Howto:
juju deploy -n 3 --config min-cluster-size=3 rabbitmq-server
juju wait (may need snap install juju-wait first)
openstack server stop juju-98eb54-default-4 juju-98eb54-default-3 juju-98eb54-default-2
openstack server start juju-98eb54-default-4; sleep 150; openstack server start juju-98eb54-default-3; sleep 150; openstack server start juju-98eb54-default-2
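
To confirm the failure state after the staggered start, something like the following can be run (a sketch; the --application form of juju run is an assumption, adjust to the deployment):

juju status rabbitmq-server
juju run --application rabbitmq-server 'sudo rabbitmqctl cluster_status'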

As mentioned in the comments, there may be multiple timings that can trigger this failure.

description: updated
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

I cannot reproduce this. I have documented the steps I took to try to reproduce it in the attached log.

Changed in charm-rabbitmq-server:
status: New → Incomplete
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

Can you provide a minimal reproducing bundle and steps to reproduce?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Here is the tarball.

http://people.canonical.com/~jhobbs/juju-crashdump-openstack-2019-05-14-09.44.41.tar.gz

I linked to a bundle in the bug description.

I deployed the bundle, waited for everything to settle with juju wait.

Then, I ran juju run --all -m foundations-maas:openstack "sudo shutdown -r 1", and waited for all the machines to come back up.

Changed in charm-rabbitmq-server:
status: Incomplete → New
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

I have partially reproduced this with a much smaller bundle:

series: bionic
applications:
  rabbitmq-server:
    charm: cs:rabbitmq-server-89
    num_units: 3
    options:
      min-cluster-size: 3

I only saw this error on one of the three units, making me think that this may be an issue where the units coming up take too long to see each other, so the first one to come up gives up and dies.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

- MachineId: "0"
  ReturnCode: 69
  Stderr: |+
    Error: unable to connect to node 'rabbit@juju-dad1a1-rabbit1-0': nodedown

    DIAGNOSTICS
    ===========

    attempted to contact: ['rabbit@juju-dad1a1-rabbit1-0']

    rabbit@juju-dad1a1-rabbit1-0:
      * connected to epmd (port 4369) on juju-dad1a1-rabbit1-0
      * epmd reports: node 'rabbit' not running at all
                      no other nodes on juju-dad1a1-rabbit1-0
      * suggestion: start the node

    current node details:
    - node name: 'rabbitmq-cli-24@juju-dad1a1-rabbit1-0'
    - home dir: /var/lib/rabbitmq
    - cookie hash: 5OSqxVkVzK1bM9XxGU2uPw==

  Stdout: |
    Cluster status of node 'rabbit@juju-dad1a1-rabbit1-0'
- MachineId: "2"
  Stdout: |
    Cluster status of node 'rabbit@juju-dad1a1-rabbit1-2'
    [{nodes,[{disc,['rabbit@juju-dad1a1-rabbit1-0','rabbit@juju-dad1a1-rabbit1-1',
                    'rabbit@juju-dad1a1-rabbit1-2']}]},
     {running_nodes,['rabbit@juju-dad1a1-rabbit1-1',
                     'rabbit@juju-dad1a1-rabbit1-2']},
     {cluster_name,<<"rabbit@juju-dad1a1-rabbit1-0">>},
     {partitions,[]},
     {alarms,[{'rabbit@juju-dad1a1-rabbit1-1',[]},
              {'rabbit@juju-dad1a1-rabbit1-2',[]}]}]
- MachineId: "1"
  Stdout: |
    Cluster status of node 'rabbit@juju-dad1a1-rabbit1-1'
    [{nodes,[{disc,['rabbit@juju-dad1a1-rabbit1-0','rabbit@juju-dad1a1-rabbit1-1',
                    'rabbit@juju-dad1a1-rabbit1-2']}]},
     {running_nodes,['rabbit@juju-dad1a1-rabbit1-2',
                     'rabbit@juju-dad1a1-rabbit1-1']},
     {cluster_name,<<"rabbit@juju-dad1a1-rabbit1-0">>},
     {partitions,[]},
     {alarms,[{'rabbit@juju-dad1a1-rabbit1-2',[]},
              {'rabbit@juju-dad1a1-rabbit1-1',[]}]}]
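
(For reference: per-machine YAML like the above is what juju run emits; a sketch of the sort of invocation that collects it across all units, with the exact flags assumed rather than copied from my session:)

juju run --application rabbitmq-server --format yaml 'sudo rabbitmqctl cluster_status'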

Changed in charm-rabbitmq-server:
status: New → Confirmed
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

Model Controller Cloud/Region Version SLA Timestamp
rabbit1 icey-serverstack serverstack/serverstack 2.5.4 unsupported 11:31:16Z

App Version Status Scale Charm Store Rev OS Notes
rabbitmq-server 3.6.10 error 3 rabbitmq-server jujucharms 89 ubuntu

Unit Workload Agent Machine Public address Ports Message
rabbitmq-server/0* error idle 0 10.5.0.4 5672/tcp hook failed: "update-status"
rabbitmq-server/1 active idle 1 10.5.0.18 5672/tcp Unit is ready and clustered
rabbitmq-server/2 error idle 2 10.5.0.15 5672/tcp hook failed: "leader-settings-changed"

Machine State DNS Inst id Series AZ Message
0 started 10.5.0.4 0ce1627c-8950-4ee7-b8a1-ee9f289e55f4 bionic nova ACTIVE
1 started 10.5.0.18 b3b6edc9-3910-49f2-907e-3722c5a4d7c5 bionic nova ACTIVE
2 started 10.5.0.15 f9e34a91-7bb8-40c0-b937-710ab1af74f0 bionic nova ACTIVE

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

It's interesting to see that Juju had a hook failure (unit /2) even though the rabbit server ended up happily clustered.
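
If the broker itself is healthy and only the hook is stuck in error, the failed hook can be retried by hand, e.g. (a sketch, using the unit from the status above):

juju resolved rabbitmq-server/2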

tags: added: reboot-fail
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Might be related to the fact that Juju provisions machines with manage_etc_hosts set to True, so any changes made by the charm to /etc/hosts will be overridden on reboot.

grep -RiP manage_etc_hosts
cloudconfig/cloudinit/cloudinit.go: cfg.SetAttr("manage_etc_hosts", true)
cloudconfig/cloudinit/cloudinit.go: cfg.UnsetAttr("manage_etc_hosts")
cloudconfig/cloudinit/cloudinit_test.go: map[string]interface{}{"manage_etc_hosts": true},
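
A quick way to check this on a deployed unit (a sketch; the cloud-init paths are the usual ones, not verified here):

juju run --unit rabbitmq-server/0 'sudo grep -Ri manage_etc_hosts /etc/cloud /var/lib/cloud/instance/'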

Revision history for this message
Bryan Quigley (bryanquigley) wrote :

Seeing something very similar in a current deployment. Why is this related to management of hosts files?

Revision history for this message
Bryan Quigley (bryanquigley) wrote :

Indeed, once we manually added the hostnames/IPs of the other nodes to /etc/hosts on all the rabbitmq nodes, the cluster came back up.

One node had the hosts file correct beforehand. So clearly the info is there; why isn't Juju making it work for the other nodes?
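
For concreteness, each node needs /etc/hosts entries for its peers, roughly of this shape (hostnames and addresses below are made up for illustration):

10.0.0.11 rabbitmq-node-1
10.0.0.12 rabbitmq-node-2
10.0.0.13 rabbitmq-node-3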

Revision history for this message
Bryan Quigley (bryanquigley) wrote :

1. Deploy using bundle to an openstack cloud (I used stsstack)

1828988.yaml
series: bionic
applications:
  rabbitmq-server:
    charm: cs:rabbitmq-server-89
    num_units: 3
    options:
      min-cluster-size: 3

juju deploy ./1828988.yaml

2. Wait until everything is set up: watch --color -n 5 juju status --color

3. openstack server stop juju-bae233-default-0 juju-bae233-default-1 juju-bae233-default-2
4. Wait for all to be SHUTOFF: watch -n 5 openstack server list
5. Do something like: openstack server start juju-bae233-default-0; sleep 150; openstack server start juju-bae233-default-1; sleep 150; openstack server start juju-bae233-default-2

It usually fails with 2 nodes in an update-status hook failure. I've had failures with sleeps from 80 to 150 seconds, but one time it came back fine at 120 seconds, and one time at 112 seconds it failed with a leader hook error instead.

How to recover:
openstack server stop juju-bae233-default-0 juju-bae233-default-1 juju-bae233-default-2
Wait for all to be SHUTOFF: watch -n 5 openstack server list
openstack server start juju-bae233-default-0 juju-bae233-default-1 juju-bae233-default-2

tags: added: cold-start
tags: added: sts
Changed in charm-rabbitmq-server:
assignee: nobody → Nicolas Bock (nicolasbock)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (master)

Fix proposed to branch: master
Review: https://review.opendev.org/716619

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/716776

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Removed rabbitmq-server package as the bug is related to the charm itself.

no longer affects: rabbitmq-server (Ubuntu)
no longer affects: rabbitmq-server (Ubuntu Xenial)
no longer affects: rabbitmq-server (Ubuntu Bionic)
no longer affects: rabbitmq-server (Ubuntu Eoan)
no longer affects: rabbitmq-server (Ubuntu Focal)
no longer affects: rabbitmq-server (Ubuntu)
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (master)

Reviewed: https://review.opendev.org/716619
Committed: https://git.openstack.org/cgit/openstack/charm-rabbitmq-server/commit/?id=ac1bc43ba901d9170137eeaee1e142e6bbd36cbb
Submitter: Zuul
Branch: master

commit ac1bc43ba901d9170137eeaee1e142e6bbd36cbb
Author: Nicolas Bock <email address hidden>
Date: Wed Apr 1 07:30:16 2020 -0600

    Add `force-boot` action

    This change adds a `force-boot` action which sets the `force_boot`
    flag and restarts the RabbitMQ broker. This action can be used if a
    broker refuses to start because the master of a queue is not
    available.

    Also add appropriate unit tests.

    Change-Id: I8b01d1d668e18116c7f8b1fc56f197620a10c91f
    Partial-Bug: #1828988
    Signed-off-by: Nicolas Bock <email address hidden>
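
A sketch of invoking the new action on a unit whose broker refuses to start (the action name comes from the change above; the unit name and Juju 2.x action syntax are assumptions):

juju run-action rabbitmq-server/0 force-boot --wait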

Revision history for this message
James Page (james-page) wrote :

Other than the action already committed, is there any further action required on this bug report?

Changed in charm-rabbitmq-server:
importance: Undecided → High
status: In Progress → Incomplete
Revision history for this message
James Page (james-page) wrote :

The charm deployment guide has a specific section on rabbitmq restarts - maybe we need to detail something about the action there:

  https://docs.openstack.org/project-deploy-guide/charm-deployment-guide/latest/app-managing-power-events.html#rabbitmq-server
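
For reference, the action sets RabbitMQ's force_boot flag and restarts the broker; done by hand on a node that refuses to start, that would look roughly like this (a sketch, not taken from the guide):

sudo rabbitmqctl force_boot
sudo systemctl start rabbitmq-server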

Changed in charm-rabbitmq-server:
milestone: none → 20.08
James Page (james-page)
Changed in charm-rabbitmq-server:
milestone: 20.08 → none
Changed in charm-rabbitmq-server:
status: Incomplete → In Progress
Revision history for this message
Nicolas Bock (nicolasbock) wrote :

I have added an update to the deployment guide [1]. I believe that this is sufficient to close the bug.

[1] https://review.opendev.org/c/openstack/charm-deployment-guide/+/785727

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/716776
Committed: https://opendev.org/openstack/charm-rabbitmq-server/commit/8015d9a365410efa5ef6d952ed33364b1679c0ef
Submitter: "Zuul (22348)"
Branch: master

commit 8015d9a365410efa5ef6d952ed33364b1679c0ef
Author: Nicolas Bock <email address hidden>
Date: Wed Apr 1 14:56:16 2020 -0600

    Add config parameters to tune mnesia settings

    When a RabbitMQ cluster is restarted, the mnesia settings determine
    how long and how often each broker will try to connect to the cluster
    before giving up. It might be useful for an operator to be able to
    tune these parameters. This change adds two settings,
    `mnesia-table-loading-retry-timeout` and
    `mnesia-table-loading-retry-limit`, which set these parameters in the
    rabbitmq.config file [1].

    [1] https://www.rabbitmq.com/configure.html#config-items

    Change-Id: I96aa8c4061aed47eb2e844d1bec44fafd379ac25
    Partial-Bug: #1828988
    Related-Bug: #1874075
    Co-authored-by: Nicolas Bock <email address hidden>
    Co-authored-by: Aurelien Lourot <email address hidden>
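
A sketch of tuning these via the charm (option names come from the change above; the values shown are RabbitMQ's documented defaults of 30000 ms and 10 retries, used here only as illustration):

juju config rabbitmq-server mnesia-table-loading-retry-timeout=30000 mnesia-table-loading-retry-limit=10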

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

It seems like all the mentioned reviews together may be enough to solve this issue. Closing for now. Feel free to re-open.

Changed in charm-rabbitmq-server:
status: In Progress → Fix Committed
milestone: none → 21.10
Changed in charm-rabbitmq-server:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-rabbitmq-server (master)

Change abandoned by "James Page <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/732452
Reason: This review is > 12 weeks without comment, and failed testing the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.
