cloud install fails due to config-changed hook failure in rabbitmq-server service

Bug #1500204 reported by chigozie
32
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Landscape Server
Fix Released
Undecided
Unassigned
15.07
Won't Fix
Undecided
Unassigned
Cisco-odl
Fix Released
Undecided
Chris Glass
falkor
Fix Released
High
Данило Шеган
rabbitmq-server (Juju Charms Collection)
Fix Released
High
David Ames

Bug Description

Sometimes, clusters form split-brained with the min-cluster-size=3 on juju 1.24.6. The end result looks like this:

dpb@helo:~[0]$ juju run --service rabbitmq-server 'sudo rabbitmqctl cluster_status'
- MachineId: 0/lxc/12
  Stdout: |
    Cluster status of node 'rabbit@10-1-61-43' ...
    [{nodes,[{disc,['rabbit@10-1-61-43','rabbit@10-1-63-87']}]},
     {running_nodes,['rabbit@10-1-63-87','rabbit@10-1-61-43']},
     {partitions,[]}]
    ...done.
  UnitId: rabbitmq-server/0
- MachineId: 0/lxc/13
  Stdout: |
    Cluster status of node 'rabbit@10-1-63-87' ...
    [{nodes,[{disc,['rabbit@10-1-61-43','rabbit@10-1-63-87']}]},
     {running_nodes,['rabbit@10-1-61-43','rabbit@10-1-63-87']},
     {partitions,[]}]
    ...done.
  UnitId: rabbitmq-server/1
- MachineId: 0/lxc/14
  Stdout: |
    Cluster status of node 'rabbit@10-1-63-88' ...
    [{nodes,[{disc,['rabbit@10-1-63-88']}]},
     {running_nodes,['rabbit@10-1-63-88']},
     {partitions,[]}]
    ...done.
  UnitId: rabbitmq-server/2

Lots of logging and debugging attached, I can attach more if necessary.

Related branches

Revision history for this message
Chris Glass (tribaal) wrote :

Thanks for your bug report. Could you please attach /var/log/juju/* (at least all-machines.log) to this bug report?

Also, what does running "/usr/sbin/rabbitmqctl set_permissions -p nagios-rabbitmq-server-2 nagios-rabbitmq-server-2 .* .* .*" on the failed unit yield?

Thanks.

Changed in falkor:
status: New → Incomplete
Revision history for this message
chigozie (ceasiabaka) wrote :

i attached the pod-collect-logs output.

Revision history for this message
chigozie (ceasiabaka) wrote :
Revision history for this message
chigozie (ceasiabaka) wrote :

requested rabbitmqctl command syntax might be wrong.

root@juju-machine-1-lxc-9:/home/ubuntu# /usr/sbin/rabbitmqctl set_permissions -p nagios-rabbitmq-server-2 nagios-rabbitmq-server-2 .* .* .*
Error: invalid command 'set_permissions nagios-rabbitmq-server-2 . .. .bash_history .bash_logout .bashrc .cache .erlang.cookie .juju-proxy .profile .ssh . .. .bash_history .bash_logout .bashrc .cache .erlang.cookie .juju-proxy .profile .ssh . .. .bash_history .bash_logout .bashrc .cache .erlang.cookie .juju-proxy .profile .ssh'

David Britton (dpb)
Changed in falkor:
status: Incomplete → New
Revision history for this message
David Britton (dpb) wrote :

Note -- all-machines.log and cloud-logs.txz are from two separate deployments (just pointing out for future readers)

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

From the txz. Units 0 and 1 finish installing and step back because the cluster is not yet complete. Last messages from both:

unit-rabbitmq-server-1[1068]: 2015-09-26 19:37:23 INFO unit.rabbitmq-server/1.juju-log server.go:254 cluster:1: cluster_joined: cookie not yet set.
unit-rabbitmq-server-0[969]: 2015-09-26 19:37:25 INFO unit.rabbitmq-server/0.juju-log server.go:254 cluster:1: cluster_joined: cookie not yet set.

Then unit 2 comes along, installs the packages and creates the same basic vhosts and users as the previous two units, but fails for no apparent reason:
(...)
unit-rabbitmq-server-2[1056]: 2015-09-26 19:39:02 INFO unit.rabbitmq-server/2.config-changed logger.go:40 Creating vhost "nagios-rabbitmq-server-2" ...
unit-rabbitmq-server-2[1056]: 2015-09-26 19:39:02 INFO unit.rabbitmq-server/2.config-changed logger.go:40 ...done.
unit-rabbitmq-server-2[1056]: 2015-09-26 19:39:02 INFO unit.rabbitmq-server/2.juju-log server.go:254 Created new vhost (nagios-rabbitmq-server-2).
unit-rabbitmq-server-2[1056]: 2015-09-26 19:39:10 INFO unit.rabbitmq-server/2.config-changed logger.go:40 Error: unable to connect to node 'rabbit@10-20-0-99': nodedown
...
unit-rabbitmq-server-2[1056]: 2015-09-26 19:39:10 INFO unit.rabbitmq-server/2.config-changed logger.go:40 subprocess.CalledProcessError: Command '['/usr/sbin/rabbitmqctl', 'list_users']' returned non-zero exit status 2

It's as if the rabbit app was shutdown mid-deploy, or the client somehow failed to reach the server even when it was up.

The rabbit logs around the same time, on rabbitmq-server/2, show a startup at 19:38:56, and confirm the successful creation of the nagios-rabbitmq-server-2 vhost:
=INFO REPORT==== 26-Sep-2015::19:38:56 ===
started TCP Listener on [::]:5672

=INFO REPORT==== 26-Sep-2015::19:38:56 ===
Server startup complete; 0 plugins started.

=INFO REPORT==== 26-Sep-2015::19:39:02 ===
Adding vhost 'nagios-rabbitmq-server-2'

That's where the rabbit log from unit 2 ends. Nothing indicates that the rabbit service died.

Revision history for this message
chigozie (ceasiabaka) wrote :
Download full text (4.8 KiB)

nit-rabbitmq-server-0[999]: 2015-09-29 15:57:16 INFO unit.rabbitmq-server/0.amqp-relation-changed logger.go:40 DIAGNOSTICS
unit-rabbitmq-server-0[999]: 2015-09-29 15:57:16 INFO unit.rabbitmq-server/0.amqp-relation-changed logger.go:40 ===========
unit-rabbitmq-server-0[999]: 2015-09-29 15:57:16 INFO unit.rabbitmq-server/0.amqp-relation-changed logger.go:40
unit-rabbitmq-server-0[999]: 2015-09-29 15:57:16 INFO unit.rabbitmq-server/0.amqp-relation-changed logger.go:40 nodes in question: ['rabbit@10-20-0-111']
unit-rabbitmq-server-0[999]: 2015-09-29 15:57:16 INFO unit.rabbitmq-server/0.amqp-relation-changed logger.go:40
unit-rabbitmq-server-0[999]: 2015-09-29 15:57:16 INFO unit.rabbitmq-server/0.amqp-relation-changed logger.go:40 hosts, their running nodes and ports:
unit-rabbitmq-server-0[999]: 2015-09-29 15:57:16 INFO unit.rabbitmq-server/0.amqp-relation-changed logger.go:40 - 10-20-0-111: [{rabbit,49650},{rabbitmqctl12390,35802}]
unit-rabbitmq-server-0[999]: 2015-09-29 15:57:16 INFO unit.rabbitmq-server/0.amqp-relation-changed logger.go:40
unit-rabbitmq-server-0[999]: 2015-09-29 15:57:16 INFO unit.rabbitmq-server/0.amqp-relation-changed logger.go:40 current node details:
unit-rabbitmq-server-0[999]: 2015-09-29 15:57:16 INFO unit.rabbitmq-server/0.amqp-relation-changed logger.go:40 - node name: 'rabbitmqctl12390@juju-machine-2-lxc-4'
unit-rabbitmq-server-0[999]: 2015-09-29 15:57:16 INFO unit.rabbitmq-server/0.amqp-relation-changed logger.go:40 - home dir: /var/lib/rabbitmq
unit-rabbitmq-server-0[999]: 2015-09-29 15:57:16 INFO unit.rabbitmq-server/0.amqp-relation-changed logger.go:40 - cookie hash: Lfvj/SZFPWjD4a7UDb43GQ==
unit-rabbitmq-server-0[999]: 2015-09-29 15:57:16 INFO unit.rabbitmq-server/0.amqp-relation-changed logger.go:40
unit-rabbitmq-server-0[999]: 2015-09-29 15:57:16 INFO unit.rabbitmq-server/0.amqp-relation-changed logger.go:40 Traceback (most recent call last):
unit-rabbitmq-server-0[999]: 2015-09-29 15:57:16 INFO unit.rabbitmq-server/0.amqp-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/amqp-relation-changed", line 811, in <module>
unit-rabbitmq-server-0[999]: 2015-09-29 15:57:16 INFO unit.rabbitmq-server/0.amqp-relation-changed logger.go:40 hooks.execute(sys.argv)
unit-rabbitmq-server-0[999]: 2015-09-29 15:57:16 INFO unit.rabbitmq-server/0.amqp-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/charmhelpers/core/hookenv.py", line 672, in execute
unit-rabbitmq-server-0[999]: 2015-09-29 15:57:16 INFO unit.rabbitmq-server/0.amqp-relation-changed logger.go:40 self._hooks[hook_name]()
unit-rabbitmq-server-0[999]: 2015-09-29 15:57:16 INFO unit.rabbitmq-server/0.amqp-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/amqp-relation-changed", line 208, in amqp_changed
unit-rabbitmq-server-0[999]: 2015-09-29 15:57:16 INFO unit.rabbitmq-server/0.amqp-relation-changed logger.go:40 admin=settings.get('admin', False))
unit-rabbitmq-server-0[999]: 2015-09-29 15:57:16 INFO unit.rabbitmq-server/0.amqp-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/amq...

Read more...

Revision history for this message
chigozie (ceasiabaka) wrote :
Revision history for this message
chigozie (ceasiabaka) wrote :
Revision history for this message
chigozie (ceasiabaka) wrote :
Revision history for this message
chigozie (ceasiabaka) wrote :
David Britton (dpb)
Changed in falkor:
importance: Undecided → High
status: New → Triaged
information type: Proprietary → Private
information type: Private → Public
David Britton (dpb)
description: updated
Changed in rabbitmq-server (Juju Charms Collection):
importance: Undecided → High
David Ames (thedac)
Changed in rabbitmq-server (Juju Charms Collection):
status: New → Confirmed
assignee: nobody → David Ames (thedac)
David Britton (dpb)
Changed in rabbitmq-server (Juju Charms Collection):
status: Confirmed → In Progress
tags: added: kanban-cross-team
David Britton (dpb)
Changed in falkor:
status: Triaged → In Progress
Changed in rabbitmq-server (Juju Charms Collection):
status: In Progress → Fix Committed
milestone: none → 15.10
Changed in falkor:
milestone: none → 0.12
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Won't fix for landscape release-29.

James Page (james-page)
Changed in rabbitmq-server (Juju Charms Collection):
status: Fix Committed → Fix Released
Changed in falkor:
milestone: 0.12 → 0.13
David Britton (dpb)
Changed in landscape:
status: New → Fix Committed
Changed in falkor:
assignee: nobody → Chris Glass (tribaal)
David Britton (dpb)
Changed in landscape:
milestone: none → 15.11
David Britton (dpb)
Changed in falkor:
milestone: 0.13 → 0.14
Revision history for this message
Данило Шеган (danilo) wrote :

For reference:
 - Fixed in rabbitmq-server charm version 39 according to https://jujucharms.com/rabbitmq-server/trusty/39
 - lp:landscape/cisco-odl includes it since r9157 from 2015-10-29

Changed in falkor:
assignee: Chris Glass (tribaal) → nobody
assignee: nobody → Данило Шеган (danilo)
Changed in landscape:
status: Fix Committed → Fix Released
Changed in falkor:
milestone: 0.14 → 0.15
Changed in falkor:
milestone: 0.15 → none
milestone: none → 0.16
Changed in falkor:
status: In Progress → Fix Released
milestone: 0.16 → 0.15
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.