rabbitmq_ctl stop didn't work and subsequent start gets stuck (missing /etc/hosts entry)
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Landscape Server | New | Undecided | Unassigned | |
16.06 | Fix Released | High | Andreas Hasenack | |
rabbitmq-server (Juju Charms Collection) | Fix Released | High | James Page | |
Bug Description
cs:trusty/
landscape reference: https:/
We had several autopilot runs fail because a juju run of /bin/true timed out (our timeout is 1h). We do this run right after relating the hacluster services and before relating all the principals.
Debugging shows that on one of the 3 rabbit units, the "rabbitmq_ctl stop" issued right after the /etc/hosts file was changed didn't actually stop the server, and the subsequent start then hangs.
This is the story we were able to reconstruct from the logs (all attached):
rabbitmq-server/1 stuck:
rabbitmq-
machine: 5/lxc/2
open-ports: [5672/tcp]
Jun 2016 12:27:04Z'}
rabbitmq-
hook, since: '12 Jun 2016 11:07:29Z', version: 1.25.5}
machine: 2/lxc/0
open-ports: [5672/tcp]
since: '12 Jun 2016 11:09:23Z'}
rabbitmq-
machine: 1/lxc/0
open-ports: [5672/tcp]
Jun 2016 12:25:04Z'}
rabbitmq on that unit confirmed started at 11:07:36:
2016-06-12 11:07:29 INFO cluster-
2016-06-12 11:07:32 INFO cluster-
2016-06-12 11:07:33 INFO cluster-
2016-06-12 11:07:35 INFO cluster-
2016-06-12 11:07:35 DEBUG juju-log cluster:4: Running ['/usr/
2016-06-12 11:07:35 INFO cluster-
2016-06-12 11:07:35 INFO cluster-
2016-06-12 11:07:36 INFO cluster-
hosts file updated at 11:07:37:
2016-06-12 11:07:37 INFO juju-log cluster:4: Updating hosts file with: {u'10.96.4.112': u'juju-
Here it gets interesting.
2016-06-12 11:08:48 DEBUG juju-log cluster:4: Running ['/usr/
2016-06-12 11:08:49 INFO cluster-
2016-06-12 11:08:57 INFO cluster-
2016-06-12 11:09:23 DEBUG juju-log cluster:4: Running ['/usr/
2016-06-12 11:09:23 INFO cluster-
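To make the timing easier to follow, here is a minimal sketch that computes the gaps between the juju log timestamps quoted above. The timestamps are copied from the excerpt; the function name and the variable labels are only illustrative and are not part of the charm.

```python
from datetime import datetime

# Timestamp layout used by the juju unit log lines quoted above.
JUJU_LOG_FORMAT = "%Y-%m-%d %H:%M:%S"


def seconds_between(earlier, later):
    """Return the whole seconds elapsed between two juju log timestamps."""
    start = datetime.strptime(earlier, JUJU_LOG_FORMAT)
    end = datetime.strptime(later, JUJU_LOG_FORMAT)
    return int((end - start).total_seconds())


# Timestamps taken from the excerpt; the labels are this sketch's own.
hosts_updated = "2016-06-12 11:07:37"
next_command = "2016-06-12 11:08:48"
start_issued = "2016-06-12 11:09:23"

print(seconds_between(hosts_updated, next_command))  # 71
print(seconds_between(next_command, start_issued))   # 35
```

So roughly a minute passes between the hosts file update and the next command, and the start follows 35 seconds after that.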
Comparing the timestamps with the actual rabbit logs, we can see that when start was issued at 11:09:23, rabbit was already failing hard, probably as a result of the stop action:
=INFO REPORT==== 12-Jun-
Server startup complete; 0 plugins started.
=INFO REPORT==== 12-Jun-
Stopping RabbitMQ
=INFO REPORT==== 12-Jun-
stopped TCP Listener on [::]:5672
=WARNING REPORT==== 12-Jun-
global: 'rabbit@
=INFO REPORT==== 12-Jun-
Clustering with ['rabbit@
=ERROR REPORT==== 12-Jun-
Mnesia(
=ERROR REPORT==== 12-Jun-
Mnesia(
(... repeats about half a million times ...)
$ grep "could not connect to node" <email address hidden> |wc -l
438068
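The same count can be reproduced without grep, for example when post-processing the attached logs in a script. This is a rough Python equivalent of the `grep ... | wc -l` pipeline above; the function names are made up for illustration, and the sample lines are placeholders, not real RabbitMQ output.

```python
def count_matching_lines(lines, needle="could not connect to node"):
    """Count the lines that contain the needle, like `grep needle | wc -l`."""
    return sum(1 for line in lines if needle in line)


def count_in_file(path, needle="could not connect to node"):
    """Count matching lines in a log file; path is supplied by the caller."""
    # errors="replace" keeps the scan going if the log has undecodable bytes.
    with open(path, errors="replace") as handle:
        return count_matching_lines(handle, needle)


# Placeholder lines, only to show the behaviour.
sample = [
    "placeholder: could not connect to node A",
    "placeholder: unrelated message",
    "placeholder: could not connect to node B",
]
print(count_matching_lines(sample))  # 2
```

Iterating over the file handle keeps memory use flat, which matters for a log with hundreds of thousands of repeated errors.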
ps output of 2/lxc/0 (rabbitmq-server/1, also attached, don't worry about formatting here in the bug). You can see the start attempt, and the already-running rabbit, as well as the stuck juju run:
root 1012 0.0 0.2 478564 40384 ? Ssl 11:03 0:00 /var/lib/
root 11373 0.0 0.1 92432 23328 ? S 11:07 0:01 \_ /usr/bin/python /var/lib/
root 11737 0.0 0.0 4448 1688 ? S 11:09 0:00 \_ /bin/sh /usr/sbin/
root 11745 0.0 0.0 61680 3280 ? S 11:09 0:00 \_ su rabbitmq -s /bin/sh -c /usr/lib/
rabbitmq 11746 0.0 0.0 4448 684 ? Ss 11:09 0:00 \_ sh -c /usr/lib/
rabbitmq 11747 0.0 0.2 457144 48444 ? Sl 11:09 0:00 \_ /usr/lib/
rabbitmq 11787 0.0 0.0 7464 972 ? Ss 11:09 0:00 \_ inet_gethost 4
rabbitmq 11788 0.0 0.0 13784 1744 ? S 11:09 0:00 \_ inet_gethost 4
rabbitmq 7984 0.0 0.0 7500 1452 ? S 11:06 0:00 /usr/lib/
rabbitmq 11009 0.0 0.0 4448 748 ? S 11:07 0:00 /bin/sh /usr/sbin/
rabbitmq 11028 17.8 0.4 2258272 71584 ? Sl 11:07 12:02 \_ /usr/lib/
rabbitmq 11173 1.4 0.0 7464 880 ? Ss 11:07 0:58 \_ inet_gethost 4
rabbitmq 11174 4.0 0.0 13784 1920 ? R 11:07 2:45 \_ inet_gethost 4
ubuntu 11919 0.0 0.0 11116 2664 ? Ss 11:12 0:00 /bin/bash -s
ubuntu 11920 0.0 0.2 255080 35404 ? Sl 11:12 0:00 \_ juju-run --no-context /bin/true
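The inet_gethost children in the listing are, as far as we know, Erlang's helpers for native hostname lookups, which would fit the missing /etc/hosts entry named in the bug title. A minimal sketch for checking whether a hosts file has an entry for every expected node name follows. The function names, host names, and addresses are hypothetical placeholders, and this is not the check the charm itself performs.

```python
def hosts_entries(hosts_text):
    """Map each hostname or alias in hosts-file text to its IP address."""
    mapping = {}
    for raw_line in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            # Keep the first address seen for a name.
            mapping.setdefault(name, ip)
    return mapping


def missing_hosts(hosts_text, expected_names):
    """Return the expected names that have no entry in the hosts text."""
    known = hosts_entries(hosts_text)
    return [name for name in expected_names if name not in known]


# Hypothetical hosts file content and node names, for illustration only.
sample_hosts = """
127.0.0.1 localhost
192.0.2.10 node-a   # first peer
192.0.2.11 node-b
"""
print(missing_hosts(sample_hosts, ["node-a", "node-b", "node-c"]))  # ['node-c']
```

On a real unit the text would come from reading /etc/hosts, and the expected names would be the peer host names the charm writes during the cluster relation.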
tags: | removed: kanban-cross-team |
tags: | added: cdo-qa-blocker |
no longer affects: | landscape/16.05 |
Changed in rabbitmq-server (Juju Charms Collection): | |
assignee: | nobody → James Page (james-page) |
status: | New → In Progress |
milestone: | none → 16.07 |
importance: | Undecided → High |
Changed in rabbitmq-server (Juju Charms Collection): | |
status: | Fix Committed → Fix Released |
Changed in rabbitmq-server (Juju Charms Collection): | |
milestone: | 16.07 → 16.10 |
summary: |
- rabbitmq_ctl stop didn't work and subsequent start gets stuck
+ rabbitmq_ctl stop didn't work and subsequent start gets stuck (missing
+ /etc/hosts/ entry) |
summary: |
rabbitmq_ctl stop didn't work and subsequent start gets stuck (missing
- /etc/hosts/ entry)
+ /etc/hosts entry) |
Changed in rabbitmq-server (Juju Charms Collection): | |
status: | Confirmed → Triaged |
Changed in rabbitmq-server (Juju Charms Collection): | |
status: | Fix Committed → Fix Released |
tags: | removed: cdo-qa-blocker |
The process listing of each unit can be found in var/log/ps-fauxww.txt