We had several autopilot runs fail with a juju run of /bin/true timeout (1h is our timeout). We do this run right after we related the hacluster services, and before we relate all the principals.
Debugging shows that in one of the 3 rabbit units "rabbitmq_ctl stop" right after the /etc/hosts file was changed didn't really work, and the subsequent start then just doesn't work.
This is the story we were able to reconstruct from the logs (all attached):
rabbitmq-server/1 stuck:
rabbitmq-server/0: agent-state: started agent-status: {current: idle, since: '12 Jun 2016 12:27:06Z', version: 1.25.5} agent-version: 1.25.5
machine: 5/lxc/2
open-ports: [5672/tcp] public-address: 10.96.4.112 workload-status: {current: active, message: Unit is ready and clustered, since: '12
Jun 2016 12:27:04Z'}
rabbitmq-server/1: agent-state: started agent-status: {current: executing, message: running cluster-relation-changed
hook, since: '12 Jun 2016 11:07:29Z', version: 1.25.5} agent-version: 1.25.5
machine: 2/lxc/0
open-ports: [5672/tcp] public-address: 10.96.8.38 workload-status: {current: maintenance, message: Starting rabbitmq applilcation,
since: '12 Jun 2016 11:09:23Z'}
rabbitmq-server/2: agent-state: started agent-status: {current: idle, since: '12 Jun 2016 12:25:06Z', version: 1.25.5} agent-version: 1.25.5
machine: 1/lxc/0
open-ports: [5672/tcp] public-address: 10.96.3.15 workload-status: {current: active, message: Unit is ready and clustered, since: '12
Jun 2016 12:25:04Z'}
rabbitmq on that unit confirmed started at 17:07:36:
2016-06-12 11:07:29 INFO cluster-relation-changed * Stopping message broker rabbitmq-server
2016-06-12 11:07:32 INFO cluster-relation-changed ...done.
2016-06-12 11:07:33 INFO cluster-relation-changed * Restarting message broker rabbitmq-server
2016-06-12 11:07:35 INFO cluster-relation-changed ...done.
2016-06-12 11:07:35 DEBUG juju-log cluster:4: Running ['/usr/sbin/rabbitmqctl', 'wait', '/var/run/rabbitmq/pid']
2016-06-12 11:07:35 INFO cluster-relation-changed Waiting for 'rabbit@juju-machine-2-lxc-0' ...
2016-06-12 11:07:35 INFO cluster-relation-changed pid is 11028 ...
2016-06-12 11:07:36 INFO cluster-relation-changed ...done.
hosts file updated at 11:07:37:
2016-06-12 11:07:37 INFO juju-log cluster:4: Updating hosts file with: {u'10.96.4.112': u'juju-machine-5-lxc-2'} (current: ['127.0.0.1 localhost\n', '\n', '# The following lines are desirable for IPv6 capable hosts\n', '::1 ip6-localhost ip6-loopback\n', 'fe00::0 ip6-localnet\n', 'ff00::0 ip6-mcastprefix\n', 'ff02::1 ip6-allnodes\n', 'ff02::2 ip6-allrouters\n', 'ff02::3 ip6-allhosts\n'])
Comparing the timestamps with the actual rabbit logs, we can see that when start was issued at 11:09:23, rabbit was already failing hard, probably as a result of the stop action:
=INFO REPORT==== 12-Jun-2016::11:07:35 ===
Server startup complete; 0 plugins started.
=INFO REPORT==== 12-Jun-2016::11:08:49 ===
stopped TCP Listener on [::]:5672
=WARNING REPORT==== 12-Jun-2016::11:09:11 ===
global: 'rabbit@juju-machine-2-lxc-0' failed to connect to 'rabbit@juju-machine-1-lxc-0'
=INFO REPORT==== 12-Jun-2016::11:09:13 ===
Clustering with ['rabbit@juju-machine-5-lxc-2'] as disc node
=ERROR REPORT==== 12-Jun-2016::11:09:13 ===
Mnesia('rabbit@juju-machine-2-lxc-0'): ** ERROR ** Mnesia on 'rabbit@juju-machine-2-lxc-0' could not connect to node(s) ['rabbit@juju-machine-1-lxc-0']
=ERROR REPORT==== 12-Jun-2016::11:09:13 ===
Mnesia('rabbit@juju-machine-2-lxc-0'): ** ERROR ** Mnesia on 'rabbit@juju-machine-2-lxc-0' could not connect to node(s) ['rabbit@juju-machine-1-lxc-0']
(... repeats about half a million times ...)
$ grep "could not connect to node" <email address hidden> |wc -l
438068
ps output of 2/lxc/0 (rabbitmq-server/1, also attached, don't worry about formatting here in the bug). You can see the start attempt, and the already-running rabbit, as well as the stuck juju run:
cs:trusty/ rabbitmq- server- 47
We had several autopilot runs fail with a juju run of /bin/true timeout (1h is our timeout). We do this run right after we related the hacluster services, and before we relate all the principals.
Debugging shows that in one of the 3 rabbit units "rabbitmq_ctl stop" right after the /etc/hosts file was changed didn't really work, and the subsequent start then just doesn't work.
This is the story we were able to reconstruct from the logs (all attached):
rabbitmq-server/1 stuck:
rabbitmq- server/ 0:
agent- state: started
agent- status: {current: idle, since: '12 Jun 2016 12:27:06Z', version: 1.25.5}
agent- version: 1.25.5
public- address: 10.96.4.112
workload- status: {current: active, message: Unit is ready and clustered, since: '12 server/ 1:
agent- state: started
agent- status: {current: executing, message: running cluster- relation- changed
agent- version: 1.25.5
public- address: 10.96.8.38
workload- status: {current: maintenance, message: Starting rabbitmq applilcation, server/ 2:
agent- state: started
agent- status: {current: idle, since: '12 Jun 2016 12:25:06Z', version: 1.25.5}
agent- version: 1.25.5
public- address: 10.96.3.15
workload- status: {current: active, message: Unit is ready and clustered, since: '12
machine: 5/lxc/2
open-ports: [5672/tcp]
Jun 2016 12:27:04Z'}
rabbitmq-
hook, since: '12 Jun 2016 11:07:29Z', version: 1.25.5}
machine: 2/lxc/0
open-ports: [5672/tcp]
since: '12 Jun 2016 11:09:23Z'}
rabbitmq-
machine: 1/lxc/0
open-ports: [5672/tcp]
Jun 2016 12:25:04Z'}
rabbitmq on that unit confirmed started at 17:07:36: relation- changed * Stopping message broker rabbitmq-server relation- changed ...done. relation- changed * Restarting message broker rabbitmq-server relation- changed ...done. sbin/rabbitmqct l', 'wait', '/var/run/ rabbitmq/ pid'] relation- changed Waiting for 'rabbit@ juju-machine- 2-lxc-0' ... relation- changed pid is 11028 ... relation- changed ...done.
2016-06-12 11:07:29 INFO cluster-
2016-06-12 11:07:32 INFO cluster-
2016-06-12 11:07:33 INFO cluster-
2016-06-12 11:07:35 INFO cluster-
2016-06-12 11:07:35 DEBUG juju-log cluster:4: Running ['/usr/
2016-06-12 11:07:35 INFO cluster-
2016-06-12 11:07:35 INFO cluster-
2016-06-12 11:07:36 INFO cluster-
hosts file updated at 11:07:37: machine- 5-lxc-2' } (current: ['127.0.0.1 localhost\n', '\n', '# The following lines are desirable for IPv6 capable hosts\n', '::1 ip6-localhost ip6-loopback\n', 'fe00::0 ip6-localnet\n', 'ff00::0 ip6-mcastprefix\n', 'ff02::1 ip6-allnodes\n', 'ff02::2 ip6-allrouters\n', 'ff02::3 ip6-allhosts\n'])
2016-06-12 11:07:37 INFO juju-log cluster:4: Updating hosts file with: {u'10.96.4.112': u'juju-
Here it gets interesting.
2016-06-12 11:08:48 DEBUG juju-log cluster:4: Running ['/usr/ sbin/rabbitmqct l', 'stop_app'] relation- changed Stopping node 'rabbit@ juju-machine- 2-lxc-0' ... relation- changed ...done. sbin/rabbitmqct l', 'start_app'] relation- changed Starting node 'rabbit@ juju-machine- 2-lxc-0' ...
2016-06-12 11:08:49 INFO cluster-
2016-06-12 11:08:57 INFO cluster-
2016-06-12 11:09:23 DEBUG juju-log cluster:4: Running ['/usr/
2016-06-12 11:09:23 INFO cluster-
Comparing the timestamps with the actual rabbit logs, we can see that when start was issued at 11:09:23, rabbit was already failing hard, probably as a result of the stop action: 2016::11: 07:35 ===
=INFO REPORT==== 12-Jun-
Server startup complete; 0 plugins started.
=INFO REPORT==== 12-Jun- 2016::11: 08:49 ===
Stopping RabbitMQ
=INFO REPORT==== 12-Jun- 2016::11: 08:49 ===
stopped TCP Listener on [::]:5672
=WARNING REPORT==== 12-Jun- 2016::11: 09:11 === juju-machine- 2-lxc-0' failed to connect to 'rabbit@ juju-machine- 1-lxc-0'
global: 'rabbit@
=INFO REPORT==== 12-Jun- 2016::11: 09:13 === juju-machine- 5-lxc-2' ] as disc node
Clustering with ['rabbit@
=ERROR REPORT==== 12-Jun- 2016::11: 09:13 === 'rabbit@ juju-machine- 2-lxc-0' ): ** ERROR ** Mnesia on 'rabbit@ juju-machine- 2-lxc-0' could not connect to node(s) ['rabbit@ juju-machine- 1-lxc-0' ]
Mnesia(
=ERROR REPORT==== 12-Jun- 2016::11: 09:13 === 'rabbit@ juju-machine- 2-lxc-0' ): ** ERROR ** Mnesia on 'rabbit@ juju-machine- 2-lxc-0' could not connect to node(s) ['rabbit@ juju-machine- 1-lxc-0' ]
Mnesia(
(... repeats about half a million times ...)
$ grep "could not connect to node" <email address hidden> |wc -l
438068
ps output of 2/lxc/0 (rabbitmq-server/1, also attached, don't worry about formatting here in the bug). You can see the start attempt, and the already-running rabbit, as well as the stuck juju run:
root 1012 0.0 0.2 478564 40384 ? Ssl 11:03 0:00 /var/lib/ juju/tools/ unit-rabbitmq- server- 1/jujud unit --data-dir /var/lib/juju --unit-name rabbitmq-server/1 --debug juju/agents/ unit-rabbitmq- server- 1/charm/ hooks/cluster- relation- changed rabbitmqctl start_app rabbitmq/ bin/rabbitmqctl "start_app" rabbitmq/ bin/rabbitmqctl "start_app" erlang/ erts-5. 10.4/bin/ beam.smp -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/ rabbitmq/ lib/rabbitmq_ server- 3.2.4/sbin/ ../ebin -noshell -noinput -hidden -sname rabbitmqctl11747 -boot start_clean -s rabbit_control_main -nodename rabbit@ juju-machine- 2-lxc-0 -extra start_app erlang/ erts-5. 10.4/bin/ epmd -daemon rabbitmq- server erlang/ erts-5. 10.4/bin/ beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/ rabbitmq/ lib/rabbitmq_ server- 3.2.4/sbin/ ../ebin -noshell -noinput -s rabbit boot -sname rabbit@ juju-machine- 2-lxc-0 -boot start_sasl -config /etc/rabbitmq/ rabbitmq -kernel inet_default_ connect_ options [{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/<email address hidden>"} -rabbit sasl_error_logger {file,"/<email address hidden>"} -rabbit enabled_ plugins_ file "/etc/rabbitmq/ enabled_ plugins" -rabbit plugins_dir "/usr/lib/ rabbitmq/ lib/rabbitmq_ server- 3.2.4/sbin/ ../plugins" -rabbit plugins_expand_dir "/var/lib/ rabbitmq/ mnesia/ rabbit@ juju-machine- 2-lxc-0- plugins- expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/var/lib/ rabbitmq/ mnesia/ rabbit@ juju-machine- 2-lxc-0"
root 11373 0.0 0.1 92432 23328 ? S 11:07 0:01 \_ /usr/bin/python /var/lib/
root 11737 0.0 0.0 4448 1688 ? S 11:09 0:00 \_ /bin/sh /usr/sbin/
root 11745 0.0 0.0 61680 3280 ? S 11:09 0:00 \_ su rabbitmq -s /bin/sh -c /usr/lib/
rabbitmq 11746 0.0 0.0 4448 684 ? Ss 11:09 0:00 \_ sh -c /usr/lib/
rabbitmq 11747 0.0 0.2 457144 48444 ? Sl 11:09 0:00 \_ /usr/lib/
rabbitmq 11787 0.0 0.0 7464 972 ? Ss 11:09 0:00 \_ inet_gethost 4
rabbitmq 11788 0.0 0.0 13784 1744 ? S 11:09 0:00 \_ inet_gethost 4
rabbitmq 7984 0.0 0.0 7500 1452 ? S 11:06 0:00 /usr/lib/
rabbitmq 11009 0.0 0.0 4448 748 ? S 11:07 0:00 /bin/sh /usr/sbin/
rabbitmq 11028 17.8 0.4 2258272 71584 ? Sl 11:07 12:02 \_ /usr/lib/
rabbitmq 11173 1.4 0.0 7464 880 ? Ss 11:07 0:58 \_ inet_gethost 4
rabbitmq 11174 4.0 0.0 13784 1920 ? R 11:07 2:45 \_ inet_gethost 4
ubuntu 11919 0.0 0.0 11116 2664 ? Ss 11:12 0:00 /bin/bash -s
ubuntu 11920 0.0 0.2 255080 35404 ? Sl 11:12 0:00 \_ juju-run --no-context /bin/true