Activity log for bug #1592200

Date Who What changed Old value New value Message
2016-06-14 00:10:04 Andreas Hasenack bug added bug
2016-06-14 00:10:04 Andreas Hasenack attachment added rabbit-units-logs.tar.bz2 https://bugs.launchpad.net/bugs/1592200/+attachment/4683261/+files/rabbit-units-logs.tar.bz2
2016-06-14 00:10:43 Andreas Hasenack attachment added juju-status.txt https://bugs.launchpad.net/charms/+source/rabbitmq-server/+bug/1592200/+attachment/4683262/+files/juju-status.txt
2016-06-14 00:11:03 Andreas Hasenack description cs:trusty/rabbitmq-server-47 We had several autopilot runs fail with a juju run of /bin/true timeout (1h is our timeout). We do this run right after we related the hacluster services, and before we relate all the principals. Debugging shows that in one of the 3 rabbit units "rabbitmq_ctl stop" right after the /etc/hosts file was changed didn't really work, and the subsequent start then just doesn't work. This is the story we were able to reconstruct from the logs (all attached): rabbitmq-server/1 stuck: rabbitmq-server/0: agent-state: started agent-status: {current: idle, since: '12 Jun 2016 12:27:06Z', version: 1.25.5} agent-version: 1.25.5 machine: 5/lxc/2 open-ports: [5672/tcp] public-address: 10.96.4.112 workload-status: {current: active, message: Unit is ready and clustered, since: '12 Jun 2016 12:27:04Z'} rabbitmq-server/1: agent-state: started agent-status: {current: executing, message: running cluster-relation-changed hook, since: '12 Jun 2016 11:07:29Z', version: 1.25.5} agent-version: 1.25.5 machine: 2/lxc/0 open-ports: [5672/tcp] public-address: 10.96.8.38 workload-status: {current: maintenance, message: Starting rabbitmq applilcation, since: '12 Jun 2016 11:09:23Z'} rabbitmq-server/2: agent-state: started agent-status: {current: idle, since: '12 Jun 2016 12:25:06Z', version: 1.25.5} agent-version: 1.25.5 machine: 1/lxc/0 open-ports: [5672/tcp] public-address: 10.96.3.15 workload-status: {current: active, message: Unit is ready and clustered, since: '12 Jun 2016 12:25:04Z'} rabbitmq on that unit confirmed started at 17:07:36: 2016-06-12 11:07:29 INFO cluster-relation-changed * Stopping message broker rabbitmq-server 2016-06-12 11:07:32 INFO cluster-relation-changed ...done. 2016-06-12 11:07:33 INFO cluster-relation-changed * Restarting message broker rabbitmq-server 2016-06-12 11:07:35 INFO cluster-relation-changed ...done. 2016-06-12 11:07:35 DEBUG juju-log cluster:4: Running ['/usr/sbin/rabbitmqctl', 'wait', '/var/run/rabbitmq/pid'] 2016-06-12 11:07:35 INFO cluster-relation-changed Waiting for 'rabbit@juju-machine-2-lxc-0' ... 2016-06-12 11:07:35 INFO cluster-relation-changed pid is 11028 ... 2016-06-12 11:07:36 INFO cluster-relation-changed ...done. hosts file updated at 11:07:37: 2016-06-12 11:07:37 INFO juju-log cluster:4: Updating hosts file with: {u'10.96.4.112': u'juju-machine-5-lxc-2'} (current: ['127.0.0.1 localhost\n', '\n', '# The following lines are desirable for IPv6 capable hosts\n', '::1 ip6-localhost ip6-loopback\n', 'fe00::0 ip6-localnet\n', 'ff00::0 ip6-mcastprefix\n', 'ff02::1 ip6-allnodes\n', 'ff02::2 ip6-allrouters\n', 'ff02::3 ip6-allhosts\n']) Here it gets interesting. 2016-06-12 11:08:48 DEBUG juju-log cluster:4: Running ['/usr/sbin/rabbitmqctl', 'stop_app'] 2016-06-12 11:08:49 INFO cluster-relation-changed Stopping node 'rabbit@juju-machine-2-lxc-0' ... 2016-06-12 11:08:57 INFO cluster-relation-changed ...done. 2016-06-12 11:09:23 DEBUG juju-log cluster:4: Running ['/usr/sbin/rabbitmqctl', 'start_app'] 2016-06-12 11:09:23 INFO cluster-relation-changed Starting node 'rabbit@juju-machine-2-lxc-0' ... Comparing the timestamps with the actual rabbit logs, we can see that when start was issued at 11:09:23, rabbit was already failing hard, probably as a result of the stop action: =INFO REPORT==== 12-Jun-2016::11:07:35 === Server startup complete; 0 plugins started. =INFO REPORT==== 12-Jun-2016::11:08:49 === Stopping RabbitMQ =INFO REPORT==== 12-Jun-2016::11:08:49 === stopped TCP Listener on [::]:5672 =WARNING REPORT==== 12-Jun-2016::11:09:11 === global: 'rabbit@juju-machine-2-lxc-0' failed to connect to 'rabbit@juju-machine-1-lxc-0' =INFO REPORT==== 12-Jun-2016::11:09:13 === Clustering with ['rabbit@juju-machine-5-lxc-2'] as disc node =ERROR REPORT==== 12-Jun-2016::11:09:13 === Mnesia('rabbit@juju-machine-2-lxc-0'): ** ERROR ** Mnesia on 'rabbit@juju-machine-2-lxc-0' could not connect to node(s) ['rabbit@juju-machine-1-lxc-0'] =ERROR REPORT==== 12-Jun-2016::11:09:13 === Mnesia('rabbit@juju-machine-2-lxc-0'): ** ERROR ** Mnesia on 'rabbit@juju-machine-2-lxc-0' could not connect to node(s) ['rabbit@juju-machine-1-lxc-0'] (... repeats about half a million times ...) $ grep "could not connect to node" rabbit@juju-machine-2-lxc-0.log |wc -l 438068 ps output of 2/lxc/0 (rabbitmq-server/1, also attached, don't worry about formatting here in the bug). You can see the start attempt, and the already-running rabbit, as well as the stuck juju run: root 1012 0.0 0.2 478564 40384 ? Ssl 11:03 0:00 /var/lib/juju/tools/unit-rabbitmq-server-1/jujud unit --data-dir /var/lib/juju --unit-name rabbitmq-server/1 --debug root 11373 0.0 0.1 92432 23328 ? S 11:07 0:01 \_ /usr/bin/python /var/lib/juju/agents/unit-rabbitmq-server-1/charm/hooks/cluster-relation-changed root 11737 0.0 0.0 4448 1688 ? S 11:09 0:00 \_ /bin/sh /usr/sbin/rabbitmqctl start_app root 11745 0.0 0.0 61680 3280 ? S 11:09 0:00 \_ su rabbitmq -s /bin/sh -c /usr/lib/rabbitmq/bin/rabbitmqctl "start_app" rabbitmq 11746 0.0 0.0 4448 684 ? Ss 11:09 0:00 \_ sh -c /usr/lib/rabbitmq/bin/rabbitmqctl "start_app" rabbitmq 11747 0.0 0.2 457144 48444 ? Sl 11:09 0:00 \_ /usr/lib/erlang/erts-5.10.4/bin/beam.smp -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.2.4/sbin/../ebin -noshell -noinput -hidden -sname rabbitmqctl11747 -boot start_clean -s rabbit_control_main -nodename rabbit@juju-machine-2-lxc-0 -extra start_app rabbitmq 11787 0.0 0.0 7464 972 ? Ss 11:09 0:00 \_ inet_gethost 4 rabbitmq 11788 0.0 0.0 13784 1744 ? S 11:09 0:00 \_ inet_gethost 4 rabbitmq 7984 0.0 0.0 7500 1452 ? S 11:06 0:00 /usr/lib/erlang/erts-5.10.4/bin/epmd -daemon rabbitmq 11009 0.0 0.0 4448 748 ? S 11:07 0:00 /bin/sh /usr/sbin/rabbitmq-server rabbitmq 11028 17.8 0.4 2258272 71584 ? Sl 11:07 12:02 \_ /usr/lib/erlang/erts-5.10.4/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.2.4/sbin/../ebin -noshell -noinput -s rabbit boot -sname rabbit@juju-machine-2-lxc-0 -boot start_sasl -config /etc/rabbitmq/rabbitmq -kernel inet_default_connect_options [{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/var/log/rabbitmq/rabbit@juju-machine-2-lxc-0.log"} -rabbit sasl_error_logger {file,"/var/log/rabbitmq/rabbit@juju-machine-2-lxc-0-sasl.log"} -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir "/usr/lib/rabbitmq/lib/rabbitmq_server-3.2.4/sbin/../plugins" -rabbit plugins_expand_dir "/var/lib/rabbitmq/mnesia/rabbit@juju-machine-2-lxc-0-plugins-expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@juju-machine-2-lxc-0" rabbitmq 11173 1.4 0.0 7464 880 ? Ss 11:07 0:58 \_ inet_gethost 4 rabbitmq 11174 4.0 0.0 13784 1920 ? R 11:07 2:45 \_ inet_gethost 4 ubuntu 11919 0.0 0.0 11116 2664 ? Ss 11:12 0:00 /bin/bash -s ubuntu 11920 0.0 0.2 255080 35404 ? Sl 11:12 0:00 \_ juju-run --no-context /bin/true cs:trusty/rabbitmq-server-47 landscape reference: https://ci.lscape.net/job/landscape-system-tests/2762/ We had several autopilot runs fail with a juju run of /bin/true timeout (1h is our timeout). We do this run right after we related the hacluster services, and before we relate all the principals. Debugging shows that in one of the 3 rabbit units "rabbitmq_ctl stop" right after the /etc/hosts file was changed didn't really work, and the subsequent start then just doesn't work. This is the story we were able to reconstruct from the logs (all attached): rabbitmq-server/1 stuck:       rabbitmq-server/0:         agent-state: started         agent-status: {current: idle, since: '12 Jun 2016 12:27:06Z', version: 1.25.5}         agent-version: 1.25.5         machine: 5/lxc/2         open-ports: [5672/tcp]         public-address: 10.96.4.112         workload-status: {current: active, message: Unit is ready and clustered, since: '12             Jun 2016 12:27:04Z'}       rabbitmq-server/1:         agent-state: started         agent-status: {current: executing, message: running cluster-relation-changed             hook, since: '12 Jun 2016 11:07:29Z', version: 1.25.5}         agent-version: 1.25.5         machine: 2/lxc/0         open-ports: [5672/tcp]         public-address: 10.96.8.38         workload-status: {current: maintenance, message: Starting rabbitmq applilcation,           since: '12 Jun 2016 11:09:23Z'}       rabbitmq-server/2:         agent-state: started         agent-status: {current: idle, since: '12 Jun 2016 12:25:06Z', version: 1.25.5}         agent-version: 1.25.5         machine: 1/lxc/0         open-ports: [5672/tcp]         public-address: 10.96.3.15         workload-status: {current: active, message: Unit is ready and clustered, since: '12             Jun 2016 12:25:04Z'} rabbitmq on that unit confirmed started at 17:07:36: 2016-06-12 11:07:29 INFO cluster-relation-changed * Stopping message broker rabbitmq-server 2016-06-12 11:07:32 INFO cluster-relation-changed ...done. 2016-06-12 11:07:33 INFO cluster-relation-changed * Restarting message broker rabbitmq-server 2016-06-12 11:07:35 INFO cluster-relation-changed ...done. 2016-06-12 11:07:35 DEBUG juju-log cluster:4: Running ['/usr/sbin/rabbitmqctl', 'wait', '/var/run/rabbitmq/pid'] 2016-06-12 11:07:35 INFO cluster-relation-changed Waiting for 'rabbit@juju-machine-2-lxc-0' ... 2016-06-12 11:07:35 INFO cluster-relation-changed pid is 11028 ... 2016-06-12 11:07:36 INFO cluster-relation-changed ...done. hosts file updated at 11:07:37: 2016-06-12 11:07:37 INFO juju-log cluster:4: Updating hosts file with: {u'10.96.4.112': u'juju-machine-5-lxc-2'} (current: ['127.0.0.1 localhost\n', '\n', '# The following lines are desirable for IPv6 capable hosts\n', '::1 ip6-localhost ip6-loopback\n', 'fe00::0 ip6-localnet\n', 'ff00::0 ip6-mcastprefix\n', 'ff02::1 ip6-allnodes\n', 'ff02::2 ip6-allrouters\n', 'ff02::3 ip6-allhosts\n']) Here it gets interesting. 2016-06-12 11:08:48 DEBUG juju-log cluster:4: Running ['/usr/sbin/rabbitmqctl', 'stop_app'] 2016-06-12 11:08:49 INFO cluster-relation-changed Stopping node 'rabbit@juju-machine-2-lxc-0' ... 2016-06-12 11:08:57 INFO cluster-relation-changed ...done. 2016-06-12 11:09:23 DEBUG juju-log cluster:4: Running ['/usr/sbin/rabbitmqctl', 'start_app'] 2016-06-12 11:09:23 INFO cluster-relation-changed Starting node 'rabbit@juju-machine-2-lxc-0' ... Comparing the timestamps with the actual rabbit logs, we can see that when start was issued at 11:09:23, rabbit was already failing hard, probably as a result of the stop action: =INFO REPORT==== 12-Jun-2016::11:07:35 === Server startup complete; 0 plugins started. =INFO REPORT==== 12-Jun-2016::11:08:49 === Stopping RabbitMQ =INFO REPORT==== 12-Jun-2016::11:08:49 === stopped TCP Listener on [::]:5672 =WARNING REPORT==== 12-Jun-2016::11:09:11 === global: 'rabbit@juju-machine-2-lxc-0' failed to connect to 'rabbit@juju-machine-1-lxc-0' =INFO REPORT==== 12-Jun-2016::11:09:13 === Clustering with ['rabbit@juju-machine-5-lxc-2'] as disc node =ERROR REPORT==== 12-Jun-2016::11:09:13 === Mnesia('rabbit@juju-machine-2-lxc-0'): ** ERROR ** Mnesia on 'rabbit@juju-machine-2-lxc-0' could not connect to node(s) ['rabbit@juju-machine-1-lxc-0'] =ERROR REPORT==== 12-Jun-2016::11:09:13 === Mnesia('rabbit@juju-machine-2-lxc-0'): ** ERROR ** Mnesia on 'rabbit@juju-machine-2-lxc-0' could not connect to node(s) ['rabbit@juju-machine-1-lxc-0'] (... repeats about half a million times ...) $ grep "could not connect to node" rabbit@juju-machine-2-lxc-0.log |wc -l 438068 ps output of 2/lxc/0 (rabbitmq-server/1, also attached, don't worry about formatting here in the bug). You can see the start attempt, and the already-running rabbit, as well as the stuck juju run: root 1012 0.0 0.2 478564 40384 ? Ssl 11:03 0:00 /var/lib/juju/tools/unit-rabbitmq-server-1/jujud unit --data-dir /var/lib/juju --unit-name rabbitmq-server/1 --debug root 11373 0.0 0.1 92432 23328 ? S 11:07 0:01 \_ /usr/bin/python /var/lib/juju/agents/unit-rabbitmq-server-1/charm/hooks/cluster-relation-changed root 11737 0.0 0.0 4448 1688 ? S 11:09 0:00 \_ /bin/sh /usr/sbin/rabbitmqctl start_app root 11745 0.0 0.0 61680 3280 ? S 11:09 0:00 \_ su rabbitmq -s /bin/sh -c /usr/lib/rabbitmq/bin/rabbitmqctl "start_app" rabbitmq 11746 0.0 0.0 4448 684 ? Ss 11:09 0:00 \_ sh -c /usr/lib/rabbitmq/bin/rabbitmqctl "start_app" rabbitmq 11747 0.0 0.2 457144 48444 ? Sl 11:09 0:00 \_ /usr/lib/erlang/erts-5.10.4/bin/beam.smp -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.2.4/sbin/../ebin -noshell -noinput -hidden -sname rabbitmqctl11747 -boot start_clean -s rabbit_control_main -nodename rabbit@juju-machine-2-lxc-0 -extra start_app rabbitmq 11787 0.0 0.0 7464 972 ? Ss 11:09 0:00 \_ inet_gethost 4 rabbitmq 11788 0.0 0.0 13784 1744 ? S 11:09 0:00 \_ inet_gethost 4 rabbitmq 7984 0.0 0.0 7500 1452 ? S 11:06 0:00 /usr/lib/erlang/erts-5.10.4/bin/epmd -daemon rabbitmq 11009 0.0 0.0 4448 748 ? S 11:07 0:00 /bin/sh /usr/sbin/rabbitmq-server rabbitmq 11028 17.8 0.4 2258272 71584 ? Sl 11:07 12:02 \_ /usr/lib/erlang/erts-5.10.4/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.2.4/sbin/../ebin -noshell -noinput -s rabbit boot -sname rabbit@juju-machine-2-lxc-0 -boot start_sasl -config /etc/rabbitmq/rabbitmq -kernel inet_default_connect_options [{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/var/log/rabbitmq/rabbit@juju-machine-2-lxc-0.log"} -rabbit sasl_error_logger {file,"/var/log/rabbitmq/rabbit@juju-machine-2-lxc-0-sasl.log"} -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir "/usr/lib/rabbitmq/lib/rabbitmq_server-3.2.4/sbin/../plugins" -rabbit plugins_expand_dir "/var/lib/rabbitmq/mnesia/rabbit@juju-machine-2-lxc-0-plugins-expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@juju-machine-2-lxc-0" rabbitmq 11173 1.4 0.0 7464 880 ? Ss 11:07 0:58 \_ inet_gethost 4 rabbitmq 11174 4.0 0.0 13784 1920 ? R 11:07 2:45 \_ inet_gethost 4 ubuntu 11919 0.0 0.0 11116 2664 ? Ss 11:12 0:00 /bin/bash -s ubuntu 11920 0.0 0.2 255080 35404 ? Sl 11:12 0:00 \_ juju-run --no-context /bin/true
2016-06-14 00:11:21 Andreas Hasenack bug added subscriber Landscape
2016-06-14 00:22:23 Andreas Hasenack bug task added landscape
2016-06-14 00:22:34 Andreas Hasenack nominated for series landscape/16.05
2016-06-14 00:22:34 Andreas Hasenack bug task added landscape/16.05
2016-06-14 00:22:44 Andreas Hasenack landscape/16.05: milestone 16.05
2016-06-14 00:53:52 🤖 Landscape Builder tags kanban-cross-team landscape landscape
2016-06-14 18:01:38 Andreas Hasenack tags landscape cdo-qa-blocker landscape
2016-06-16 17:49:05 Andreas Hasenack landscape/16.05: status New Won't Fix
2016-06-20 20:44:09 David Britton bug task deleted landscape/16.05
2016-07-07 11:07:31 James Page rabbitmq-server (Juju Charms Collection): assignee James Page (james-page)
2016-07-07 11:07:33 James Page rabbitmq-server (Juju Charms Collection): status New In Progress
2016-07-07 11:07:35 James Page rabbitmq-server (Juju Charms Collection): milestone 16.07
2016-07-07 11:07:37 James Page rabbitmq-server (Juju Charms Collection): importance Undecided High
2016-07-08 14:35:42 OpenStack Infra rabbitmq-server (Juju Charms Collection): status In Progress Fix Committed
2016-07-11 09:08:56 James Page rabbitmq-server (Juju Charms Collection): status Fix Committed Fix Released
2016-07-25 17:25:23 Andreas Hasenack rabbitmq-server (Juju Charms Collection): status Fix Released Confirmed
2016-07-29 07:21:20 Liam Young rabbitmq-server (Juju Charms Collection): milestone 16.07 16.10
2016-08-10 13:38:58 Andreas Hasenack nominated for series landscape/16.06
2016-08-10 13:38:58 Andreas Hasenack bug task added landscape/16.06
2016-08-10 13:39:08 Andreas Hasenack landscape/16.06: status New In Progress
2016-08-10 13:39:11 Andreas Hasenack landscape/16.06: importance Undecided High
2016-08-10 13:39:13 Andreas Hasenack landscape/16.06: assignee Andreas Hasenack (ahasenack)
2016-08-10 13:47:26 R bug added subscriber nemesis.spa
2016-08-10 13:49:41 R removed subscriber nemesis.spa
2016-08-10 14:42:01 🤖 Landscape Builder landscape/16.06: status In Progress Fix Committed
2016-08-10 14:42:01 🤖 Landscape Builder landscape/16.06: milestone 16.06.1
2016-08-15 12:22:57 Andreas Hasenack landscape/16.06: status Fix Committed Fix Released
2016-09-14 15:06:51 Ursula Junque summary rabbitmq_ctl stop didn't work and subsequent start gets stuck rabbitmq_ctl stop didn't work and subsequent start gets stuck (missing /etc/hosts/ entry)
2016-09-14 15:06:59 Ursula Junque summary rabbitmq_ctl stop didn't work and subsequent start gets stuck (missing /etc/hosts/ entry) rabbitmq_ctl stop didn't work and subsequent start gets stuck (missing /etc/hosts entry)
2016-09-19 15:48:01 James Page rabbitmq-server (Juju Charms Collection): status Confirmed Triaged
2016-09-20 09:36:13 OpenStack Infra rabbitmq-server (Juju Charms Collection): status Triaged In Progress
2016-09-23 12:53:37 OpenStack Infra rabbitmq-server (Juju Charms Collection): status In Progress Fix Committed
2016-10-14 09:01:20 James Page rabbitmq-server (Juju Charms Collection): status Fix Committed Fix Released
2017-09-06 20:06:55 Chris Gregan tags cdo-qa-blocker landscape landscape