Juju Charms Collection
rabbitmq-server package

service restart error not caught, waits forever

Bug #1575349 reported by Andreas Hasenack on 2016-04-26

This bug affects 3 people

Affects		Status	Importance	Assigned to	Milestone
	rabbitmq-server (Juju Charms Collection)	Fix Released	Medium	Unassigned	Juju Charms Collection 16.10

Bug Description

I have deployed cs:trusty/rabbitmq-server-43 with juju 2.0beta6 and the service is stuck:

rabbitmq-server/0 maintenance executing 2.0-beta6 0/lxc/3 10.96.12.161 (config-changed) Waiting for rabbitmq app to start: /<email address hidden>

It's indeed waiting for a pid file to appear:
965 ? Ssl 0:00 /var/lib/juju/tools/unit-rabbitmq-server-0/jujud unit --data-dir /var/lib/juju --unit-name rabbitmq-server/0 --debug
6634 ? S 0:00 \_ /usr/bin/python /var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/config-changed
8182 ? S 0:00 \_ /bin/sh /usr/sbin/rabbitmqctl wait /<email address hidden>
8191 ? S 0:00 \_ su rabbitmq -s /bin/sh -c /usr/lib/rabbitmq/bin/rabbitmqctl "wait" "/<email address hidden>"
8192 ? Ss 0:00 \_ sh -c /usr/lib/rabbitmq/bin/rabbitmqctl "wait" "/<email address hidden>"
8193 ? Sl 0:02 \_ /usr/lib/erlang/erts-5.10.4/bin/beam.smp -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.2.4/sbin/../ebin -noshell -noinput -hidden -sname rabbitmqctl8193 -boot start_clean -s rabbit_control_main -nodename rabbit@euphoric-hook -extra wait /<email address hidden>

That directory has no such file.

Turns out rabbit failed to start:

# cat /var/log/rabbitmq/startup_log
ERROR: epmd error for host "euphoric-hook": nxdomain (non-existing domain)

Somehow that failure was not caught, and now the charm is telling rabbit to wait forever for a pid file that will never show up.

Looking at the charm code, wait_app() is used after a restart in all cases. Something like this:

service_restart('rabbitmq-server')
rabbit.wait_app()

So bug number one: the service_restart() failure was not caught.

Bug number two: wait_app() should not wait forever. It should implement a reasonable timeout.

See original description

Tags:

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2016-04-26:

unit-rabbitmq-server-0.log Edit (240.7 KiB, text/plain)

I induced a failure to get the charm moving on, then tried a restart via the shell:

root@juju-machine-0-lxc-3:/var/log/rabbitmq# service rabbitmq-server restart
* Restarting message broker rabbitmq-server [fail]
root@juju-machine-0-lxc-3:/var/log/rabbitmq# echo $?
1

It failed as it should, and for the same reason:
# cat /var/log/rabbitmq/startup_log
ERROR: epmd error for host "euphoric-hook": nxdomain (non-existing domain)

So why didn't the charm catch it? I don't know, the code looks fine at first glance.

def service(action, service_name):
    """Control a system service"""
    if init_is_systemd():
        cmd = ['systemctl', action, service_name]
    else:
        cmd = ['service', service_name, action]
    return subprocess.call(cmd) == 0

I'm attaching the rabbit unit logs.

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2016-04-26:

It really didn't notice the failure:
2016-04-26 19:44:02 INFO worker.uniter.jujuc server.go:173 running hook tool "juju-log" ["Starting rabbitmq-server."]
2016-04-26 19:44:02 INFO juju-log Starting rabbitmq-server.
2016-04-26 19:44:02 INFO config-changed * Restarting message broker rabbitmq-server
2016-04-26 19:44:04 INFO config-changed ...fail!
2016-04-26 19:44:04 INFO worker.uniter.jujuc server.go:173 running hook tool "juju-log" ["-l" "INFO" "getting local nodename for ip address: 10.96.12.161"]
2016-04-26 19:44:04 INFO juju-log getting local nodename for ip address: 10.96.12.161
2016-04-26 19:44:04 INFO worker.uniter.jujuc server.go:173 running hook tool "juju-log" ["-l" "INFO" "local nodename: euphoric-hook"]
2016-04-26 19:44:04 INFO juju-log local nodename: euphoric-hook
2016-04-26 19:44:04 INFO worker.uniter.jujuc server.go:173 running hook tool "status-set" ["maintenance" "Waiting for rabbitmq app to start: /<email address hidden>"]
2016-04-26 19:44:04 INFO worker.uniter.jujuc server.go:173 running hook tool "juju-log" ["-l" "DEBUG" "Running ['/usr/sbin/rabbitmqctl', 'wait', '/<email address hidden>']"]
2016-04-26 19:44:04 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'wait', '/<email address hidden>']
2016-04-26 19:44:04 INFO config-changed Waiting for 'rabbit@euphoric-hook' ...

Andreas Hasenack (ahasenack) on 2016-04-26

description:

updated

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2016-04-27:

When rabbit is deployed with juju1, the following happens differently:
a) /etc/resolv.conf gets a "search <domain>" entry, that an lxc under MAAS deployed with juju2 does not

b) the lxc IP is just a normal IP obtained via DHCP, instead of a device in MAAS:
Apr 27 20:00:31 atlas dhcpd: DHCPREQUEST for 10.96.9.136 (10.96.0.10) from 00:16:3e:55:8e:a4 (juju-machine-0-lxc-7) via br0

When deployed with juju2, the container gets an IP from a "maas device":
Apr 26 19:39:28 atlas maas.node_query: [INFO] shawmut: Power is on.
Apr 26 19:41:08 atlas maas.api: [INFO] euphoric-hook: Added new device (parent: shawmut)
Apr 26 19:41:09 atlas maas.dns: [INFO] Generating new DNS zone file for scapestack

This is what triggered this rabbit bug: no "search <domain>" entry in /etc/resolv.conf. One can argue that it's not a bug in this charm then, but the charm should have failed right away when the restart action failed, instead of getting stuck forever. So there is still a charm bug.

🤖 Landscape Builder (landscape-builder) on 2016-05-10

tags:

removed: kanban-cross-team

Revision history for this message

James Page (james-page) wrote on 2016-12-13:

rabbitmq/erlang is very sensitive to broken/poorly configured forward/reverse DNS.

We can't/should not try to fix that specific in the charm (that's more of a general charm environment problem for Juju IMHO), but the change made in commit 38912a540c8d259cf25ef29adc55c3c5ff55157b does introduce a 180 second timeout on all wait operations, which should mean that the charm enters an error start if RMQ fails to startup due to this type of issue.

Revision history for this message

James Page (james-page) wrote on 2016-12-13:

Failure in service_restart is async to the subprocess call that makes it (so can't fix that one).

Timeout of 180 on wait should fix bug 2.

Marking Fix Released (this was included in the 16.10 charm release).