service restart error not caught, waits forever

Bug #1575349 reported by Andreas Hasenack
28
This bug affects 3 people
Affects Status Importance Assigned to Milestone
rabbitmq-server (Juju Charms Collection)
Fix Released
Medium
Unassigned

Bug Description

I have deployed cs:trusty/rabbitmq-server-43 with juju 2.0beta6 and the service is stuck:

rabbitmq-server/0 maintenance executing 2.0-beta6 0/lxc/3 10.96.12.161 (config-changed) Waiting for rabbitmq app to start: /<email address hidden>

It's indeed waiting for a pid file to appear:
  965 ? Ssl 0:00 /var/lib/juju/tools/unit-rabbitmq-server-0/jujud unit --data-dir /var/lib/juju --unit-name rabbitmq-server/0 --debug
 6634 ? S 0:00 \_ /usr/bin/python /var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/config-changed
 8182 ? S 0:00 \_ /bin/sh /usr/sbin/rabbitmqctl wait /<email address hidden>
 8191 ? S 0:00 \_ su rabbitmq -s /bin/sh -c /usr/lib/rabbitmq/bin/rabbitmqctl "wait" "/<email address hidden>"
 8192 ? Ss 0:00 \_ sh -c /usr/lib/rabbitmq/bin/rabbitmqctl "wait" "/<email address hidden>"
 8193 ? Sl 0:02 \_ /usr/lib/erlang/erts-5.10.4/bin/beam.smp -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.2.4/sbin/../ebin -noshell -noinput -hidden -sname rabbitmqctl8193 -boot start_clean -s rabbit_control_main -nodename rabbit@euphoric-hook -extra wait /<email address hidden>

That directory has no such file.

Turns out rabbit failed to start:

# cat /var/log/rabbitmq/startup_log
ERROR: epmd error for host "euphoric-hook": nxdomain (non-existing domain)

Somehow that failure was not caught, and now the charm is telling rabbit to wait forever for a pid file that will never show up.

Looking at the charm code, wait_app() is used after a restart in all cases. Something like this:

    service_restart('rabbitmq-server')
    rabbit.wait_app()

So bug number one: the service_restart() failure was not caught.

Bug number two: wait_app() should not wait forever. It should implement a reasonable timeout.

Tags: landscape
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I induced a failure to get the charm moving on, then tried a restart via the shell:

root@juju-machine-0-lxc-3:/var/log/rabbitmq# service rabbitmq-server restart
 * Restarting message broker rabbitmq-server [fail]
root@juju-machine-0-lxc-3:/var/log/rabbitmq# echo $?
1

It failed as it should, and for the same reason:
# cat /var/log/rabbitmq/startup_log
ERROR: epmd error for host "euphoric-hook": nxdomain (non-existing domain)

So why didn't the charm catch it? I don't know, the code looks fine at first glance.

def service(action, service_name):
    """Control a system service"""
    if init_is_systemd():
        cmd = ['systemctl', action, service_name]
    else:
        cmd = ['service', service_name, action]
    return subprocess.call(cmd) == 0

I'm attaching the rabbit unit logs.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

It really didn't notice the failure:
2016-04-26 19:44:02 INFO worker.uniter.jujuc server.go:173 running hook tool "juju-log" ["Starting rabbitmq-server."]
2016-04-26 19:44:02 INFO juju-log Starting rabbitmq-server.
2016-04-26 19:44:02 INFO config-changed * Restarting message broker rabbitmq-server
2016-04-26 19:44:04 INFO config-changed ...fail!
2016-04-26 19:44:04 INFO worker.uniter.jujuc server.go:173 running hook tool "juju-log" ["-l" "INFO" "getting local nodename for ip address: 10.96.12.161"]
2016-04-26 19:44:04 INFO juju-log getting local nodename for ip address: 10.96.12.161
2016-04-26 19:44:04 INFO worker.uniter.jujuc server.go:173 running hook tool "juju-log" ["-l" "INFO" "local nodename: euphoric-hook"]
2016-04-26 19:44:04 INFO juju-log local nodename: euphoric-hook
2016-04-26 19:44:04 INFO worker.uniter.jujuc server.go:173 running hook tool "status-set" ["maintenance" "Waiting for rabbitmq app to start: /<email address hidden>"]
2016-04-26 19:44:04 INFO worker.uniter.jujuc server.go:173 running hook tool "juju-log" ["-l" "DEBUG" "Running ['/usr/sbin/rabbitmqctl', 'wait', '/<email address hidden>']"]
2016-04-26 19:44:04 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'wait', '/<email address hidden>']
2016-04-26 19:44:04 INFO config-changed Waiting for 'rabbit@euphoric-hook' ...

description: updated
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

When rabbit is deployed with juju1, the following happens differently:
a) /etc/resolv.conf gets a "search <domain>" entry, that an lxc under MAAS deployed with juju2 does not

b) the lxc IP is just a normal IP obtained via DHCP, instead of a device in MAAS:
Apr 27 20:00:31 atlas dhcpd: DHCPREQUEST for 10.96.9.136 (10.96.0.10) from 00:16:3e:55:8e:a4 (juju-machine-0-lxc-7) via br0

When deployed with juju2, the container gets an IP from a "maas device":
Apr 26 19:39:28 atlas maas.node_query: [INFO] shawmut: Power is on.
Apr 26 19:41:08 atlas maas.api: [INFO] euphoric-hook: Added new device (parent: shawmut)
Apr 26 19:41:09 atlas maas.dns: [INFO] Generating new DNS zone file for scapestack

This is what triggered this rabbit bug: no "search <domain>" entry in /etc/resolv.conf. One can argue that it's not a bug in this charm then, but the charm should have failed right away when the restart action failed, instead of getting stuck forever. So there is still a charm bug.

tags: removed: kanban-cross-team
Revision history for this message
James Page (james-page) wrote :

rabbitmq/erlang is very sensitive to broken/poorly configured forward/reverse DNS.

We can't/should not try to fix that specific in the charm (that's more of a general charm environment problem for Juju IMHO), but the change made in commit 38912a540c8d259cf25ef29adc55c3c5ff55157b does introduce a 180 second timeout on all wait operations, which should mean that the charm enters an error start if RMQ fails to startup due to this type of issue.

Revision history for this message
James Page (james-page) wrote :

Failure in service_restart is async to the subprocess call that makes it (so can't fix that one).

Timeout of 180 on wait should fix bug 2.

Marking Fix Released (this was included in the 16.10 charm release).

Changed in rabbitmq-server (Juju Charms Collection):
status: New → Fix Released
importance: Undecided → Medium
milestone: none → 16.10
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.