service restart error not caught, waits forever
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
rabbitmq-server (Juju Charms Collection) | Fix Released | Medium | Unassigned |
Bug Description
I have deployed cs:trusty/
rabbitmq-server/0 maintenance executing 2.0-beta6 0/lxc/3 10.96.12.161 (config-changed) Waiting for rabbitmq app to start: /<email address hidden>
It is indeed waiting for a PID file to appear:
965 ? Ssl 0:00 /var/lib/
6634 ? S 0:00 \_ /usr/bin/python /var/lib/
8182 ? S 0:00 \_ /bin/sh /usr/sbin/
8191 ? S 0:00 \_ su rabbitmq -s /bin/sh -c /usr/lib/
8192 ? Ss 0:00 \_ sh -c /usr/lib/
8193 ? Sl 0:02 \_ /usr/lib/
That directory has no such file.
It turns out rabbit failed to start:
# cat /var/log/
ERROR: epmd error for host "euphoric-hook": nxdomain (non-existing domain)
Somehow that failure was not caught, and the charm is now telling rabbit to wait forever for a PID file that will never appear.
Looking at the charm code, wait_app() is called after every restart, roughly like this:
service_
rabbit.
So bug number one: the service_restart() failure was not caught.
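Below is a minimal sketch of what catching the failure could look like. It assumes that service_restart() passes through the True/False result of the service() helper quoted later in this report. The names restart_or_raise and ServiceRestartError are hypothetical; they are not part of the charm or of charmhelpers.

```python
from typing import Callable


class ServiceRestartError(Exception):
    """Hypothetical exception for a restart that reported failure."""


def restart_or_raise(restart: Callable[[str], bool], service_name: str) -> None:
    """Restart a service and raise if the restart helper reports failure.

    `restart` stands in for service_restart(), assumed to return False
    when the underlying init command exits non-zero.
    """
    if not restart(service_name):
        # Abort here so the wait for the PID file is never reached.
        raise ServiceRestartError(f"restart of {service_name} failed")
```

In the charm this would be called as `restart_or_raise(service_restart, 'rabbitmq-server')` before waiting on the app. The uncaught exception makes the hook exit non-zero, so Juju shows the unit in an error state instead of leaving it in maintenance indefinitely.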
Bug number two: wait_app() should not wait forever. It should implement a reasonable timeout.
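A minimal sketch of a bounded wait follows. It assumes the goal is simply to poll for the PID file until a deadline passes. wait_for_pid_file() is a hypothetical helper, and its timeout and interval defaults are placeholders. The charm's actual wait_app() is not reproduced here.

```python
import time


def wait_for_pid_file(path: str, timeout: float = 60.0, interval: float = 1.0) -> int:
    """Poll for a PID file and return the PID, or raise TimeoutError.

    Hypothetical helper; the timeout and interval defaults are placeholders.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(path) as f:
                return int(f.read().strip())
        except (FileNotFoundError, ValueError):
            # File missing, empty or partially written: keep polling.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{path} did not appear within {timeout} seconds")
        time.sleep(interval)
```

Raising TimeoutError makes the hook fail visibly instead of hanging. A caller could also catch it and report the problem some other way.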
I induced a failure to get the charm moving again, then tried a restart from the shell:
root@juju-machine-0-lxc-3:/var/log/rabbitmq# service rabbitmq-server restart
 * Restarting message broker rabbitmq-server [fail]
root@juju-machine-0-lxc-3:/var/log/rabbitmq# echo $?
1
It failed as it should, and for the same reason:
# cat /var/log/rabbitmq/startup_log
ERROR: epmd error for host "euphoric-hook": nxdomain (non-existing domain)
So why didn't the charm catch it? I don't know; the code looks fine at first glance:
def service(action, service_name):
    """Control a system service"""
    if init_is_systemd():
        cmd = ['systemctl', action, service_name]
    else:
        cmd = ['service', service_name, action]
    return subprocess.call(cmd) == 0
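The helper above never raises. subprocess.call() returns the exit status, and `== 0` turns it into a bool. A failed restart is therefore visible only if the caller checks that return value.

Below is a minimal sketch of a raising alternative using subprocess.run(check=True). It is an alternative design, not the existing helper. build_service_cmd() and run_or_raise() are hypothetical names.

```python
import subprocess
from typing import List


def build_service_cmd(action: str, service_name: str, systemd: bool) -> List[str]:
    """Build the same command lines as the service() helper quoted above."""
    if systemd:
        return ["systemctl", action, service_name]
    return ["service", service_name, action]


def run_or_raise(cmd: List[str]) -> None:
    """Run cmd; raise subprocess.CalledProcessError on a non-zero exit status."""
    subprocess.run(cmd, check=True)
```

With this variant, `run_or_raise(build_service_cmd("restart", "rabbitmq-server", systemd=False))` would raise CalledProcessError in the situation shown in the shell session above, where `service` exited with status 1. The failure could then no longer pass silently.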
I'm attaching the rabbit unit logs.