Activity log for bug #1747347

Date Who What changed Old value New value Message
2018-02-05 04:22:34 Tejeev Patel bug added bug
2018-05-30 11:30:38 Alvaro Uria tags canonical-bootstack
2018-05-30 11:30:53 Alvaro Uria bug added subscriber The Canonical Sysadmins
2018-05-30 11:30:59 Alvaro Uria bug added subscriber Canonical IS BootStack
2018-07-18 23:28:09 Haw Loeung bug added subscriber Haw Loeung
2018-08-07 04:44:22 Paul Gear description We saw a rabbimq-server unit which is the leader in a cluster with 2 others die abruptly. Cluster status on other units remained ok, except for inability to communicate with the downed node. We were unable to determine a root cause. We've seen this happen on 3 other occasions and on one we observed that before starting the rabbit service the jujud agent was in error state; afterwards it was in executing state. I do not know if that was the case this time. == LOGS ETC. == Service status after failure: ● rabbitmq-server.service - RabbitMQ Messaging Server Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled) Active: failed (Result: exit-code) since Mon 2018-02-05 02:05:45 UTC; 17min ago Process: 85156 ExecStop=/usr/sbin/rabbitmqctl stop (code=exited, status=2) Process: 1663808 ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmq-server-wait (code=exited, status=0/SUCCESS) Process: 1663807 ExecStart=/usr/sbin/rabbitmq-server (code=exited, status=140) Main PID: 1663807 (code=exited, status=140) Feb 05 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: * epmd reports: node 'rabbit' not running at all Feb 05 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: no other nodes on juju-a8345d-21-lxd-0 Feb 05 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: * suggestion: start the node Feb 05 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: current node details: Feb 05 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: - node name: 'rabbitmq-cli-85163@juju-a8345d-21-lxd-0' Feb 05 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: - home dir: . Feb 05 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: - cookie hash: 5teFoq/K1+djDqpfMuLHCA== Feb 05 02:05:45 juju-a8345d-21-lxd-0 systemd[1]: rabbitmq-server.service: Control process exited, code=exited status=2 Feb 05 02:05:45 juju-a8345d-21-lxd-0 systemd[1]: rabbitmq-server.service: Unit entered failed state. Feb 05 02:05:45 juju-a8345d-21-lxd-0 systemd[1]: rabbitmq-server.service: Failed with result 'exit-code'. ================================ From juju unit log, I see repetitions similar to the first section (02:05:42-02:05:44) going back for days. At the time of the failure 02:05:45, it abruptly cuts away to reporting inability to connect. This section of enteries (02:05:45-02:10:49) basically repeats until the service is manually restarted around 02:46:18, where it goes back to the WARNING and entries resembling the first section: 2018-02-05 02:05:42 DEBUG juju-log Granting permissions 2018-02-05 02:05:42 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'set_permissions', '-p', u'openstack', u'neutron', '.*', '.*', '.*'] 2018-02-05 02:05:42 DEBUG update-status Setting permissions for user "neutron" in vhost "openstack" ... 2018-02-05 02:05:42 INFO juju-log Setting HA policy to vhost 'openstack' 2018-02-05 02:05:42 DEBUG juju-log setting policy 2018-02-05 02:05:42 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'set_policy', '-p', u'openstack', 'HA', '^(?!amq\\.).*', '{"ha-mode": "all", "ha-sync-mode": "automatic"}'] 2018-02-05 02:05:43 DEBUG update-status Setting policy "HA" for pattern "^(?!amq\\.).*" to "{\"ha-mode\": \"all\", \"ha-sync-mode\": \"automatic\"}" with priority "0" ... 2018-02-05 02:05:44 WARNING juju-log min-cluster-size is not defined, race conditions may occur if this is not a single unit deployment. 2018-02-05 02:05:44 DEBUG juju-log All 3 rabbit units reporting clustered 2018-02-05 02:05:45 DEBUG update-status Error: unable to connect to node 'rabbit@juju-a8345d-21-lxd-0': nodedown 2018-02-05 02:05:45 DEBUG update-status 2018-02-05 02:05:45 DEBUG update-status DIAGNOSTICS 2018-02-05 02:05:45 DEBUG update-status =========== 2018-02-05 02:05:45 DEBUG update-status 2018-02-05 02:05:45 DEBUG update-status attempted to contact: ['rabbit@juju-a8345d-21-lxd-0'] 2018-02-05 02:05:45 DEBUG update-status 2018-02-05 02:05:45 DEBUG update-status rabbit@juju-a8345d-21-lxd-0: 2018-02-05 02:05:45 DEBUG update-status * connected to epmd (port 4369) on juju-a8345d-21-lxd-0 2018-02-05 02:05:45 DEBUG update-status * epmd reports: node 'rabbit' not running at all 2018-02-05 02:05:45 DEBUG update-status no other nodes on juju-a8345d-21-lxd-0 2018-02-05 02:05:45 DEBUG update-status * suggestion: start the node 2018-02-05 02:05:45 DEBUG update-status 2018-02-05 02:05:45 DEBUG update-status current node details: 2018-02-05 02:05:45 DEBUG update-status - node name: 'rabbitmq-cli-84965@juju-a8345d-21-lxd-0' 2018-02-05 02:05:45 DEBUG update-status - home dir: /var/lib/rabbitmq 2018-02-05 02:05:45 DEBUG update-status - cookie hash: 5teFoq/K1+djDqpfMuLHCA== 2018-02-05 02:05:45 DEBUG update-status 2018-02-05 02:05:45 DEBUG update-status Traceback (most recent call last): 2018-02-05 02:05:45 DEBUG update-status File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/update-status", line 681, in <module> 2018-02-05 02:05:45 DEBUG update-status update_clients() 2018-02-05 02:05:45 DEBUG update-status File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/update-status", line 147, in update_clients 2018-02-05 02:05:45 DEBUG update-status amqp_changed(relation_id=rid, remote_unit=unit) 2018-02-05 02:05:45 DEBUG update-status File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/update-status", line 198, in amqp_changed 2018-02-05 02:05:45 DEBUG update-status admin=settings.get('admin', False)) 2018-02-05 02:05:45 DEBUG update-status File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/update-status", line 126, in configure_amqp 2018-02-05 02:05:45 DEBUG update-status rabbit.create_user(username, password, admin) 2018-02-05 02:05:45 DEBUG update-status File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/rabbit_utils.py", line 216, in create_user 2018-02-05 02:05:45 DEBUG update-status exists, is_admin = user_exists(user) 2018-02-05 02:05:45 DEBUG update-status File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/rabbit_utils.py", line 206, in user_exists 2018-02-05 02:05:45 DEBUG update-status out = subprocess.check_output(cmd) 2018-02-05 02:05:45 DEBUG update-status File "/usr/lib/python2.7/subprocess.py", line 574, in check_output 2018-02-05 02:05:45 DEBUG update-status raise CalledProcessError(retcode, cmd, output=output) 2018-02-05 02:05:45 DEBUG update-status subprocess.CalledProcessError: Command '['/usr/sbin/rabbitmqctl', 'list_users']' returned non-zero exit status 2 2018-02-05 02:05:45 ERROR juju.worker.uniter.operation runhook.go:107 hook "update-status" failed: exit status 1 2018-02-05 02:10:45 DEBUG juju-log Hardening function 'install' 2018-02-05 02:10:45 DEBUG juju-log Hardening function 'upgrade_charm' 2018-02-05 02:10:45 DEBUG juju-log Hardening function 'config_changed' 2018-02-05 02:10:45 DEBUG juju-log Hardening function 'update_status' 2018-02-05 02:10:45 DEBUG juju-log No hardening applied to 'update_status' 2018-02-05 02:10:45 INFO juju-log Updating status. 2018-02-05 02:10:49 WARNING juju-log min-cluster-size is not defined, race conditions may occur if this is not a single unit deployment. ... 2018-02-05 02:46:18 DEBUG juju-log All 3 rabbit units reporting clustered 2018-02-05 02:46:18 WARNING juju-log min-cluster-size is not defined, race conditions may occur if this is not a single unit deployment. 2018-02-05 02:46:18 DEBUG juju-log All 3 rabbit units reporting clustered 2018-02-05 02:46:19 DEBUG juju-log Granting permissions 2018-02-05 02:46:19 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'set_permissions', '-p', u'openstack', u'cinder', '.*', '.*', '.*'] 2018-02-05 02:46:19 DEBUG update-status Setting permissions for user "cinder" in vhost "openstack" ... 2018-02-05 02:46:20 INFO juju-log Setting HA policy to vhost 'openstack' 2018-02-05 02:46:20 DEBUG juju-log setting policy 2018-02-05 02:46:20 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'set_policy', '-p', u'openstack', 'HA', '^(?!amq\\.).*', '{"ha-mode": "all", "ha-sync-mode": "automatic"}'] 2018-02-05 02:46:20 DEBUG update-status Setting policy "HA" for pattern "^(?!amq\\.).*" to "{\"ha-mode\": \"all\", \"ha-sync-mode\": \"automatic\"}" with priority "0" ... 2018-02-05 02:46:21 WARNING juju-log min-cluster-size is not defined, race conditions may occur if this is not a single unit deployment. ================================ In the rabbit logs (/var/log/rabbitmq/rabbit@juju-a8345d-21-lxd-0.log) we see it abruptly stop logging at the time of death and then start up again with the unit starting up at recovery: =INFO REPORT==== 5-Feb-2018::02:05:40 === Setting permissions for 'nova' in 'openstack' to '.*', '.*', '.*' =INFO REPORT==== 5-Feb-2018::02:05:42 === Setting permissions for 'neutron' in 'openstack' to '.*', '.*', '.*' =INFO REPORT==== 5-Feb-2018::02:45:55 === Starting RabbitMQ 3.5.7 on Erlang 18.3 Copyright (C) 2007-2015 Pivotal Software, Inc. Licensed under the MPL. See http://www.rabbitmq.com/ =INFO REPORT==== 5-Feb-2018::02:45:55 === node : rabbit@juju-a8345d-21-lxd-0 home dir : /var/lib/rabbitmq config file(s) : /etc/rabbitmq/rabbitmq.config cookie hash : 5teFoq/K1+djDqpfMuLHCA== log : /var/log/rabbitmq/rabbit@juju-a8345d-21-lxd-0.log sasl log : /var/log/rabbitmq/rabbit@juju-a8345d-21-lxd-0-sasl.log database dir : /var/lib/rabbitmq/mnesia/rabbit@juju-a8345d-21-lxd-0 =INFO REPORT==== 5-Feb-2018::02:45:55 === Memory limit set to 206389MB of 515974MB total. =INFO REPORT==== 5-Feb-2018::02:45:55 === Disk free limit set to 50MB =INFO REPORT==== 5-Feb-2018::02:45:55 === Limiting to approx 65436 file handles (58890 sockets) =INFO REPORT==== 5-Feb-2018::02:45:55 === FHC read buffering: ON FHC write buffering: ON ================================ syslog from time of failure to time of recovery: Feb 5 02:05:44 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480381 of user rabbitmq. Feb 5 02:05:45 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480382 of user rabbitmq. Feb 5 02:05:45 juju-a8345d-21-lxd-0 systemd[1]: rabbitmq-server.service: Main process exited, code=exited, status=140/n/a Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: Stopping and halting node 'rabbit@juju-a8345d-21-lxd-0' ... Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: Error: unable to connect to node 'rabbit@juju-a8345d-21-lxd-0': nodedown Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: DIAGNOSTICS Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: =========== Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: attempted to contact: ['rabbit@juju-a8345d-21-lxd-0'] Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: rabbit@juju-a8345d-21-lxd-0: Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: * connected to epmd (port 4369) on juju-a8345d-21-lxd-0 Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: * epmd reports: node 'rabbit' not running at all Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: no other nodes on juju-a8345d-21-lxd-0 Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: * suggestion: start the node Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: current node details: Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: - node name: 'rabbitmq-cli-85163@juju-a8345d-21-lxd-0' Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: - home dir: . Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: - cookie hash: 5teFoq/K1+djDqpfMuLHCA== Feb 5 02:05:45 juju-a8345d-21-lxd-0 systemd[1]: rabbitmq-server.service: Control process exited, code=exited status=2 Feb 5 02:05:45 juju-a8345d-21-lxd-0 systemd[1]: rabbitmq-server.service: Unit entered failed state. Feb 5 02:05:45 juju-a8345d-21-lxd-0 systemd[1]: rabbitmq-server.service: Failed with result 'exit-code'. Feb 5 02:10:01 juju-a8345d-21-lxd-0 CRON[85416]: (root) CMD (/usr/local/bin/collect_rabbitmq_stats.sh) Feb 5 02:10:01 juju-a8345d-21-lxd-0 systemd[1]: Started Session 20666 of user rabbitmq. Feb 5 02:10:01 juju-a8345d-21-lxd-0 CRON[85415]: (CRON) info (No MTA installed, discarding output) Feb 5 02:10:49 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480383 of user rabbitmq. Feb 5 02:12:12 juju-a8345d-21-lxd-0 systemd[1]: Created slice User Slice of ubuntu. Feb 5 02:12:12 juju-a8345d-21-lxd-0 systemd[1]: Starting User Manager for UID 1000... Feb 5 02:12:12 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480384 of user ubuntu. Feb 5 02:12:12 juju-a8345d-21-lxd-0 systemd[87149]: Reached target Timers. Feb 5 02:12:12 juju-a8345d-21-lxd-0 systemd[87149]: Reached target Paths. Feb 5 02:12:12 juju-a8345d-21-lxd-0 systemd[87149]: Reached target Sockets. Feb 5 02:12:12 juju-a8345d-21-lxd-0 systemd[87149]: Reached target Basic System. Feb 5 02:12:12 juju-a8345d-21-lxd-0 systemd[87149]: Reached target Default. Feb 5 02:12:12 juju-a8345d-21-lxd-0 systemd[87149]: Startup finished in 4ms. Feb 5 02:12:12 juju-a8345d-21-lxd-0 systemd[1]: Started User Manager for UID 1000. Feb 5 02:13:45 juju-a8345d-21-lxd-0 systemd[1]: Stopping User Manager for UID 1000... Feb 5 02:13:45 juju-a8345d-21-lxd-0 systemd[87149]: Stopped target Default. Feb 5 02:13:45 juju-a8345d-21-lxd-0 systemd[87149]: Stopped target Basic System. Feb 5 02:13:45 juju-a8345d-21-lxd-0 systemd[87149]: Stopped target Paths. Feb 5 02:13:45 juju-a8345d-21-lxd-0 systemd[87149]: Stopped target Timers. Feb 5 02:13:45 juju-a8345d-21-lxd-0 systemd[87149]: Stopped target Sockets. Feb 5 02:13:45 juju-a8345d-21-lxd-0 systemd[87149]: Reached target Shutdown. Feb 5 02:13:45 juju-a8345d-21-lxd-0 systemd[87149]: Starting Exit the Session... Feb 5 02:13:45 juju-a8345d-21-lxd-0 systemd[87149]: Received SIGRTMIN+24 from PID 87426 (kill). Feb 5 02:13:45 juju-a8345d-21-lxd-0 systemd[1]: Stopped User Manager for UID 1000. Feb 5 02:13:45 juju-a8345d-21-lxd-0 systemd[1]: Removed slice User Slice of ubuntu. Feb 5 02:14:20 juju-a8345d-21-lxd-0 systemd[1]: Created slice User Slice of ubuntu. Feb 5 02:14:20 juju-a8345d-21-lxd-0 systemd[1]: Starting User Manager for UID 1000... Feb 5 02:14:20 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480385 of user ubuntu. Feb 5 02:14:20 juju-a8345d-21-lxd-0 systemd[87437]: Reached target Paths. Feb 5 02:14:20 juju-a8345d-21-lxd-0 systemd[87437]: Reached target Sockets. Feb 5 02:14:20 juju-a8345d-21-lxd-0 systemd[87437]: Reached target Timers. Feb 5 02:14:20 juju-a8345d-21-lxd-0 systemd[87437]: Reached target Basic System. Feb 5 02:14:20 juju-a8345d-21-lxd-0 systemd[87437]: Reached target Default. Feb 5 02:14:20 juju-a8345d-21-lxd-0 systemd[87437]: Startup finished in 4ms. Feb 5 02:14:20 juju-a8345d-21-lxd-0 systemd[1]: Started User Manager for UID 1000. Feb 5 02:14:21 juju-a8345d-21-lxd-0 systemd[1]: Stopping User Manager for UID 1000... Feb 5 02:14:21 juju-a8345d-21-lxd-0 systemd[87437]: Stopped target Default. Feb 5 02:14:21 juju-a8345d-21-lxd-0 systemd[87437]: Stopped target Basic System. Feb 5 02:14:21 juju-a8345d-21-lxd-0 systemd[87437]: Stopped target Timers. Feb 5 02:14:21 juju-a8345d-21-lxd-0 systemd[87437]: Stopped target Sockets. Feb 5 02:14:21 juju-a8345d-21-lxd-0 systemd[87437]: Stopped target Paths. Feb 5 02:14:21 juju-a8345d-21-lxd-0 systemd[87437]: Reached target Shutdown. Feb 5 02:14:21 juju-a8345d-21-lxd-0 systemd[87437]: Starting Exit the Session... Feb 5 02:14:21 juju-a8345d-21-lxd-0 systemd[87437]: Received SIGRTMIN+24 from PID 87676 (kill). Feb 5 02:14:21 juju-a8345d-21-lxd-0 systemd[1]: Stopped User Manager for UID 1000. Feb 5 02:14:21 juju-a8345d-21-lxd-0 systemd[1]: Removed slice User Slice of ubuntu. Feb 5 02:15:01 juju-a8345d-21-lxd-0 CRON[87691]: (root) CMD (/usr/local/bin/collect_rabbitmq_stats.sh) Feb 5 02:15:01 juju-a8345d-21-lxd-0 systemd[1]: Started Session 20667 of user rabbitmq. Feb 5 02:15:02 juju-a8345d-21-lxd-0 CRON[87690]: (CRON) info (No MTA installed, discarding output) Feb 5 02:15:53 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480386 of user rabbitmq. Feb 5 02:17:01 juju-a8345d-21-lxd-0 CRON[89408]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly) Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[1]: Created slice User Slice of ubuntu. Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[1]: Starting User Manager for UID 1000... Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480387 of user ubuntu. Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Reached target Timers. Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Reached target Paths. Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Reached target Sockets. Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Reached target Basic System. Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Reached target Default. Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Startup finished in 6ms. Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[1]: Started User Manager for UID 1000. Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[1]: Stopping User Manager for UID 1000... Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Stopped target Default. Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Stopped target Basic System. Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Stopped target Timers. Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Stopped target Paths. Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Reached target Shutdown. Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Starting Exit the Session... Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Stopped target Sockets. Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Received SIGRTMIN+24 from PID 89658 (kill). Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[1]: Stopped User Manager for UID 1000. Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[1]: Removed slice User Slice of ubuntu. Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[1]: Created slice User Slice of ubuntu. Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[1]: Starting User Manager for UID 1000... Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480388 of user ubuntu. Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Reached target Paths. Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Reached target Sockets. Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Reached target Timers. Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Reached target Basic System. Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Reached target Default. Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Startup finished in 5ms. Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[1]: Started User Manager for UID 1000. Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[1]: Stopping User Manager for UID 1000... Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Reached target Shutdown. Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Starting Exit the Session... Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Stopped target Default. Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Stopped target Basic System. Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Stopped target Paths. Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Stopped target Timers. Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Stopped target Sockets. Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Received SIGRTMIN+24 from PID 89906 (kill). Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[1]: Stopped User Manager for UID 1000. Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[1]: Removed slice User Slice of ubuntu. Feb 5 02:18:16 juju-a8345d-21-lxd-0 systemd[1]: Created slice User Slice of ubuntu. Feb 5 02:18:16 juju-a8345d-21-lxd-0 systemd[1]: Starting User Manager for UID 1000... Feb 5 02:18:16 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480389 of user ubuntu. Feb 5 02:18:16 juju-a8345d-21-lxd-0 systemd[89914]: Reached target Timers. Feb 5 02:18:16 juju-a8345d-21-lxd-0 systemd[89914]: Reached target Sockets. Feb 5 02:18:16 juju-a8345d-21-lxd-0 systemd[89914]: Reached target Paths. Feb 5 02:18:16 juju-a8345d-21-lxd-0 systemd[89914]: Reached target Basic System. Feb 5 02:18:16 juju-a8345d-21-lxd-0 systemd[89914]: Reached target Default. Feb 5 02:18:16 juju-a8345d-21-lxd-0 systemd[89914]: Startup finished in 4ms. Feb 5 02:18:16 juju-a8345d-21-lxd-0 systemd[1]: Started User Manager for UID 1000. Feb 5 02:20:01 juju-a8345d-21-lxd-0 CRON[90079]: (root) CMD (/usr/local/bin/collect_rabbitmq_stats.sh) Feb 5 02:20:01 juju-a8345d-21-lxd-0 systemd[1]: Started Session 20670 of user rabbitmq. Feb 5 02:20:01 juju-a8345d-21-lxd-0 CRON[90078]: (CRON) info (No MTA installed, discarding output) Feb 5 02:20:57 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480390 of user rabbitmq. Feb 5 02:25:01 juju-a8345d-21-lxd-0 CRON[91893]: (root) CMD (/usr/local/bin/collect_rabbitmq_stats.sh) Feb 5 02:25:01 juju-a8345d-21-lxd-0 systemd[1]: Started Session 20671 of user rabbitmq. Feb 5 02:25:01 juju-a8345d-21-lxd-0 CRON[91892]: (CRON) info (No MTA installed, discarding output) Feb 5 02:26:01 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480391 of user rabbitmq. Feb 5 02:30:01 juju-a8345d-21-lxd-0 CRON[93681]: (root) CMD (/usr/local/bin/collect_rabbitmq_stats.sh) Feb 5 02:30:01 juju-a8345d-21-lxd-0 systemd[1]: snapd.refresh.timer: Adding 5h 28min 18.474604s random time. Feb 5 02:30:01 juju-a8345d-21-lxd-0 systemd[1]: Started Session 20672 of user rabbitmq. Feb 5 02:30:01 juju-a8345d-21-lxd-0 CRON[93680]: (CRON) info (No MTA installed, discarding output) Feb 5 02:31:05 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480392 of user rabbitmq. Feb 5 02:35:01 juju-a8345d-21-lxd-0 CRON[95476]: (root) CMD (/usr/local/bin/collect_rabbitmq_stats.sh) Feb 5 02:35:02 juju-a8345d-21-lxd-0 systemd[1]: Started Session 20673 of user rabbitmq. Feb 5 02:35:02 juju-a8345d-21-lxd-0 CRON[95475]: (CRON) info (No MTA installed, discarding output) Feb 5 02:36:09 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480393 of user rabbitmq. Feb 5 02:38:56 juju-a8345d-21-lxd-0 /usr/lib/snapd/snapd[301]: snapmgr.go:422: No snaps to auto-refresh found Feb 5 02:38:56 juju-a8345d-21-lxd-0 snapd[301]: 2018/02/05 02:38:56.796922 snapmgr.go:422: No snaps to auto-refresh found Feb 5 02:40:01 juju-a8345d-21-lxd-0 CRON[97247]: (root) CMD (/usr/local/bin/collect_rabbitmq_stats.sh) Feb 5 02:40:01 juju-a8345d-21-lxd-0 systemd[1]: Started Session 20674 of user rabbitmq. Feb 5 02:40:01 juju-a8345d-21-lxd-0 CRON[97246]: (CRON) info (No MTA installed, discarding output) Feb 5 02:41:13 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480394 of user rabbitmq. Feb 5 02:43:56 juju-a8345d-21-lxd-0 /usr/lib/snapd/snapd[301]: snapmgr.go:496: DEBUG: Next refresh scheduled for 2018-02-05 08:25:58.256131509 +0000 UTC. Feb 5 02:45:01 juju-a8345d-21-lxd-0 CRON[99929]: (root) CMD (/usr/local/bin/collect_rabbitmq_stats.sh) Feb 5 02:45:01 juju-a8345d-21-lxd-0 systemd[1]: Started Session 20675 of user rabbitmq. Feb 5 02:45:01 juju-a8345d-21-lxd-0 CRON[99928]: (CRON) info (No MTA installed, discarding output) Feb 5 02:45:44 juju-a8345d-21-lxd-0 systemd[1]: Failed to reset devices.list on /system.slice/rabbitmq-server.service: Operation not permitted Feb 5 02:45:44 juju-a8345d-21-lxd-0 systemd[1]: Starting RabbitMQ Messaging Server... Feb 5 02:45:44 juju-a8345d-21-lxd-0 systemd[1]: Failed to reset devices.list on /system.slice/run-user-1000.mount: Operation not permitted Feb 5 02:45:45 juju-a8345d-21-lxd-0 rabbitmq[100187]: Waiting for 'rabbit@juju-a8345d-21-lxd-0' ... Feb 5 02:45:45 juju-a8345d-21-lxd-0 rabbitmq[100187]: pid is 100191 ... Feb 5 02:46:09 juju-a8345d-21-lxd-0 systemd[1]: Started RabbitMQ Messaging Server. Feb 5 02:46:17 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480395 of user rabbitmq. Feb 5 02:46:18 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480396 of user rabbitmq. We saw a rabbimq-server unit which is the leader in a cluster with 2 others die abruptly. Cluster status on other units remained ok, except for inability to communicate with the downed node. We were unable to determine a root cause. We've seen this happen on 3 other occasions and on one we observed that before starting the rabbit service the jujud agent was in error state; afterwards it was in executing state. I do not know if that was the case this time. == LOGS ETC. == Service status after failure: https://pastebin.canonical.com/p/nbcGBmr2b5/ ================================ From juju unit log, I see repetitions similar to the first section (02:05:42-02:05:44) going back for days. At the time of the failure 02:05:45, it abruptly cuts away to reporting inability to connect. This section of enteries (02:05:45-02:10:49) basically repeats until the service is manually restarted around 02:46:18, where it goes back to the WARNING and entries resembling the first section: https://pastebin.canonical.com/p/ZGjtM647JZ/ ================================ In the rabbit logs (/var/log/rabbitmq/rabbit@juju-a8345d-21-lxd-0.log) we see it abruptly stop logging at the time of death and then start up again with the unit starting up at recovery: https://pastebin.canonical.com/p/vd5kj3CvDf/ ================================ syslog from time of failure to time of recovery: https://pastebin.canonical.com/p/pVpFTVHXC3/
2019-05-13 06:22:47 Chris MacNaughton charm-rabbitmq-server: importance Undecided High
2019-05-13 06:22:50 Chris MacNaughton charm-rabbitmq-server: status New Triaged