2018-08-07 04:44:22 |
Paul Gear |
description |
We saw a rabbimq-server unit which is the leader in a cluster with 2 others die abruptly. Cluster status on other units remained ok, except for inability to communicate with the downed node. We were unable to determine a root cause. We've seen this happen on 3 other occasions and on one we observed that before starting the rabbit service the jujud agent was in error state; afterwards it was in executing state. I do not know if that was the case this time.
== LOGS ETC. ==
Service status after failure:
● rabbitmq-server.service - RabbitMQ Messaging Server
Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Mon 2018-02-05 02:05:45 UTC; 17min ago
Process: 85156 ExecStop=/usr/sbin/rabbitmqctl stop (code=exited, status=2)
Process: 1663808 ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmq-server-wait (code=exited, status=0/SUCCESS)
Process: 1663807 ExecStart=/usr/sbin/rabbitmq-server (code=exited, status=140)
Main PID: 1663807 (code=exited, status=140)
Feb 05 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: * epmd reports: node 'rabbit' not running at all
Feb 05 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: no other nodes on juju-a8345d-21-lxd-0
Feb 05 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: * suggestion: start the node
Feb 05 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: current node details:
Feb 05 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: - node name: 'rabbitmq-cli-85163@juju-a8345d-21-lxd-0'
Feb 05 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: - home dir: .
Feb 05 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: - cookie hash: 5teFoq/K1+djDqpfMuLHCA==
Feb 05 02:05:45 juju-a8345d-21-lxd-0 systemd[1]: rabbitmq-server.service: Control process exited, code=exited status=2
Feb 05 02:05:45 juju-a8345d-21-lxd-0 systemd[1]: rabbitmq-server.service: Unit entered failed state.
Feb 05 02:05:45 juju-a8345d-21-lxd-0 systemd[1]: rabbitmq-server.service: Failed with result 'exit-code'.
================================
From juju unit log, I see repetitions similar to the first section (02:05:42-02:05:44) going back for days. At the time of the failure 02:05:45, it abruptly cuts away to reporting inability to connect. This section of enteries (02:05:45-02:10:49) basically repeats until the service is manually restarted around 02:46:18, where it goes back to the WARNING and entries resembling the first section:
2018-02-05 02:05:42 DEBUG juju-log Granting permissions
2018-02-05 02:05:42 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'set_permissions', '-p', u'openstack', u'neutron', '.*', '.*', '.*']
2018-02-05 02:05:42 DEBUG update-status Setting permissions for user "neutron" in vhost "openstack" ...
2018-02-05 02:05:42 INFO juju-log Setting HA policy to vhost 'openstack'
2018-02-05 02:05:42 DEBUG juju-log setting policy
2018-02-05 02:05:42 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'set_policy', '-p', u'openstack', 'HA', '^(?!amq\\.).*', '{"ha-mode": "all", "ha-sync-mode": "automatic"}']
2018-02-05 02:05:43 DEBUG update-status Setting policy "HA" for pattern "^(?!amq\\.).*" to "{\"ha-mode\": \"all\", \"ha-sync-mode\": \"automatic\"}" with priority "0" ...
2018-02-05 02:05:44 WARNING juju-log min-cluster-size is not defined, race conditions may occur if this is not a single unit deployment.
2018-02-05 02:05:44 DEBUG juju-log All 3 rabbit units reporting clustered
2018-02-05 02:05:45 DEBUG update-status Error: unable to connect to node 'rabbit@juju-a8345d-21-lxd-0': nodedown
2018-02-05 02:05:45 DEBUG update-status
2018-02-05 02:05:45 DEBUG update-status DIAGNOSTICS
2018-02-05 02:05:45 DEBUG update-status ===========
2018-02-05 02:05:45 DEBUG update-status
2018-02-05 02:05:45 DEBUG update-status attempted to contact: ['rabbit@juju-a8345d-21-lxd-0']
2018-02-05 02:05:45 DEBUG update-status
2018-02-05 02:05:45 DEBUG update-status rabbit@juju-a8345d-21-lxd-0:
2018-02-05 02:05:45 DEBUG update-status * connected to epmd (port 4369) on juju-a8345d-21-lxd-0
2018-02-05 02:05:45 DEBUG update-status * epmd reports: node 'rabbit' not running at all
2018-02-05 02:05:45 DEBUG update-status no other nodes on juju-a8345d-21-lxd-0
2018-02-05 02:05:45 DEBUG update-status * suggestion: start the node
2018-02-05 02:05:45 DEBUG update-status
2018-02-05 02:05:45 DEBUG update-status current node details:
2018-02-05 02:05:45 DEBUG update-status - node name: 'rabbitmq-cli-84965@juju-a8345d-21-lxd-0'
2018-02-05 02:05:45 DEBUG update-status - home dir: /var/lib/rabbitmq
2018-02-05 02:05:45 DEBUG update-status - cookie hash: 5teFoq/K1+djDqpfMuLHCA==
2018-02-05 02:05:45 DEBUG update-status
2018-02-05 02:05:45 DEBUG update-status Traceback (most recent call last):
2018-02-05 02:05:45 DEBUG update-status File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/update-status", line 681, in <module>
2018-02-05 02:05:45 DEBUG update-status update_clients()
2018-02-05 02:05:45 DEBUG update-status File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/update-status", line 147, in update_clients
2018-02-05 02:05:45 DEBUG update-status amqp_changed(relation_id=rid, remote_unit=unit)
2018-02-05 02:05:45 DEBUG update-status File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/update-status", line 198, in amqp_changed
2018-02-05 02:05:45 DEBUG update-status admin=settings.get('admin', False))
2018-02-05 02:05:45 DEBUG update-status File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/update-status", line 126, in configure_amqp
2018-02-05 02:05:45 DEBUG update-status rabbit.create_user(username, password, admin)
2018-02-05 02:05:45 DEBUG update-status File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/rabbit_utils.py", line 216, in create_user
2018-02-05 02:05:45 DEBUG update-status exists, is_admin = user_exists(user)
2018-02-05 02:05:45 DEBUG update-status File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/rabbit_utils.py", line 206, in user_exists
2018-02-05 02:05:45 DEBUG update-status out = subprocess.check_output(cmd)
2018-02-05 02:05:45 DEBUG update-status File "/usr/lib/python2.7/subprocess.py", line 574, in check_output
2018-02-05 02:05:45 DEBUG update-status raise CalledProcessError(retcode, cmd, output=output)
2018-02-05 02:05:45 DEBUG update-status subprocess.CalledProcessError: Command '['/usr/sbin/rabbitmqctl', 'list_users']' returned non-zero exit status 2
2018-02-05 02:05:45 ERROR juju.worker.uniter.operation runhook.go:107 hook "update-status" failed: exit status 1
2018-02-05 02:10:45 DEBUG juju-log Hardening function 'install'
2018-02-05 02:10:45 DEBUG juju-log Hardening function 'upgrade_charm'
2018-02-05 02:10:45 DEBUG juju-log Hardening function 'config_changed'
2018-02-05 02:10:45 DEBUG juju-log Hardening function 'update_status'
2018-02-05 02:10:45 DEBUG juju-log No hardening applied to 'update_status'
2018-02-05 02:10:45 INFO juju-log Updating status.
2018-02-05 02:10:49 WARNING juju-log min-cluster-size is not defined, race conditions may occur if this is not a single unit deployment.
...
2018-02-05 02:46:18 DEBUG juju-log All 3 rabbit units reporting clustered
2018-02-05 02:46:18 WARNING juju-log min-cluster-size is not defined, race conditions may occur if this is not a single unit deployment.
2018-02-05 02:46:18 DEBUG juju-log All 3 rabbit units reporting clustered
2018-02-05 02:46:19 DEBUG juju-log Granting permissions
2018-02-05 02:46:19 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'set_permissions', '-p', u'openstack', u'cinder', '.*', '.*', '.*']
2018-02-05 02:46:19 DEBUG update-status Setting permissions for user "cinder" in vhost "openstack" ...
2018-02-05 02:46:20 INFO juju-log Setting HA policy to vhost 'openstack'
2018-02-05 02:46:20 DEBUG juju-log setting policy
2018-02-05 02:46:20 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'set_policy', '-p', u'openstack', 'HA', '^(?!amq\\.).*', '{"ha-mode": "all", "ha-sync-mode": "automatic"}']
2018-02-05 02:46:20 DEBUG update-status Setting policy "HA" for pattern "^(?!amq\\.).*" to "{\"ha-mode\": \"all\", \"ha-sync-mode\": \"automatic\"}" with priority "0" ...
2018-02-05 02:46:21 WARNING juju-log min-cluster-size is not defined, race conditions may occur if this is not a single unit deployment.
================================
In the rabbit logs (/var/log/rabbitmq/rabbit@juju-a8345d-21-lxd-0.log) we see it abruptly stop logging at the time of death and then start up again with the unit starting up at recovery:
=INFO REPORT==== 5-Feb-2018::02:05:40 ===
Setting permissions for 'nova' in 'openstack' to '.*', '.*', '.*'
=INFO REPORT==== 5-Feb-2018::02:05:42 ===
Setting permissions for 'neutron' in 'openstack' to '.*', '.*', '.*'
=INFO REPORT==== 5-Feb-2018::02:45:55 ===
Starting RabbitMQ 3.5.7 on Erlang 18.3
Copyright (C) 2007-2015 Pivotal Software, Inc.
Licensed under the MPL. See http://www.rabbitmq.com/
=INFO REPORT==== 5-Feb-2018::02:45:55 ===
node : rabbit@juju-a8345d-21-lxd-0
home dir : /var/lib/rabbitmq
config file(s) : /etc/rabbitmq/rabbitmq.config
cookie hash : 5teFoq/K1+djDqpfMuLHCA==
log : /var/log/rabbitmq/rabbit@juju-a8345d-21-lxd-0.log
sasl log : /var/log/rabbitmq/rabbit@juju-a8345d-21-lxd-0-sasl.log
database dir : /var/lib/rabbitmq/mnesia/rabbit@juju-a8345d-21-lxd-0
=INFO REPORT==== 5-Feb-2018::02:45:55 ===
Memory limit set to 206389MB of 515974MB total.
=INFO REPORT==== 5-Feb-2018::02:45:55 ===
Disk free limit set to 50MB
=INFO REPORT==== 5-Feb-2018::02:45:55 ===
Limiting to approx 65436 file handles (58890 sockets)
=INFO REPORT==== 5-Feb-2018::02:45:55 ===
FHC read buffering: ON
FHC write buffering: ON
================================
syslog from time of failure to time of recovery:
Feb 5 02:05:44 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480381 of user rabbitmq.
Feb 5 02:05:45 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480382 of user rabbitmq.
Feb 5 02:05:45 juju-a8345d-21-lxd-0 systemd[1]: rabbitmq-server.service: Main process exited, code=exited, status=140/n/a
Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: Stopping and halting node 'rabbit@juju-a8345d-21-lxd-0' ...
Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: Error: unable to connect to node 'rabbit@juju-a8345d-21-lxd-0': nodedown
Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: DIAGNOSTICS
Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: ===========
Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: attempted to contact: ['rabbit@juju-a8345d-21-lxd-0']
Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: rabbit@juju-a8345d-21-lxd-0:
Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: * connected to epmd (port 4369) on juju-a8345d-21-lxd-0
Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: * epmd reports: node 'rabbit' not running at all
Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: no other nodes on juju-a8345d-21-lxd-0
Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: * suggestion: start the node
Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: current node details:
Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: - node name: 'rabbitmq-cli-85163@juju-a8345d-21-lxd-0'
Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: - home dir: .
Feb 5 02:05:45 juju-a8345d-21-lxd-0 rabbitmq[85156]: - cookie hash: 5teFoq/K1+djDqpfMuLHCA==
Feb 5 02:05:45 juju-a8345d-21-lxd-0 systemd[1]: rabbitmq-server.service: Control process exited, code=exited status=2
Feb 5 02:05:45 juju-a8345d-21-lxd-0 systemd[1]: rabbitmq-server.service: Unit entered failed state.
Feb 5 02:05:45 juju-a8345d-21-lxd-0 systemd[1]: rabbitmq-server.service: Failed with result 'exit-code'.
Feb 5 02:10:01 juju-a8345d-21-lxd-0 CRON[85416]: (root) CMD (/usr/local/bin/collect_rabbitmq_stats.sh)
Feb 5 02:10:01 juju-a8345d-21-lxd-0 systemd[1]: Started Session 20666 of user rabbitmq.
Feb 5 02:10:01 juju-a8345d-21-lxd-0 CRON[85415]: (CRON) info (No MTA installed, discarding output)
Feb 5 02:10:49 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480383 of user rabbitmq.
Feb 5 02:12:12 juju-a8345d-21-lxd-0 systemd[1]: Created slice User Slice of ubuntu.
Feb 5 02:12:12 juju-a8345d-21-lxd-0 systemd[1]: Starting User Manager for UID 1000...
Feb 5 02:12:12 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480384 of user ubuntu.
Feb 5 02:12:12 juju-a8345d-21-lxd-0 systemd[87149]: Reached target Timers.
Feb 5 02:12:12 juju-a8345d-21-lxd-0 systemd[87149]: Reached target Paths.
Feb 5 02:12:12 juju-a8345d-21-lxd-0 systemd[87149]: Reached target Sockets.
Feb 5 02:12:12 juju-a8345d-21-lxd-0 systemd[87149]: Reached target Basic System.
Feb 5 02:12:12 juju-a8345d-21-lxd-0 systemd[87149]: Reached target Default.
Feb 5 02:12:12 juju-a8345d-21-lxd-0 systemd[87149]: Startup finished in 4ms.
Feb 5 02:12:12 juju-a8345d-21-lxd-0 systemd[1]: Started User Manager for UID 1000.
Feb 5 02:13:45 juju-a8345d-21-lxd-0 systemd[1]: Stopping User Manager for UID 1000...
Feb 5 02:13:45 juju-a8345d-21-lxd-0 systemd[87149]: Stopped target Default.
Feb 5 02:13:45 juju-a8345d-21-lxd-0 systemd[87149]: Stopped target Basic System.
Feb 5 02:13:45 juju-a8345d-21-lxd-0 systemd[87149]: Stopped target Paths.
Feb 5 02:13:45 juju-a8345d-21-lxd-0 systemd[87149]: Stopped target Timers.
Feb 5 02:13:45 juju-a8345d-21-lxd-0 systemd[87149]: Stopped target Sockets.
Feb 5 02:13:45 juju-a8345d-21-lxd-0 systemd[87149]: Reached target Shutdown.
Feb 5 02:13:45 juju-a8345d-21-lxd-0 systemd[87149]: Starting Exit the Session...
Feb 5 02:13:45 juju-a8345d-21-lxd-0 systemd[87149]: Received SIGRTMIN+24 from PID 87426 (kill).
Feb 5 02:13:45 juju-a8345d-21-lxd-0 systemd[1]: Stopped User Manager for UID 1000.
Feb 5 02:13:45 juju-a8345d-21-lxd-0 systemd[1]: Removed slice User Slice of ubuntu.
Feb 5 02:14:20 juju-a8345d-21-lxd-0 systemd[1]: Created slice User Slice of ubuntu.
Feb 5 02:14:20 juju-a8345d-21-lxd-0 systemd[1]: Starting User Manager for UID 1000...
Feb 5 02:14:20 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480385 of user ubuntu.
Feb 5 02:14:20 juju-a8345d-21-lxd-0 systemd[87437]: Reached target Paths.
Feb 5 02:14:20 juju-a8345d-21-lxd-0 systemd[87437]: Reached target Sockets.
Feb 5 02:14:20 juju-a8345d-21-lxd-0 systemd[87437]: Reached target Timers.
Feb 5 02:14:20 juju-a8345d-21-lxd-0 systemd[87437]: Reached target Basic System.
Feb 5 02:14:20 juju-a8345d-21-lxd-0 systemd[87437]: Reached target Default.
Feb 5 02:14:20 juju-a8345d-21-lxd-0 systemd[87437]: Startup finished in 4ms.
Feb 5 02:14:20 juju-a8345d-21-lxd-0 systemd[1]: Started User Manager for UID 1000.
Feb 5 02:14:21 juju-a8345d-21-lxd-0 systemd[1]: Stopping User Manager for UID 1000...
Feb 5 02:14:21 juju-a8345d-21-lxd-0 systemd[87437]: Stopped target Default.
Feb 5 02:14:21 juju-a8345d-21-lxd-0 systemd[87437]: Stopped target Basic System.
Feb 5 02:14:21 juju-a8345d-21-lxd-0 systemd[87437]: Stopped target Timers.
Feb 5 02:14:21 juju-a8345d-21-lxd-0 systemd[87437]: Stopped target Sockets.
Feb 5 02:14:21 juju-a8345d-21-lxd-0 systemd[87437]: Stopped target Paths.
Feb 5 02:14:21 juju-a8345d-21-lxd-0 systemd[87437]: Reached target Shutdown.
Feb 5 02:14:21 juju-a8345d-21-lxd-0 systemd[87437]: Starting Exit the Session...
Feb 5 02:14:21 juju-a8345d-21-lxd-0 systemd[87437]: Received SIGRTMIN+24 from PID 87676 (kill).
Feb 5 02:14:21 juju-a8345d-21-lxd-0 systemd[1]: Stopped User Manager for UID 1000.
Feb 5 02:14:21 juju-a8345d-21-lxd-0 systemd[1]: Removed slice User Slice of ubuntu.
Feb 5 02:15:01 juju-a8345d-21-lxd-0 CRON[87691]: (root) CMD (/usr/local/bin/collect_rabbitmq_stats.sh)
Feb 5 02:15:01 juju-a8345d-21-lxd-0 systemd[1]: Started Session 20667 of user rabbitmq.
Feb 5 02:15:02 juju-a8345d-21-lxd-0 CRON[87690]: (CRON) info (No MTA installed, discarding output)
Feb 5 02:15:53 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480386 of user rabbitmq.
Feb 5 02:17:01 juju-a8345d-21-lxd-0 CRON[89408]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[1]: Created slice User Slice of ubuntu.
Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[1]: Starting User Manager for UID 1000...
Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480387 of user ubuntu.
Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Reached target Timers.
Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Reached target Paths.
Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Reached target Sockets.
Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Reached target Basic System.
Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Reached target Default.
Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Startup finished in 6ms.
Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[1]: Started User Manager for UID 1000.
Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[1]: Stopping User Manager for UID 1000...
Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Stopped target Default.
Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Stopped target Basic System.
Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Stopped target Timers.
Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Stopped target Paths.
Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Reached target Shutdown.
Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Starting Exit the Session...
Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Stopped target Sockets.
Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[89414]: Received SIGRTMIN+24 from PID 89658 (kill).
Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[1]: Stopped User Manager for UID 1000.
Feb 5 02:17:07 juju-a8345d-21-lxd-0 systemd[1]: Removed slice User Slice of ubuntu.
Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[1]: Created slice User Slice of ubuntu.
Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[1]: Starting User Manager for UID 1000...
Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480388 of user ubuntu.
Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Reached target Paths.
Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Reached target Sockets.
Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Reached target Timers.
Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Reached target Basic System.
Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Reached target Default.
Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Startup finished in 5ms.
Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[1]: Started User Manager for UID 1000.
Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[1]: Stopping User Manager for UID 1000...
Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Reached target Shutdown.
Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Starting Exit the Session...
Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Stopped target Default.
Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Stopped target Basic System.
Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Stopped target Paths.
Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Stopped target Timers.
Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Stopped target Sockets.
Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[89663]: Received SIGRTMIN+24 from PID 89906 (kill).
Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[1]: Stopped User Manager for UID 1000.
Feb 5 02:17:19 juju-a8345d-21-lxd-0 systemd[1]: Removed slice User Slice of ubuntu.
Feb 5 02:18:16 juju-a8345d-21-lxd-0 systemd[1]: Created slice User Slice of ubuntu.
Feb 5 02:18:16 juju-a8345d-21-lxd-0 systemd[1]: Starting User Manager for UID 1000...
Feb 5 02:18:16 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480389 of user ubuntu.
Feb 5 02:18:16 juju-a8345d-21-lxd-0 systemd[89914]: Reached target Timers.
Feb 5 02:18:16 juju-a8345d-21-lxd-0 systemd[89914]: Reached target Sockets.
Feb 5 02:18:16 juju-a8345d-21-lxd-0 systemd[89914]: Reached target Paths.
Feb 5 02:18:16 juju-a8345d-21-lxd-0 systemd[89914]: Reached target Basic System.
Feb 5 02:18:16 juju-a8345d-21-lxd-0 systemd[89914]: Reached target Default.
Feb 5 02:18:16 juju-a8345d-21-lxd-0 systemd[89914]: Startup finished in 4ms.
Feb 5 02:18:16 juju-a8345d-21-lxd-0 systemd[1]: Started User Manager for UID 1000.
Feb 5 02:20:01 juju-a8345d-21-lxd-0 CRON[90079]: (root) CMD (/usr/local/bin/collect_rabbitmq_stats.sh)
Feb 5 02:20:01 juju-a8345d-21-lxd-0 systemd[1]: Started Session 20670 of user rabbitmq.
Feb 5 02:20:01 juju-a8345d-21-lxd-0 CRON[90078]: (CRON) info (No MTA installed, discarding output)
Feb 5 02:20:57 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480390 of user rabbitmq.
Feb 5 02:25:01 juju-a8345d-21-lxd-0 CRON[91893]: (root) CMD (/usr/local/bin/collect_rabbitmq_stats.sh)
Feb 5 02:25:01 juju-a8345d-21-lxd-0 systemd[1]: Started Session 20671 of user rabbitmq.
Feb 5 02:25:01 juju-a8345d-21-lxd-0 CRON[91892]: (CRON) info (No MTA installed, discarding output)
Feb 5 02:26:01 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480391 of user rabbitmq.
Feb 5 02:30:01 juju-a8345d-21-lxd-0 CRON[93681]: (root) CMD (/usr/local/bin/collect_rabbitmq_stats.sh)
Feb 5 02:30:01 juju-a8345d-21-lxd-0 systemd[1]: snapd.refresh.timer: Adding 5h 28min 18.474604s random time.
Feb 5 02:30:01 juju-a8345d-21-lxd-0 systemd[1]: Started Session 20672 of user rabbitmq.
Feb 5 02:30:01 juju-a8345d-21-lxd-0 CRON[93680]: (CRON) info (No MTA installed, discarding output)
Feb 5 02:31:05 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480392 of user rabbitmq.
Feb 5 02:35:01 juju-a8345d-21-lxd-0 CRON[95476]: (root) CMD (/usr/local/bin/collect_rabbitmq_stats.sh)
Feb 5 02:35:02 juju-a8345d-21-lxd-0 systemd[1]: Started Session 20673 of user rabbitmq.
Feb 5 02:35:02 juju-a8345d-21-lxd-0 CRON[95475]: (CRON) info (No MTA installed, discarding output)
Feb 5 02:36:09 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480393 of user rabbitmq.
Feb 5 02:38:56 juju-a8345d-21-lxd-0 /usr/lib/snapd/snapd[301]: snapmgr.go:422: No snaps to auto-refresh found
Feb 5 02:38:56 juju-a8345d-21-lxd-0 snapd[301]: 2018/02/05 02:38:56.796922 snapmgr.go:422: No snaps to auto-refresh found
Feb 5 02:40:01 juju-a8345d-21-lxd-0 CRON[97247]: (root) CMD (/usr/local/bin/collect_rabbitmq_stats.sh)
Feb 5 02:40:01 juju-a8345d-21-lxd-0 systemd[1]: Started Session 20674 of user rabbitmq.
Feb 5 02:40:01 juju-a8345d-21-lxd-0 CRON[97246]: (CRON) info (No MTA installed, discarding output)
Feb 5 02:41:13 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480394 of user rabbitmq.
Feb 5 02:43:56 juju-a8345d-21-lxd-0 /usr/lib/snapd/snapd[301]: snapmgr.go:496: DEBUG: Next refresh scheduled for 2018-02-05 08:25:58.256131509 +0000 UTC.
Feb 5 02:45:01 juju-a8345d-21-lxd-0 CRON[99929]: (root) CMD (/usr/local/bin/collect_rabbitmq_stats.sh)
Feb 5 02:45:01 juju-a8345d-21-lxd-0 systemd[1]: Started Session 20675 of user rabbitmq.
Feb 5 02:45:01 juju-a8345d-21-lxd-0 CRON[99928]: (CRON) info (No MTA installed, discarding output)
Feb 5 02:45:44 juju-a8345d-21-lxd-0 systemd[1]: Failed to reset devices.list on /system.slice/rabbitmq-server.service: Operation not permitted
Feb 5 02:45:44 juju-a8345d-21-lxd-0 systemd[1]: Starting RabbitMQ Messaging Server...
Feb 5 02:45:44 juju-a8345d-21-lxd-0 systemd[1]: Failed to reset devices.list on /system.slice/run-user-1000.mount: Operation not permitted
Feb 5 02:45:45 juju-a8345d-21-lxd-0 rabbitmq[100187]: Waiting for 'rabbit@juju-a8345d-21-lxd-0' ...
Feb 5 02:45:45 juju-a8345d-21-lxd-0 rabbitmq[100187]: pid is 100191 ...
Feb 5 02:46:09 juju-a8345d-21-lxd-0 systemd[1]: Started RabbitMQ Messaging Server.
Feb 5 02:46:17 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480395 of user rabbitmq.
Feb 5 02:46:18 juju-a8345d-21-lxd-0 systemd[1]: Started Session c5480396 of user rabbitmq. |
We saw a rabbimq-server unit which is the leader in a cluster with 2 others die abruptly. Cluster status on other units remained ok, except for inability to communicate with the downed node. We were unable to determine a root cause. We've seen this happen on 3 other occasions and on one we observed that before starting the rabbit service the jujud agent was in error state; afterwards it was in executing state. I do not know if that was the case this time.
== LOGS ETC. ==
Service status after failure:
https://pastebin.canonical.com/p/nbcGBmr2b5/
================================
From juju unit log, I see repetitions similar to the first section (02:05:42-02:05:44) going back for days. At the time of the failure 02:05:45, it abruptly cuts away to reporting inability to connect. This section of enteries (02:05:45-02:10:49) basically repeats until the service is manually restarted around 02:46:18, where it goes back to the WARNING and entries resembling the first section:
https://pastebin.canonical.com/p/ZGjtM647JZ/
================================
In the rabbit logs (/var/log/rabbitmq/rabbit@juju-a8345d-21-lxd-0.log) we see it abruptly stop logging at the time of death and then start up again with the unit starting up at recovery:
https://pastebin.canonical.com/p/vd5kj3CvDf/
================================
syslog from time of failure to time of recovery:
https://pastebin.canonical.com/p/pVpFTVHXC3/ |
|