Comment 16 for bug 1683075

Xav Paice (xavpaice) wrote :

Another instance. No upgrades have happened recently (in fact, I'm trying to prep the site for one).

Site is running 1.25.13 on Trusty. Juju is in HA on machines 0, 1 and 2, and there are several other machines. The cloud is MAAS 1.9.

The units residing on machine 2 (not in LXCs but on the machine itself) are in state 'failed'. I have tried restarting the machine and unit agents there, as well as those on machines 0 and 1, all the juju-db services, and all the rsyslog daemons.

I ran mgopurge (1.6) with all the state servers stopped.

When I run the command below, the unit log (with logging set to TRACE) shows the two lines that follow it:

juju run --unit ceph/1 'uptime'

2018-03-06 03:10:12 DEBUG juju.worker.uniter runlistener.go:61 RunCommands: {Commands:uptime RelationId:-1 RemoteUnitName: ForceRemoteUnit:false}
2018-03-06 03:10:12 TRACE juju.worker.uniter uniter.go:336 run commands: uptime

However, the command never returns, the agents don't move away from failed status, and hooks don't run. I don't see anything in the machine log that looks related (I can attach it, but it contains potentially sensitive info that would need scrubbing first).

Also, I note there are a number of rsyslog connection attempts and frequent disconnects, which could be a red herring or could be significant, e.g.:
2018-03-06 03:15:08 INFO juju.worker.dependency engine.go:352 "rsyslog-config-updater" manifold worker stopped: dial tcp 10.28.16.13:6514: getsockopt: connection refused
2018-03-06 03:15:08 DEBUG juju.worker.dependency engine.go:444 restarting dependents of "rsyslog-config-updater" manifold
2018-03-06 03:15:08 INFO juju.worker.dependency engine.go:294 starting "rsyslog-config-updater" manifold worker in 3s...
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:302 starting "rsyslog-config-updater" manifold worker
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:268 "rsyslog-config-updater" manifold requested "agent" resource
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:268 "rsyslog-config-updater" manifold requested "api-caller" resource
2018-03-06 03:15:11 DEBUG juju.worker.rsyslog worker.go:108 starting rsyslog worker mode 1 for "unit-os-cs-1" ""
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:309 running "rsyslog-config-updater" manifold worker
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:315 registered "rsyslog-config-updater" manifold worker
2018-03-06 03:15:11 INFO juju.worker.dependency engine.go:339 "rsyslog-config-updater" manifold worker started
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:444 restarting dependents of "rsyslog-config-updater" manifold
2018-03-06 03:15:11 DEBUG juju.worker.rsyslog worker.go:225 making syslog connection for "juju-unit-os-cs-1" to 10.28.16.13:6514
2018-03-06 03:15:11 INFO juju.worker.dependency engine.go:352 "rsyslog-config-updater" manifold worker stopped: dial tcp 10.28.16.13:6514: getsockopt: connection refused
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:444 restarting dependents of "rsyslog-config-updater" manifold
2018-03-06 03:15:11 INFO juju.worker.dependency engine.go:294 starting "rsyslog-config-updater" manifold worker in 3s...
2018-03-06 03:15:12 DEBUG juju.worker.leadership tracker.go:138 os-cs/1 renewing lease for os-cs leadership
2018-03-06 03:15:12 DEBUG juju.worker.leadership tracker.go:165 checking os-cs/1 for os-cs leadership
2018-03-06 03:15:13 DEBUG juju.worker.leadership tracker.go:180 os-cs/1 confirmed for os-cs leadership until 2018-03-06 03:16:12.552651545 +0000 UTC
2018-03-06 03:15:13 INFO juju.worker.leadership tracker.go:182 os-cs/1 will renew os-cs leadership at 2018-03-06 03:15:42.552651545 +0000 UTC
2018-03-06 03:15:14 DEBUG juju.worker.dependency engine.go:302 starting "rsyslog-config-updater" manifold worker
2018-03-06 03:15:14 DEBUG juju.worker.dependency engine.go:268 "rsyslog-config-updater" manifold requested "agent" resource
2018-03-06 03:15:14 DEBUG juju.worker.dependency engine.go:268 "rsyslog-config-updater" manifold requested "api-caller" resource
2018-03-06 03:15:14 DEBUG juju.worker.rsyslog worker.go:108 starting rsyslog worker mode 1 for "unit-os-cs-1" ""
2018-03-06 03:15:14 DEBUG juju.worker.dependency engine.go:309 running "rsyslog-config-updater" manifold worker
2018-03-06 03:15:14 DEBUG juju.worker.dependency engine.go:315 registered "rsyslog-config-updater" manifold worker
2018-03-06 03:15:14 INFO juju.worker.dependency engine.go:339 "rsyslog-config-updater" manifold worker started
2018-03-06 03:15:14 DEBUG juju.worker.dependency engine.go:444 restarting dependents of "rsyslog-config-updater" manifold
2018-03-06 03:15:14 DEBUG juju.worker.rsyslog worker.go:225 making syslog connection for "juju-unit-os-cs-1" to 10.28.16.13:6514
2018-03-06 03:15:14 DEBUG juju.worker.rsyslog worker.go:225 making syslog connection for "juju-unit-os-cs-1" to 10.28.2.22:6514
2018-03-06 03:15:14 DEBUG juju.worker.rsyslog worker.go:225 making syslog connection for "juju-unit-os-cs-1" to 10.28.24.13:6514
2018-03-06 03:15:14 DEBUG juju.worker.rsyslog worker.go:225 making syslog connection for "juju-unit-os-cs-1" to 10.28.6.13:6514
2018-03-06 03:15:14 DEBUG juju.worker.rsyslog worker.go:225 making syslog connection for "juju-unit-os-cs-1" to 10.28.8.13:6514
2018-03-06 03:15:14 DEBUG juju.worker.rsyslog worker.go:225 making syslog connection for "juju-unit-os-cs-1" to 10.28.2.20:6514
2018-03-06 03:15:14 DEBUG juju.worker.rsyslog worker.go:225 making syslog connection for "juju-unit-os-cs-1" to 10.28.16.12:6514
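
To see whether the refusals are confined to a single state server or spread across all of them, the unit log can be tallied per endpoint. The following is only a minimal sketch, assuming the log format shown above; the function name `tally_rsyslog_endpoints` and the sample list are hypothetical, and the sample lines are copied from the excerpt.

```python
import re
from collections import Counter

# Patterns are based only on the log lines quoted above (an assumption:
# other Juju versions may word these messages differently).
ATTEMPT = re.compile(r'making syslog connection for "[^"]+" to (\S+)')
REFUSED = re.compile(r'dial tcp (\d+\.\d+\.\d+\.\d+:\d+): .*connection refused')


def tally_rsyslog_endpoints(lines):
    """Return (attempts, refusals) as Counters keyed by host:port."""
    attempts, refusals = Counter(), Counter()
    for line in lines:
        match = ATTEMPT.search(line)
        if match:
            attempts[match.group(1)] += 1
            continue
        match = REFUSED.search(line)
        if match:
            refusals[match.group(1)] += 1
    return attempts, refusals


# Sample lines taken from the unit log excerpt above.
sample = [
    '2018-03-06 03:15:08 INFO juju.worker.dependency engine.go:352 "rsyslog-config-updater" manifold worker stopped: dial tcp 10.28.16.13:6514: getsockopt: connection refused',
    '2018-03-06 03:15:11 DEBUG juju.worker.rsyslog worker.go:225 making syslog connection for "juju-unit-os-cs-1" to 10.28.16.13:6514',
    '2018-03-06 03:15:11 INFO juju.worker.dependency engine.go:352 "rsyslog-config-updater" manifold worker stopped: dial tcp 10.28.16.13:6514: getsockopt: connection refused',
    '2018-03-06 03:15:14 DEBUG juju.worker.rsyslog worker.go:225 making syslog connection for "juju-unit-os-cs-1" to 10.28.2.22:6514',
]

attempts, refusals = tally_rsyslog_endpoints(sample)
for endpoint, count in sorted(attempts.items()):
    print(endpoint, "attempts:", count, "refused:", refusals[endpoint])
```

To run it against a real log, pass an open file object instead of `sample`; the function only needs an iterable of lines.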

Around the same time in syslog:
Mar 6 03:15:08 hostname rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="1778857" x-info="http://www.rsyslog.com"] exiting on signal 15.
Mar 6 03:15:12 hostname rsyslogd-2040: fatal error on disk queue 'action 14 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Mar 6 03:15:12 hostname rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="1788531" x-info="http://www.rsyslog.com"] start
Mar 6 03:15:12 hostname rsyslogd-2307: warning: ~ action is deprecated, consider using the 'stop' statement instead [try http://www.rsyslog.com/e/2307 ]
Mar 6 03:15:12 hostname rsyslogd-2221: module 'imuxsock' already in this config, cannot be added [try http://www.rsyslog.com/e/2221 ]
Mar 6 03:15:12 hostname rsyslogd: rsyslogd's groupid changed to 104
Mar 6 03:15:12 hostname rsyslogd: rsyslogd's userid changed to 101
Mar 6 03:15:12 hostname rsyslogd-2040: fatal error on disk queue 'action 4 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Mar 6 03:15:12 hostname rsyslogd-2040: fatal error on disk queue 'action 5 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Mar 6 03:15:12 hostname rsyslogd-2040: fatal error on disk queue 'action 6 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Mar 6 03:15:12 hostname rsyslogd-2040: fatal error on disk queue 'action 7 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Mar 6 03:15:12 hostname rsyslogd-2040: fatal error on disk queue 'action 8 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Mar 6 03:15:12 hostname rsyslogd-2040: fatal error on disk queue 'action 9 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Mar 6 03:15:12 hostname rsyslogd-2040: fatal error on disk queue 'action 11 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Mar 6 03:15:12 hostname rsyslogd-2040: fatal error on disk queue 'action 10 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Mar 6 03:15:12 hostname rsyslogd-2040: fatal error on disk queue 'action 12 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Mar 6 03:15:12 hostname rsyslogd-2040: fatal error on disk queue 'action 13 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Mar 6 03:15:15 hostname rsyslogd-2083: gnutls returned error on handshake: A TLS warning alert has been received. [try http://www.rsyslog.com/e/2083 ]
Mar 6 03:15:22 hostname rsyslogd-2027: imfile: could not persist state file machine-2 - data may be repeated on next startup. Is WorkDirectory set? [try http://www.rsyslog.com/e/2027 ]
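
For comparing one rsyslog restart with the next, the numbered rsyslogd errors can be counted per code. This is a rough sketch under the assumption that every numbered message carries the `rsyslogd-NNNN:` tag seen above; `count_rsyslog_codes` and the sample list are hypothetical names, and the sample lines are copied from the syslog excerpt.

```python
import re
from collections import Counter

# Matches the "rsyslogd-NNNN:" tag used for numbered rsyslog messages
# in the syslog excerpt above.
CODE = re.compile(r"rsyslogd-(\d+): ")


def count_rsyslog_codes(lines):
    """Count occurrences of each numbered rsyslogd message code."""
    counts = Counter()
    for line in lines:
        match = CODE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines taken from the syslog excerpt above.
sample_syslog = [
    "Mar 6 03:15:12 hostname rsyslogd-2040: fatal error on disk queue 'action 14 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]",
    "Mar 6 03:15:12 hostname rsyslogd-2307: warning: ~ action is deprecated, consider using the 'stop' statement instead [try http://www.rsyslog.com/e/2307 ]",
    "Mar 6 03:15:12 hostname rsyslogd: rsyslogd's groupid changed to 104",
    "Mar 6 03:15:12 hostname rsyslogd-2040: fatal error on disk queue 'action 4 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]",
]

codes = count_rsyslog_codes(sample_syslog)
for code, count in codes.most_common():
    print(code, count)
```

Lines without a numeric code, such as the start and exit notices, are skipped on purpose, so the totals reflect only warnings and errors.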

I tried clearing out the rsyslog config from /etc/rsyslog.d/25-juju.conf, emptying /var/spool/rsyslog to remove any broken files (with rsyslog stopped), and restarting the machine agent, but the .qi etc. files all came back immediately, as did these errors.