Failed to sign on to LRMd with Heartbeat/Pacemaker
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
cluster-glue (Ubuntu) | Fix Released | Undecided | Unassigned |
Trusty | Incomplete | Undecided | Unassigned |
Bug Description
I'm running a 2-node heartbeat/pacemaker cluster, which was working fine with Ubuntu 13.04.
After upgrading from Ubuntu 13.04 to Ubuntu 13.10, heartbeat/pacemaker keeps restarting the system: crmd fails to sign on to lrmd and heartbeat tries to recover.
As one system is already on Ubuntu 13.10 and the other is still running 13.04, I also tried it with the second node stopped; the behaviour is the same and occurs before any cluster communication happens.
Syslog:
Nov 14 15:53:06 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 1 (30 max) times
Nov 14 15:53:06 wolverine crmd[2464]: notice: crmd_client_
Nov 14 15:53:06 wolverine crmd[2464]: notice: crmd_client_
Nov 14 15:53:06 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 2 (30 max) times
Nov 14 15:53:06 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 3 (30 max) times
Nov 14 15:53:07 wolverine stonith-ng[2462]: notice: setup_cib: Watching for stonith topology changes
Nov 14 15:53:07 wolverine stonith-ng[2462]: notice: unpack_config: On loss of CCM Quorum: Ignore
Nov 14 15:53:08 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 4 (30 max) times
Nov 14 15:53:10 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 5 (30 max) times
Nov 14 15:53:12 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 6 (30 max) times
Nov 14 15:53:14 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 7 (30 max) times
Nov 14 15:53:16 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 8 (30 max) times
Nov 14 15:53:18 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 9 (30 max) times
Nov 14 15:53:20 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 10 (30 max) times
Nov 14 15:53:22 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 11 (30 max) times
Nov 14 15:53:24 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 12 (30 max) times
Nov 14 15:53:26 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 13 (30 max) times
Nov 14 15:53:28 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 14 (30 max) times
Nov 14 15:53:30 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 15 (30 max) times
Nov 14 15:53:32 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 16 (30 max) times
Nov 14 15:53:34 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 17 (30 max) times
Nov 14 15:53:36 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 18 (30 max) times
Nov 14 15:53:38 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 19 (30 max) times
Nov 14 15:53:40 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 20 (30 max) times
Nov 14 15:53:42 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 21 (30 max) times
Nov 14 15:53:44 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 22 (30 max) times
Nov 14 15:53:46 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 23 (30 max) times
Nov 14 15:53:48 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 24 (30 max) times
Nov 14 15:53:50 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 25 (30 max) times
Nov 14 15:53:52 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 26 (30 max) times
Nov 14 15:53:54 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 27 (30 max) times
Nov 14 15:53:56 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 28 (30 max) times
Nov 14 15:53:58 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 29 (30 max) times
Nov 14 15:54:00 wolverine crmd[2464]: error: do_lrm_control: Failed to sign on to the LRM 30 (max) times
Nov 14 15:54:00 wolverine crmd[2464]: error: do_log: FSA: Input I_ERROR from do_lrm_control() received in state S_STARTING
Nov 14 15:54:00 wolverine crmd[2464]: notice: do_state_
Nov 14 15:54:00 wolverine crmd[2464]: warning: do_recover: Fast-tracking shutdown in response to errors
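This looks like crmd being unable to talk to the lrmd that heartbeat starts. A quick way to check which lrmd binary is present and which package ships it (just a sketch; the paths are taken from the listing below, and the expected dpkg output is my assumption):
# Check which lrmd heartbeat launches and which package owns it
ls -la /usr/lib/heartbeat/lrmd /usr/lib/pacemaker/lrmd
dpkg -S /usr/lib/heartbeat/lrmd    # presumably reports cluster-glue
dpkg -S /usr/lib/pacemaker/lrmd    # presumably reports pacemaker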
Symlinking lrmd from the pacemaker package partly solved this problem:
root@wolverine ~ # mv /usr/lib/
root@wolverine ~ # cd /usr/lib/heartbeat/
root@wolverine /usr/lib/heartbeat # ln -s ../pacemaker/lrmd
root@wolverine /usr/lib/heartbeat # ls -la lrmd
lrwxrwxrwx 1 root root 17 Nov 14 16:35 lrmd -> ../pacemaker/lrmd
root@wolverine /usr/lib/heartbeat # ls -la lrmd*
lrwxrwxrwx 1 root root 17 Nov 14 16:35 lrmd -> ../pacemaker/lrmd
-rwxr-xr-x 1 root root 92816 Jul 18 17:55 lrmd.cluster-glue
root@wolverine /usr/lib/heartbeat #
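For clarity: the mv command above got truncated in the paste. Reconstructed from the lrmd.cluster-glue entry in the listing, the workaround amounts to the following (my reconstruction, not a verified fix):
# Assumed reconstruction of the workaround shown above
mv /usr/lib/heartbeat/lrmd /usr/lib/heartbeat/lrmd.cluster-glue   # keep the cluster-glue lrmd as a backup
cd /usr/lib/heartbeat/
ln -s ../pacemaker/lrmd                                           # heartbeat now starts pacemaker's lrmd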
Stopping heartbeat still results in an unexpected reboot:
Nov 14 16:37:27 wolverine crmd[2259]: notice: process_lrm_event: LRM operation drbd-backup:
Nov 14 16:37:28 wolverine crmd[2259]: notice: process_lrm_event: LRM operation drbd-rsyslog:
Nov 14 16:37:28 wolverine heartbeat: [2238]: WARN: Client [crm_node] pid 2673 failed authorization [no default client auth]
Nov 14 16:37:28 wolverine heartbeat: [2238]: ERROR: api_process_
Nov 14 16:37:28 wolverine attrd[2258]: notice: attrd_trigger_
Nov 14 16:37:28 wolverine attrd[2258]: notice: attrd_perform_
Nov 14 16:37:28 wolverine crmd[2259]: notice: process_lrm_event: LRM operation drbd-backup:
Nov 14 16:37:29 wolverine heartbeat: [2238]: WARN: Client [crm_node] pid 2700 failed authorization [no default client auth]
Nov 14 16:37:29 wolverine heartbeat: [2238]: ERROR: api_process_
Nov 14 16:37:29 wolverine attrd[2258]: notice: attrd_trigger_
Nov 14 16:37:29 wolverine attrd[2258]: notice: attrd_perform_
Nov 14 16:37:29 wolverine crmd[2259]: notice: process_lrm_event: LRM operation drbd-rsyslog:
Nov 14 16:37:59 wolverine heartbeat: [2238]: WARN: Client [crm_node] pid 2812 failed authorization [no default client auth]
Nov 14 16:37:59 wolverine heartbeat: [2238]: ERROR: api_process_
Nov 14 16:38:00 wolverine heartbeat: [2238]: WARN: Client [crm_node] pid 2839 failed authorization [no default client auth]
Nov 14 16:38:00 wolverine heartbeat: [2238]: ERROR: api_process_
Nov 14 16:38:05 wolverine heartbeat: [2238]: info: killing /usr/lib/
Nov 14 16:38:05 wolverine crmd[2259]: notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms
Nov 14 16:38:05 wolverine attrd[2258]: notice: attrd_trigger_
Nov 14 16:38:05 wolverine attrd[2258]: notice: attrd_perform_
Nov 14 16:38:06 wolverine crmd[2259]: notice: process_lrm_event: LRM operation drbd-backup:
Nov 14 16:38:06 wolverine crmd[2259]: notice: process_lrm_event: LRM operation drbd-rsyslog:
Nov 14 16:38:07 wolverine kernel: [ 255.385984] d-con backup: Requested state change failed by peer: Refusing to be Primary while peer is not outdated (-7)
Nov 14 16:38:07 wolverine kernel: [ 255.386415] d-con backup: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) disk( UpToDate -> Outdated ) pdsk( UpToDate -> DUnknown )
Nov 14 16:38:07 wolverine kernel: [ 255.386428] d-con backup: asender terminated
Nov 14 16:38:07 wolverine kernel: [ 255.386438] d-con backup: Terminating drbd_a_backup
Nov 14 16:38:07 wolverine kernel: [ 255.386693] d-con backup: Connection closed
Nov 14 16:38:07 wolverine kernel: [ 255.386716] d-con backup: conn( Disconnecting -> StandAlone )
Nov 14 16:38:07 wolverine kernel: [ 255.386718] d-con backup: receiver terminated
Nov 14 16:38:07 wolverine kernel: [ 255.386722] d-con backup: Terminating drbd_r_backup
Nov 14 16:38:07 wolverine kernel: [ 255.386750] block drbd0: disk( Outdated -> Failed )
Nov 14 16:38:07 wolverine kernel: [ 255.409861] block drbd0: bitmap WRITE of 0 pages took 0 jiffies
Nov 14 16:38:07 wolverine kernel: [ 255.409930] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Nov 14 16:38:07 wolverine kernel: [ 255.409943] block drbd0: disk( Failed -> Diskless )
Nov 14 16:38:07 wolverine kernel: [ 255.410041] block drbd0: drbd_bm_resize called with capacity == 0
Nov 14 16:38:07 wolverine kernel: [ 255.411773] d-con backup: Terminating drbd_w_backup
Nov 14 16:38:07 wolverine kernel: [ 255.466428] d-con rsyslog: Requested state change failed by peer: Refusing to be Primary while peer is not outdated (-7)
Nov 14 16:38:07 wolverine kernel: [ 255.466796] d-con rsyslog: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) disk( UpToDate -> Outdated ) pdsk( UpToDate -> DUnknown )
Nov 14 16:38:07 wolverine kernel: [ 255.466814] d-con rsyslog: asender terminated
Nov 14 16:38:07 wolverine kernel: [ 255.466832] d-con rsyslog: Terminating drbd_a_rsyslog
Nov 14 16:38:07 wolverine kernel: [ 255.467098] d-con rsyslog: Connection closed
Nov 14 16:38:07 wolverine kernel: [ 255.467121] d-con rsyslog: conn( Disconnecting -> StandAlone )
Nov 14 16:38:07 wolverine kernel: [ 255.467123] d-con rsyslog: receiver terminated
Nov 14 16:38:07 wolverine kernel: [ 255.467128] d-con rsyslog: Terminating drbd_r_rsyslog
Nov 14 16:38:07 wolverine kernel: [ 255.467169] block drbd1: disk( Outdated -> Failed )
Nov 14 16:38:07 wolverine kernel: [ 255.481716] block drbd1: bitmap WRITE of 0 pages took 0 jiffies
Nov 14 16:38:07 wolverine kernel: [ 255.481778] block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Nov 14 16:38:07 wolverine kernel: [ 255.481791] block drbd1: disk( Failed -> Diskless )
Nov 14 16:38:07 wolverine kernel: [ 255.481881] block drbd1: drbd_bm_resize called with capacity == 0
Nov 14 16:38:07 wolverine kernel: [ 255.482011] d-con rsyslog: Terminating drbd_w_rsyslog
Nov 14 16:38:07 wolverine heartbeat: [2238]: WARN: Client [crm_node] pid 2986 failed authorization [no default client auth]
Nov 14 16:38:07 wolverine heartbeat: [2238]: ERROR: api_process_
Nov 14 16:38:07 wolverine heartbeat: [2238]: WARN: Client [crm_node] pid 2989 failed authorization [no default client auth]
Nov 14 16:38:07 wolverine heartbeat: [2238]: ERROR: api_process_
Nov 14 16:38:07 wolverine attrd[2258]: notice: attrd_trigger_
Nov 14 16:38:07 wolverine attrd[2258]: notice: attrd_perform_
Nov 14 16:38:07 wolverine crmd[2259]: notice: process_lrm_event: LRM operation drbd-backup:
Nov 14 16:38:07 wolverine attrd[2258]: notice: attrd_trigger_
Nov 14 16:38:07 wolverine attrd[2258]: notice: attrd_perform_
Nov 14 16:38:07 wolverine crmd[2259]: notice: process_lrm_event: LRM operation drbd-rsyslog:
Nov 14 16:38:07 wolverine attrd[2258]: notice: attrd_perform_
Nov 14 16:38:07 wolverine attrd[2258]: notice: attrd_perform_
Nov 14 16:38:08 wolverine crmd[2259]: notice: do_state_
Nov 14 16:38:08 wolverine crmd[2259]: notice: lrm_state_
Nov 14 16:38:08 wolverine crmd[2259]: notice: do_lrm_control: Disconnected from the LRM
Nov 14 16:38:08 wolverine ccm: [2254]: info: client (pid=2259) removed from ccm
Nov 14 16:38:08 wolverine heartbeat: [2238]: EMERG: Rebooting system. Reason: /usr/lib/
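The repeated "failed authorization [no default client auth]" warnings for crm_node suggest heartbeat has no apiauth entry covering it. I have not tested this, but an entry along these lines in /etc/ha.d/ha.cf might be related (untested assumption, not a confirmed fix):
# Untested assumption: allow crm_node through heartbeat's client API authorization
apiauth crm_node uid=hacluster,root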
root@wolverine ~ # lsb_release -rd
Description: Ubuntu 13.10
Release: 13.10
root@wolverine ~ # apt-cache policy cluster-glue
cluster-glue:
Installed: 1.0.11+hg2754-1.1
Candidate: 1.0.11+hg2754-1.1
Version table:
*** 1.0.11+hg2754-1.1 0
500 http://
100 /var/lib/
root@wolverine ~ #
root@wolverine ~ # apt-cache policy heartbeat
heartbeat:
Installed: 1:3.0.5-3.1ubuntu1
Candidate: 1:3.0.5-3.1ubuntu1
Version table:
*** 1:3.0.5-3.1ubuntu1 0
500 http://
100 /var/lib/
root@wolverine ~ #
root@wolverine ~ # apt-cache policy pacemaker
pacemaker:
Installed: 1.1.10+
Candidate: 1.1.10+
Version table:
*** 1.1.10+
500 http://
100 /var/lib/
root@wolverine ~ #
Expected:
- Working heartbeat/pacemaker setup after the Ubuntu upgrade
What happened:
- The system reboots after about one minute due to heartbeat recovery attempts
Changed in cluster-glue (Ubuntu):
assignee: Andres Rodriguez (andreserl) → nobody
Changed in cluster-glue (Ubuntu):
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
tags: added: ubuntu-ha
Status changed to 'Confirmed' because the bug affects multiple users.