Pacemaker "crm node standby" stops resource successfully, but lrmd still monitors it and causes "Failed actions"
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
pacemaker (Ubuntu) | Fix Released | Undecided | Unassigned |
Trusty | Fix Released | Undecided | Unassigned |
Bug Description
[Impact]
* When a user runs "crm node standby", lrmd may keep running the recurring
monitor for a resource that was already stopped on the standby node, which
produces "Failed actions" error messages.
[Test Case]
* Run "crm node standby" and verify that lrmd cancels the recurring monitor
for the resources stopped on the standby node, so that no "Failed actions"
appear (see the command sketch below).
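A minimal sketch of the test, assuming a cluster with a clone resource named
haproxy and a node named A1LB101 (names taken from the report below; adapt to
the local cluster):

  # put the node into standby; its resources should be stopped there
  crm node standby A1LB101

  # before the fix, a stale recurring monitor keeps firing and a
  # "Failed actions: haproxy_monitor ... not running" entry shows up here
  crm_mon -1 -f

  # bring the node back when done
  crm node online A1LB101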
[Regression Potential]
* Users have already tested the fix and are running it in production.
* The change is based on upstream fixes for lrmd recurring-monitor handling.
* Potential race conditions remain a risk (based on upstream history).
[Other Info]
* Original bug description:
----------------
The following situation was brought to my (~inaddy) attention:
""""""
* Environment
Ubuntu 14.04 LTS
Pacemaker 1.1.10+
* Priority
High
* Issue
I used "crm node standby" and the resource(haproxy) was stopped successfully. But lrmd still monitors it and causes "Failed actions".
-------
Node A1LB101 (167969461): standby
Online: [ A1LB102 ]
Resource Group: grpHaproxy
vip-internal (ocf::heartbeat
vip-external (ocf::heartbeat
vip-nfs (ocf::heartbeat
vip-iscsi (ocf::heartbeat
Resource Group: grpStonith1
prmStonith1-1 (stonith:
Clone Set: clnHaproxy [haproxy]
Started: [ A1LB102 ]
Stopped: [ A1LB101 ]
Clone Set: clnPing [ping]
Started: [ A1LB102 ]
Stopped: [ A1LB101 ]
Node Attributes:
* Node A1LB101:
* Node A1LB102:
+ default_ping_set : 400
Migration summary:
* Node A1LB101:
haproxy: migration-
* Node A1LB102:
Failed actions:
    haproxy_… (…, queued=0ms, exec=0ms): not running
-------
Excerpt from the log (ha-log.node1):
Jul 7 20:28:50 A1LB101 crmd[6364]: notice: te_rsc_command: Initiating action 42: stop haproxy_stop_0 on A1LB101 (local)
Jul 7 20:28:50 A1LB101 crmd[6364]: info: match_graph_event: Action haproxy_stop_0 (42) confirmed on A1LB101 (rc=0)
Jul 7 20:28:58 A1LB101 crmd[6364]: notice: process_lrm_event: A1LB101-
""""""
I wasn't able to reproduce this error so far, but the fix appears to be a straightforward cherry-pick of the following upstream patch set:
48f90f6 Fix: services: Do not allow duplicate recurring op entries
c29ab27 High: lrmd: Merge duplicate recurring monitor operations
348bb51 Fix: lrmd: Cancel recurring operations before stop action is executed
So I'm assuming (and currently testing) that this will fix the issue. I'm opening this public bug for the fix I'll provide after testing, and to ask others to test it as well.
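For reference, a rough sketch of picking those commits on top of the upstream
1.1.10 tree (repository URL, base tag and branch name are illustrative; in the
Ubuntu package the changes would be carried as patches under debian/patches):

  git clone https://github.com/ClusterLabs/pacemaker.git
  cd pacemaker
  git checkout -b lrmd-monitor-fix Pacemaker-1.1.10   # base tag assumed
  git cherry-pick 48f90f6 c29ab27 348bb51             # order as listed above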
Changed in pacemaker (Ubuntu):
assignee: nobody → Rafael David Tinoco (inaddy)
status: New → Confirmed
description: updated
summary: - Trusty Pacemaker "crm node standby" stops resource successfully, but lrmd still monitors it and causes "Failed actions"
         + Pacemaker "crm node standby" stops resource successfully, but lrmd still monitors it and causes "Failed actions"
Changed in pacemaker (Debian):
status: Unknown → New
tags: added: cts
Changed in pacemaker (Ubuntu):
assignee: Rafael David Tinoco (inaddy) → nobody
no longer affects: pacemaker (Debian)
## After applying the fix I could successfully put one node on standby. Resources migrated correctly.
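## The standby/online commands behind this verification are not pasted;
## presumably the standard pair was used (node name from the output below):
  crm node standby trustycluster01
  # ... and later, to bring it back:
  crm node online trustycluster01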
root@trustycluster02:~# crm_mon
Connection to the CIB terminated
Reconnecting...
root@trustycluster02:~# crm_mon -1
Last updated: Wed Aug 6 10:27:35 2014
Last change: Tue Aug 5 15:42:11 2014 via crm_attribute on trustycluster04
Stack: corosync
Current DC: trustycluster02 (739246088) - partition with quorum
Version: 1.1.10-42f2063
4 Nodes configured
5 Resources configured
Node trustycluster01 (739246087): standby
Online: [ trustycluster02 trustycluster03 trustycluster04 ]
p_fence_cluster01 (stonith:external/vcenter):   Started trustycluster02
p_fence_cluster02 (stonith:external/vcenter):   Started trustycluster03
p_fence_cluster03 (stonith:external/vcenter):   Started trustycluster04
p_fence_cluster04 (stonith:external/vcenter):   Started trustycluster02
clusterip (ocf::heartbeat:IPaddr2):     Started trustycluster03
## and resources were active on other nodes:
root@trustycluster01:~# crm_mon -1
Last updated: Wed Aug 6 10:29:48 2014
Last change: Wed Aug 6 10:27:47 2014 via crm_attribute on trustycluster01
Stack: corosync
Current DC: trustycluster02 (739246088) - partition with quorum
Version: 1.1.10-42f2063
4 Nodes configured
5 Resources configured
Node trustycluster01 (739246087): standby
Node trustycluster03 (739246089): standby
Online: [ trustycluster02 trustycluster04 ]
p_fence_cluster01 (stonith:external/vcenter):   Started trustycluster02
p_fence_cluster02 (stonith:external/vcenter):   Started trustycluster04
p_fence_cluster03 (stonith:external/vcenter):   Started trustycluster04
p_fence_cluster04 (stonith:external/vcenter):   Started trustycluster02
clusterip (ocf::heartbeat:IPaddr2):     Started trustycluster02
## After putting nodes back online:
root@trustycluster01:~# crm_mon -1
Last updated: Wed Aug 6 10:30:42 2014
Last change: Wed Aug 6 10:30:36 2014 via crm_attribute on trustycluster01
Stack: corosync
Current DC: trustycluster02 (739246088) - partition with quorum
Version: 1.1.10-42f2063
4 Nodes configured
5 Resources configured
Online: [ trustycluster01 trustycluster02 trustycluster03 trustycluster04 ]
p_fence_cluster01 (stonith:external/vcenter):   Started trustycluster02
p_fence_cluster02 (stonith:external/vcenter):   Started trustycluster04
p_fence_cluster03 (stonith:external/vcenter):   Started trustycluster01
clusterip (ocf::heartbeat:IPaddr2):     Started trustycluster01