Pacemaker "crm node standby" stops resource successfully, but lrmd still monitors it and causes "Failed actions"
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
pacemaker (Ubuntu) | Fix Released | Undecided | Unassigned |
Trusty | Fix Released | Undecided | Unassigned |
Bug Description
[Impact]
* When a user runs "crm node standby", lrmd may keep running the recurring
monitor for a resource that was already stopped on the standby node, which
produces "Failed actions" error messages.
[Test Case]
* Run "crm node standby" and verify that lrmd cancels the recurring monitor
for the resources stopped on the standby node, so that no "Failed actions"
appear (see the command sketch below).
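A minimal sketch of the test, assuming a cluster with a clone resource named
haproxy and a node named A1LB101 (names taken from the report below; adapt to
the local cluster):

  # put the node into standby; its resources should be stopped there
  crm node standby A1LB101

  # before the fix, a stale recurring monitor keeps firing and a
  # "Failed actions: haproxy_monitor ... not running" entry shows up here
  crm_mon -1 -f

  # bring the node back when done
  crm node online A1LB101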
[Regression Potential]
* Users have already tested the fix and are running it in production.
* The change is based on upstream fixes for lrmd recurring-monitor handling.
* Potential race conditions remain a risk (based on upstream history).
[Other Info]
* Original bug description:
----------------
The following situation was brought to my (~inaddy) attention:
""""""
* Environment
Ubuntu 14.04 LTS
Pacemaker 1.1.10+
* Priority
High
* Issue
I used "crm node standby" and the resource(haproxy) was stopped successfully. But lrmd still monitors it and causes "Failed actions".
-------
Node A1LB101 (167969461): standby
Online: [ A1LB102 ]
Resource Group: grpHaproxy
vip-internal (ocf::heartbeat
vip-external (ocf::heartbeat
vip-nfs (ocf::heartbeat
vip-iscsi (ocf::heartbeat
Resource Group: grpStonith1
prmStonith1-1 (stonith:
Clone Set: clnHaproxy [haproxy]
Started: [ A1LB102 ]
Stopped: [ A1LB101 ]
Clone Set: clnPing [ping]
Started: [ A1LB102 ]
Stopped: [ A1LB101 ]
Node Attributes:
* Node A1LB101:
* Node A1LB102:
+ default_ping_set : 400
Migration summary:
* Node A1LB101:
haproxy: migration-
* Node A1LB102:
Failed actions:
    haproxy_… (…, queued=0ms, exec=0ms): not running
-------
Excerpt from the log (ha-log.node1):
Jul 7 20:28:50 A1LB101 crmd[6364]: notice: te_rsc_command: Initiating action 42: stop haproxy_stop_0 on A1LB101 (local)
Jul 7 20:28:50 A1LB101 crmd[6364]: info: match_graph_event: Action haproxy_stop_0 (42) confirmed on A1LB101 (rc=0)
Jul 7 20:28:58 A1LB101 crmd[6364]: notice: process_lrm_event: A1LB101-
""""""
I wasn't able to reproduce this error so far, but the fix appears to be a straightforward cherry-pick of the following upstream patch set:
48f90f6 Fix: services: Do not allow duplicate recurring op entries
c29ab27 High: lrmd: Merge duplicate recurring monitor operations
348bb51 Fix: lrmd: Cancel recurring operations before stop action is executed
So I'm assuming (and currently testing) that this will fix the issue. I'm opening this public bug for the fix I'll provide after testing, and to ask others to test it as well.
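For reference, a rough sketch of picking those commits on top of the upstream
1.1.10 tree (repository URL, base tag and branch name are illustrative; in the
Ubuntu package the changes would be carried as patches under debian/patches):

  git clone https://github.com/ClusterLabs/pacemaker.git
  cd pacemaker
  git checkout -b lrmd-monitor-fix Pacemaker-1.1.10   # base tag assumed
  git cherry-pick 48f90f6 c29ab27 348bb51             # order as listed above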
Changed in pacemaker (Ubuntu):
assignee: nobody → Rafael David Tinoco (inaddy)
status: New → Confirmed
description: updated
summary: - Trusty Pacemaker "crm node standby" stops resource successfully, but lrmd still monitors it and causes "Failed actions"
         + Pacemaker "crm node standby" stops resource successfully, but lrmd still monitors it and causes "Failed actions"
Changed in pacemaker (Debian):
status: Unknown → New
tags: added: cts
Changed in pacemaker (Ubuntu):
assignee: Rafael David Tinoco (inaddy) → nobody
no longer affects: pacemaker (Debian)
## After applying the fix I could successfully put one node on standby. Resources migrated correctly.
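## The standby/online commands behind this verification are not pasted;
## presumably the standard pair was used (node name from the output below):
  crm node standby trustycluster01
  # ... and later, to bring it back:
  crm node online trustycluster01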
root@trustycluster02:~# crm_mon
Connection to the CIB terminated
Reconnecting...
root@trustycluster02:~# crm_mon -1
Last updated: Wed Aug 6 10:27:35 2014
Last change: Tue Aug 5 15:42:11 2014 via crm_attribute on trustycluster04
Stack: corosync
Current DC: trustycluster02 (739246088) - partition with quorum
Version: 1.1.10-42f2063
4 Nodes configured
5 Resources configured
Node trustycluster01 (739246087): standby
Online: [ trustycluster02 trustycluster03 trustycluster04 ]
p_fence_cluster01 (stonith:external/vcenter):   Started trustycluster02
p_fence_cluster02 (stonith:external/vcenter):   Started trustycluster03
p_fence_cluster03 (stonith:external/vcenter):   Started trustycluster04
p_fence_cluster04 (stonith:external/vcenter):   Started trustycluster02
clusterip (ocf::heartbeat:IPaddr2):     Started trustycluster03
## and resources were active on other nodes:
root@trustycluster01:~# crm_mon -1
Last updated: Wed Aug 6 10:29:48 2014
Last change: Wed Aug 6 10:27:47 2014 via crm_attribute on trustycluster01
Stack: corosync
Current DC: trustycluster02 (739246088) - partition with quorum
Version: 1.1.10-42f2063
4 Nodes configured
5 Resources configured
Node trustycluster01 (739246087): standby
Node trustycluster03 (739246089): standby
Online: [ trustycluster02 trustycluster04 ]
p_fence_cluster01 (stonith:external/vcenter):   Started trustycluster02
p_fence_cluster02 (stonith:external/vcenter):   Started trustycluster04
p_fence_cluster03 (stonith:external/vcenter):   Started trustycluster04
p_fence_cluster04 (stonith:external/vcenter):   Started trustycluster02
clusterip (ocf::heartbeat:IPaddr2):     Started trustycluster02
## After putting nodes back online:
root@trustycluster01:~# crm_mon -1
Last updated: Wed Aug 6 10:30:42 2014
Last change: Wed Aug 6 10:30:36 2014 via crm_attribute on trustycluster01
Stack: corosync
Current DC: trustycluster02 (739246088) - partition with quorum
Version: 1.1.10-42f2063
4 Nodes configured
5 Resources configured
Online: [ trustycluster01 trustycluster02 trustycluster03 trustycluster04 ]
p_fence_cluster01 (stonith:external/vcenter):   Started trustycluster02
p_fence_cluster02 (stonith:external/vcenter):   Started trustycluster04
p_fence_cluster03 (stonith:external/vcenter):   Started trustycluster01
clusterip (ocf::heartbeat:IPaddr2):     Started trustycluster01