Fuel for OpenStack

Rabbit OCF monitor returns 'generic error' when it should be 'not running' instead

Bug #1484280 reported by Vladimir Kuklin on 2015-08-12

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Fix Released	High	Vladimir Kuklin	Fuel for OpenStack 8.0
7.0.x	Fix Released	High	Vladimir Kuklin	Fuel for OpenStack 7.0-mu-1
8.0.x	Fix Released	High	Vladimir Kuklin	Fuel for OpenStack 8.0

Bug Description

It seems that https://review.openstack.org/#/c/199059/ introduced a regression: the pacemaker monitor command for rabbitmq returns exit code 1 for a stopped resource, which marks the resource as failed and makes pacemaker stop it. As a result, a stopped resource cannot be started at all.

Thus we need to revert the fix and find another solution for the original bug, because returning OCF_ERR_GENERIC from the monitor command for a stopped resource is an obvious mistake according to Pacemaker's OCF return code conventions.

http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-ocf-return-codes.html
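
To make the expectation concrete, here is a minimal Python sketch of the OCF return codes involved and of the "expected vs. actual" comparison this report is about. The numeric codes follow the OCF convention linked above; the helper name `probe_matches` is hypothetical and only illustrates the comparison, it is not Pacemaker's code.

```python
from enum import IntEnum


class OCF(IntEnum):
    """Subset of the standard OCF resource agent exit codes."""
    SUCCESS = 0        # resource is running / action succeeded
    ERR_GENERIC = 1    # unspecified failure
    NOT_RUNNING = 7    # resource is cleanly stopped


def probe_matches(expected: OCF, actual: int) -> bool:
    """Return True when a monitor result equals what the cluster expected.

    Hypothetical helper: it only mirrors the comparison of the expected
    code against the returned code described in this report.
    """
    return int(expected) == actual


# A probe of a stopped resource expects NOT_RUNNING (7).
print(probe_matches(OCF.NOT_RUNNING, 7))  # stopped resource reported correctly
print(probe_matches(OCF.NOT_RUNNING, 1))  # generic error: counted as a failure
```

Under this model, a monitor that answers 1 for a stopped resource can never satisfy a probe that expects 7.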

OpenStack Infra (hudson-openstack) wrote on 2015-08-12: Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/212195

Changed in fuel:
assignee:	Fuel Library Team (fuel-library) → Vladimir Kuklin (vkuklin)
status:	Triaged → In Progress

Vladimir Kuklin (vkuklin) wrote on 2015-08-12: Re: Pacemaker does not start rabbitmq cluster

Here is a snippet from logs:

2015-08-12T21:24:49.956001+00:00 warning: warning: status_from_rc: Action 12 (p_rabbitmq-server:0_monitor_0) on node-1.test.domain.local failed (target: 7 vs. rc: 1): Error

Pacemaker expects exit code 7 but gets 1, and therefore does not decide to start anything at all. Maybe we need to revert only the line in the monitor command that handles get_status reporting rabbit as not running, so that ERR_GENERIC is not returned in this case.
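
For anyone scanning logs for the same symptom, the following Python sketch pulls the expected and actual codes out of a line like the one above. The function name `parse_target_and_rc` and the regular expression are illustrative assumptions based only on this single log line; other Pacemaker versions may format the message differently.

```python
import re

# Pattern for the "(target: N vs. rc: M)" fragment seen in the log line above.
RC_PATTERN = re.compile(r"\(target:\s*(\d+)\s+vs\.\s+rc:\s*(\d+)\)")


def parse_target_and_rc(line: str):
    """Return (target, rc) as integers, or None if the fragment is absent."""
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = (
    "2015-08-12T21:24:49.956001+00:00 warning: warning: status_from_rc: "
    "Action 12 (p_rabbitmq-server:0_monitor_0) on node-1.test.domain.local "
    "failed (target: 7 vs. rc: 1): Error"
)
print(parse_target_and_rc(line))  # (7, 1)
```

A mismatch where the target is 7 and the rc is 1 is exactly the case discussed here: the probe expected "not running" and got "generic error".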

Bogdan Dobrelya (bogdando) wrote on 2015-08-13:

This bug requires confirmation. It seems not valid, as pacemaker should actually try to recover resources if monitor reported a generic error, and do nothing if the resource was gracefully stopped (see the original issue https://bugs.launchpad.net/fuel/+bug/1472230).

Changed in fuel:
status:	In Progress → Incomplete
tags:	added: pacemaker rabbitmq

Bogdan Dobrelya (bogdando) wrote on 2015-08-13:

Please provide logs

Bogdan Dobrelya (bogdando) wrote on 2015-08-13:

Vladimir, please look at this snippet: http://pastebin.com/9DKmCLXj. As you can see, an unexpected stop action (triggered manually) instantly makes pacemaker mark the rabbit resource as failed AND restart it almost instantly. This means the bug is not valid and pacemaker always recovers failed resources.

Changed in fuel:
status:	Incomplete → Invalid

Bogdan Dobrelya (bogdando) wrote on 2015-08-13:

The second snippet is what monitor reports when pcs resource disable was issued: http://pastebin.com/pm5v4SDF

As you can see, it reports generic error instead of not running. Is this the original issue you reported in this bug? If so, it impacts nothing as far as I can see and should be low priority.

Vladimir Kuklin (vkuklin) wrote on 2015-08-13:

I disagree with you here: if the resource is not running, monitor should return the not_running code, not an error code.

Changed in fuel:
importance:	Critical → High
status:	Invalid → Confirmed

Bogdan Dobrelya (bogdando) wrote on 2015-08-17:

Yes, I agree. But this can't impact resource operations, AFAIK, hence this bug should be low / medium and targeted to 8.0

Changed in fuel:
importance:	High → Low
status:	Confirmed → Won't Fix
summary:	- Pacemaker does not start rabbitmq cluster
	+ Rabbit OCF monitor returns 'generic error' when it should be 'not running' instead

Bogdan Dobrelya (bogdando) wrote on 2015-08-17:

The monitor action for OCF rabbitmq is too fuzzy; I suggest not changing it for this release. As I showed in https://bugs.launchpad.net/fuel/+bug/1484280/comments/5, this "wrong return code" impacts nothing. If we tried to fix it now, we'd only introduce more bugs in this fuzzy monitor logic.

Bogdan Dobrelya (bogdando) wrote on 2015-08-31:

#10

Addressed by https://review.openstack.org/#/c/217738

OpenStack Infra (hudson-openstack) wrote on 2015-09-15: Fix proposed to fuel-library (master)

#11

Fix proposed to branch: master
Review: https://review.openstack.org/223548

Changed in fuel:
status:	Won't Fix → In Progress

OpenStack Infra (hudson-openstack) wrote on 2015-09-15: Fix proposed to fuel-library (stable/7.0)

#12

Fix proposed to branch: stable/7.0
Review: https://review.openstack.org/223552

Andrey Maximov (maximov) wrote on 2015-09-15:

#13

I've added the 7.0 milestone, as it also affects the 7.0 release.

OpenStack Infra (hudson-openstack) wrote on 2015-09-15: Fix merged to fuel-library (master)

#14

Reviewed: https://review.openstack.org/223548
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=00f28b5cc48bdba389922fa460e4a6e6e173589e
Submitter: Jenkins
Branch: master

commit 00f28b5cc48bdba389922fa460e4a6e6e173589e
Author: Vladimir Kuklin <email address hidden>
Date: Tue Sep 15 14:39:08 2015 +0300

Return NOT_RUNNING when beam is not RUNNING

    Change get_status to return NOT_RUNNING when
    beam is not_running. Otherwise, pacemaker
    will get stuck during rabbitmq failover and
    will not attempt to restart the failed resource

Change-Id: I926a3eafa9968abdf07baa5f2d5c22480300fb30
Closes-bug: #1484280

Changed in fuel:
status:	In Progress → Fix Committed

Vladimir Kuklin (vkuklin) wrote on 2015-09-15:

#15

I consider this bug critical, as it affects the cluster's ability to reassemble.

Vladimir Kuklin (vkuklin) wrote on 2015-09-15:

#16

Bogdan, you are completely wrong. If you kill the beam process, your monitor script will return ERR_GENERIC, thus blocking the node's resource from ever being recovered, because the stop action for the failed resource will also get ERR_GENERIC instead of NOT_RUNNING. We just do not have this case in our automated tests, as it is half-synthetic: we experience it when rabbitmq segfaults or fails for some other reason, such as being killed by another process or by the kernel, which we do not test AFAIK. Maybe the QA team can correct me.
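
The argument can be modelled with a small Python sketch. This is not the fuel-library OCF script: `get_status` and `stop` here are simplified, hypothetical stand-ins, and the assumption that stop passes a status error straight through is taken only from the description in this comment. The `dead_beam_rc` parameter exists purely to compare the two behaviours.

```python
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def get_status(beam_alive: bool, dead_beam_rc: int) -> int:
    """Simplified status check.

    dead_beam_rc is the code reported when the beam process is gone;
    it is a parameter only so both behaviours can be compared.
    """
    return OCF_SUCCESS if beam_alive else dead_beam_rc


def stop(beam_alive: bool, dead_beam_rc: int) -> int:
    """Simplified stop action built on top of get_status (assumption)."""
    rc = get_status(beam_alive, dead_beam_rc)
    if rc == OCF_NOT_RUNNING:
        return OCF_SUCCESS  # already stopped, so stop trivially succeeds
    if rc == OCF_SUCCESS:
        # A real agent would shut the service down here; the sketch
        # assumes that the shutdown works.
        return OCF_SUCCESS
    return rc  # any other status is passed on as a stop failure


# beam was killed; the status check reports a generic error
print(stop(beam_alive=False, dead_beam_rc=OCF_ERR_GENERIC))  # 1
# beam was killed; the status check reports "not running"
print(stop(beam_alive=False, dead_beam_rc=OCF_NOT_RUNNING))  # 0
```

In this model a killed beam makes stop fail only when the status check answers with a generic error, which is the case described above.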

Mike Scherbakov (mihgen) wrote on 2015-09-15:

#17

Vladimir,
why is this Critical for 7.0 but High for 8.0?

Can you please provide the exact scenario under which this issue can be experienced? The reason I'm asking is that every fix has the potential to introduce regressions. If this is a corner case, or doesn't really impact all users, we might want to consider this as High.

Please don't merge a fix to stable/7.0 without clear test scenarios that fail now and pass after the fix is applied (with the regression test suites run), and without a clear description of the user impact.

Vladimir Kuklin (vkuklin) wrote on 2015-09-16:

#18

Okay, I conducted some tests and figured out that the described behaviour is a corner case, so we can leave the current implementation as is and conduct additional research on whether this fix is necessary.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-09-21: Fix merged to fuel-library (stable/7.0)

#19

Reviewed: https://review.openstack.org/223552
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=058c428f9f43a2d07b0abaa7bdb992918b25d531
Submitter: Jenkins
Branch: stable/7.0

commit 058c428f9f43a2d07b0abaa7bdb992918b25d531
Author: Vladimir Kuklin <email address hidden>
Date: Tue Sep 15 14:39:08 2015 +0300

Return NOT_RUNNING when beam is not RUNNING

    Change get_status to return NOT_RUNNING when
    beam is not_running. Otherwise, pacemaker
    will get stuck during rabbitmq failover and
    will not attempt to restart the failed resource

Change-Id: I926a3eafa9968abdf07baa5f2d5c22480300fb30
Closes-bug: #1484280

OpenStack Infra (hudson-openstack) wrote on 2015-10-06: Change abandoned on fuel-library (master)

#20

Change abandoned by Fuel DevOps Robot (<email address hidden>) on branch: master
Review: https://review.openstack.org/212195
Reason: This review is > 4 weeks without comment and currently blocked by a core reviewer with a -2. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and contacting the reviewer with the -2 on this review to ensure you address their concerns.

Dmitriy Kruglov (dkruglov) on 2015-10-15

tags:	added: on-verification

Dmitriy Kruglov (dkruglov) wrote on 2015-10-16:

#21

Verified on MOS 8.0. The issue is not reproduced.

ISO info:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  openstack_version: "2015.1.0-8.0"
  api: "1.0"
  build_number: "141"
  build_id: "141"
  fuel-nailgun_sha: "1479c0b03ad928f2ea2a819fbf8218cff32e51b9"
  python-fuelclient_sha: "769df968e19d95a4ab4f12b1d2c76d385cf3168c"
  fuel-agent_sha: "cf699820fb0a4d20bef001861e006dc9797b5733"
  fuel-nailgun-agent_sha: "08e0a11cf1f29b705e4b910d9b9db5e9b708b6e3"
  astute_sha: "a090546d43c770ac27ca81c6f8c78ff0ba4a93e0"
  fuel-library_sha: "cd1b4b67d2b00fb10264d6626327688b170f0bf8"
  fuel-ostf_sha: "983d0e6fe64397d6ff3bd72311c26c44b02de3e8"
  fuel-createmirror_sha: "df6a93f7e2819d3dfa600052b0f901d9594eb0db"
  fuelmain_sha: "3303f41f99cf9167da01d503dd5d2c8dab141447"

Dmitriy Kruglov (dkruglov) wrote on 2015-10-16:

#22

Verified on MOS 7.0, custom ISO. The issue is not reproduced.

ISO info:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "1260"
  build_id: "2015-10-09_12-02-12"
  nailgun_sha: "edbae54d510edbaa1d379e9523febe5a0e5acd41"
  python-fuelclient_sha: "486bde57cda1badb68f915f66c61b544108606f3"
  fuel-agent_sha: "50e90af6e3d560e9085ff71d2950cfbcca91af67"
  fuel-nailgun-agent_sha: "d7027952870a35db8dc52f185bb1158cdd3d1ebd"
  astute_sha: "6c5b73f93e24cc781c809db9159927655ced5012"
  fuel-library_sha: "713698e88c6e1e4ed9ebad759a21266890898d57"
  fuel-ostf_sha: "2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c"
  fuelmain_sha: "a65d453215edb0284a2e4761be7a156bb5627677"

tags:	removed: on-verification

Dmitry Pyzhov (dpyzhov) on 2015-10-22

tags:	added: area-library

Dmitriy Kruglov (dkruglov) wrote on 2015-11-04:

#23

Verified on MOS 7.0 MU1. The issue is not reproduced.

ISO info:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "301"
  build_id: "301"
  nailgun_sha: "4162b0c15adb425b37608c787944d1983f543aa8"
  python-fuelclient_sha: "486bde57cda1badb68f915f66c61b544108606f3"
  fuel-agent_sha: "50e90af6e3d560e9085ff71d2950cfbcca91af67"
  fuel-nailgun-agent_sha: "d7027952870a35db8dc52f185bb1158cdd3d1ebd"
  astute_sha: "6c5b73f93e24cc781c809db9159927655ced5012"
  fuel-library_sha: "5d50055aeca1dd0dc53b43825dc4c8f7780be9dd"
  fuel-ostf_sha: "2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c"
  fuelmain_sha: "a65d453215edb0284a2e4761be7a156bb5627677"

Dmitry Pyzhov (dpyzhov) on 2015-11-30

Changed in fuel:
milestone:	7.0 → 8.0
