Rabbit OCF monitor returns 'generic error' when it should be 'not running' instead

Bug #1484280 reported by Vladimir Kuklin
This bug affects 1 person
Affects             Status        Importance  Assigned to      Milestone
Fuel for OpenStack  Fix Released  High        Vladimir Kuklin
  7.0.x             Fix Released  High        Vladimir Kuklin
  8.0.x             Fix Released  High        Vladimir Kuklin

Bug Description

It seems that https://review.openstack.org/#/c/199059/ introduced a regression: the Pacemaker monitor command for rabbitmq returns exit code 1 for a stopped resource, which marks the resource as failed and makes Pacemaker stop it. As a result, the stopped resource cannot be started at all.

Thus we need to revert the fix and find another solution for the original bug, because returning OCF_ERR_GENERIC from the monitor command for a stopped resource is an obvious mistake according to the Pacemaker documentation.

http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-ocf-return-codes.html
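
For reference, the return codes in question can be written down as a small table. The sketch below is illustrative only: the numeric values follow the OCF resource agent convention (the master/slave codes 8 and 9 are omitted), while `classify_probe_result` is a hypothetical helper, not part of Pacemaker or of the rabbitmq agent, that roughly mirrors how a monitor result reads for a resource that is expected to be stopped.

```python
from enum import IntEnum


class OCF(IntEnum):
    """Exit codes from the OCF resource agent convention (codes 8 and 9 omitted)."""
    SUCCESS = 0
    ERR_GENERIC = 1
    ERR_ARGS = 2
    ERR_UNIMPLEMENTED = 3
    ERR_PERM = 4
    ERR_INSTALLED = 5
    ERR_CONFIGURED = 6
    NOT_RUNNING = 7


def classify_probe_result(rc: int) -> str:
    """Hypothetical helper: interpret a monitor result for a resource expected to be stopped."""
    if rc == OCF.NOT_RUNNING:
        return "stopped"   # cleanly stopped, the expected answer
    if rc == OCF.SUCCESS:
        return "running"   # the resource is active
    return "failed"        # any other code in this sketch counts as a failure


print(classify_probe_result(OCF.NOT_RUNNING))
print(classify_probe_result(OCF.ERR_GENERIC))
```

With this reading, a monitor that answers 1 for a stopped rabbitmq resource is seen as a failure rather than as a clean "not running" state, which is the behaviour this report describes.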

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/212195

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Vladimir Kuklin (vkuklin)
status: Triaged → In Progress
Revision history for this message
Vladimir Kuklin (vkuklin) wrote : Re: Pacemaker does not start rabbitmq cluster

Here is a snippet from logs:

2015-08-12T21:24:49.956001+00:00 warning: warning: status_from_rc: Action 12 (p_rabbitmq-server:0_monitor_0) on node-1.test.domain.local failed (target: 7 vs. rc: 1): Error

Pacemaker expects exit code 7 and gets 1, so it does not decide to start anything at all. Maybe we need to revert only the line in the monitor command that handles get_status reporting rabbit as not running, so that ERR_GENERIC is not returned for this case.
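
To make the mismatch in that log line easier to spot when scanning many such warnings, the expected and actual codes can be pulled out mechanically. This is a minimal sketch, assuming the message keeps the `(target: N vs. rc: M)` shape seen above; `parse_rc_mismatch` and `OCF_NAMES` are hypothetical names introduced here, not part of any Pacemaker tool.

```python
import re

# Names for the codes relevant to this log line (OCF convention).
OCF_NAMES = {0: "OCF_SUCCESS", 1: "OCF_ERR_GENERIC", 7: "OCF_NOT_RUNNING"}

# Matches the "(target: N vs. rc: M)" fragment of a status_from_rc warning.
RC_PATTERN = re.compile(r"\(target: (\d+) vs\. rc: (\d+)\)")


def parse_rc_mismatch(line: str):
    """Return (expected, actual) exit codes, or None if the fragment is absent."""
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = ("warning: status_from_rc: Action 12 (p_rabbitmq-server:0_monitor_0) "
        "on node-1.test.domain.local failed (target: 7 vs. rc: 1): Error")
expected, actual = parse_rc_mismatch(line)
print(OCF_NAMES[expected], "expected, got", OCF_NAMES[actual])
```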

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This bug requires confirmation. It seems not valid, as Pacemaker should actually try to recover resources if monitor reports a generic error, and do nothing if the resource was gracefully stopped (see the original issue https://bugs.launchpad.net/fuel/+bug/1472230).

Changed in fuel:
status: In Progress → Incomplete
tags: added: pacemaker rabbitmq
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Please provide logs

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Vladimir, please look at this snippet: http://pastebin.com/9DKmCLXj. As you can see, an unexpected stop action (triggered manually) makes Pacemaker instantly mark the rabbit resource as failed AND restart it almost instantly. This means the bug is not valid and Pacemaker always recovers failed resources.

Changed in fuel:
status: Incomplete → Invalid
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The second snippet is what monitor reports when pcs resource disable was issued: http://pastebin.com/pm5v4SDF

As you can see, it reports a generic error instead of not running. Is this the original issue you reported in this bug? If so, it impacts nothing as far as I can see and should be low priority.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

I disagree with you here: if the resource is not running, monitor should return the not_running code, not an error code.

Changed in fuel:
importance: Critical → High
status: Invalid → Confirmed
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Yes, I agree. But this can't impact resource operations, AFAIK, hence this bug should be low / medium and targeted to 8.0

Changed in fuel:
importance: High → Low
status: Confirmed → Won't Fix
summary: - Pacemaker does not start rabbitmq cluster
+ Rabbit OCF monitor returns 'generic error' when it should be 'not
+ running' instead
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The monitor action for OCF rabbitmq is too fuzzy; I suggest we do not change it for this release. As I showed in https://bugs.launchpad.net/fuel/+bug/1484280/comments/5, this "wrong return code" impacts nothing. If we tried to fix it now, we'd only introduce more bugs in this fuzzy monitor logic.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/223548

Changed in fuel:
status: Won't Fix → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/7.0)

Fix proposed to branch: stable/7.0
Review: https://review.openstack.org/223552

Revision history for this message
Andrey Maximov (maximov) wrote :

I've added the 7.0 milestone, as it also affects the 7.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/223548
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=00f28b5cc48bdba389922fa460e4a6e6e173589e
Submitter: Jenkins
Branch: master

commit 00f28b5cc48bdba389922fa460e4a6e6e173589e
Author: Vladimir Kuklin <email address hidden>
Date: Tue Sep 15 14:39:08 2015 +0300

    Return NOT_RUNNING when beam is not RUNNING

    Change get_status to return NOT_RUNNING when
    beam is not_running. Otherwise, pacemaker
    will get stuck during rabbitmq failover and
    will not attempt to restart the failed resource

    Change-Id: I926a3eafa9968abdf07baa5f2d5c22480300fb30
    Closes-bug: #1484280
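
The change described in this commit message can be sketched as follows. This is not the agent's actual code, which is not shown in this report; `beam_is_running` and `app_responds` are hypothetical callables standing in for whatever checks the agent performs, and the handling of a process that exists but does not answer is an assumption made for illustration.

```python
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def get_status(beam_is_running, app_responds) -> int:
    """Sketch: report NOT_RUNNING when the beam process is absent."""
    if not beam_is_running():
        # No process at all: the resource is stopped, not broken.
        return OCF_NOT_RUNNING
    if not app_responds():
        # Assumption: a live process that does not answer is a real failure.
        return OCF_ERR_GENERIC
    return OCF_SUCCESS


# beam killed: the status is "not running" rather than a generic error
print(get_status(lambda: False, lambda: False))
```

The point of separating the two cases is that a caller which sees OCF_NOT_RUNNING can treat the resource as stopped and go on to start it again.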

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

I consider this bug critical, as it affects the cluster's ability to reassemble.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Bogdan, you are completely wrong. If you kill the beam process, your monitor script will return ERR_GENERIC, thus blocking the node's resource from ever being recovered, as the stop action for the failed resource will also get ERR_GENERIC instead of NOT_RUNNING. We just do not have this case in our automated tests, as it is half-synthetic: we experience it when rabbitmq segfaults or fails for some other reason, like being killed by some other process or the kernel, which we do not test AFAIK. Maybe the QA team can correct me.

Revision history for this message
Mike Scherbakov (mihgen) wrote :

Vladimir,
why is this Critical for 7.0 but High for 8.0?

Can you please provide the exact scenario under which this issue can be experienced? The reason I'm asking is that every fix has the potential to introduce regressions. If this is a corner case, or doesn't really impact all users, we might want to consider this as High.

Please don't merge a fix to stable/7.0 without clear test scenarios that fail now and pass after the fix is applied (and regression test suites are run), and without a clear description of the user impact.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Okay, I conducted some tests and figured out that the described behaviour is a corner case, so we can leave the current implementation as is and conduct additional research on the necessity of this fix.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/7.0)

Reviewed: https://review.openstack.org/223552
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=058c428f9f43a2d07b0abaa7bdb992918b25d531
Submitter: Jenkins
Branch: stable/7.0

commit 058c428f9f43a2d07b0abaa7bdb992918b25d531
Author: Vladimir Kuklin <email address hidden>
Date: Tue Sep 15 14:39:08 2015 +0300

    Return NOT_RUNNING when beam is not RUNNING

    Change get_status to return NOT_RUNNING when
    beam is not_running. Otherwise, pacemaker
    will get stuck during rabbitmq failover and
    will not attempt to restart the failed resource

    Change-Id: I926a3eafa9968abdf07baa5f2d5c22480300fb30
    Closes-bug: #1484280

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Fuel DevOps Robot (<email address hidden>) on branch: master
Review: https://review.openstack.org/212195
Reason: This review is > 4 weeks without comment and currently blocked by a core reviewer with a -2. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and contacting the reviewer with the -2 on this review to ensure you address their concerns.

tags: added: on-verification
Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :

Verified on MOS 8.0. The issue is not reproduced.

ISO info:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  openstack_version: "2015.1.0-8.0"
  api: "1.0"
  build_number: "141"
  build_id: "141"
  fuel-nailgun_sha: "1479c0b03ad928f2ea2a819fbf8218cff32e51b9"
  python-fuelclient_sha: "769df968e19d95a4ab4f12b1d2c76d385cf3168c"
  fuel-agent_sha: "cf699820fb0a4d20bef001861e006dc9797b5733"
  fuel-nailgun-agent_sha: "08e0a11cf1f29b705e4b910d9b9db5e9b708b6e3"
  astute_sha: "a090546d43c770ac27ca81c6f8c78ff0ba4a93e0"
  fuel-library_sha: "cd1b4b67d2b00fb10264d6626327688b170f0bf8"
  fuel-ostf_sha: "983d0e6fe64397d6ff3bd72311c26c44b02de3e8"
  fuel-createmirror_sha: "df6a93f7e2819d3dfa600052b0f901d9594eb0db"
  fuelmain_sha: "3303f41f99cf9167da01d503dd5d2c8dab141447"

Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :

Verified on MOS 7.0, custom ISO. The issue is not reproduced.

ISO info:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "1260"
  build_id: "2015-10-09_12-02-12"
  nailgun_sha: "edbae54d510edbaa1d379e9523febe5a0e5acd41"
  python-fuelclient_sha: "486bde57cda1badb68f915f66c61b544108606f3"
  fuel-agent_sha: "50e90af6e3d560e9085ff71d2950cfbcca91af67"
  fuel-nailgun-agent_sha: "d7027952870a35db8dc52f185bb1158cdd3d1ebd"
  astute_sha: "6c5b73f93e24cc781c809db9159927655ced5012"
  fuel-library_sha: "713698e88c6e1e4ed9ebad759a21266890898d57"
  fuel-ostf_sha: "2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c"
  fuelmain_sha: "a65d453215edb0284a2e4761be7a156bb5627677"

tags: removed: on-verification
Dmitry Pyzhov (dpyzhov)
tags: added: area-library
Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :

Verified on MOS 7.0 MU1. The issue is not reproduced.

ISO info:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "301"
  build_id: "301"
  nailgun_sha: "4162b0c15adb425b37608c787944d1983f543aa8"
  python-fuelclient_sha: "486bde57cda1badb68f915f66c61b544108606f3"
  fuel-agent_sha: "50e90af6e3d560e9085ff71d2950cfbcca91af67"
  fuel-nailgun-agent_sha: "d7027952870a35db8dc52f185bb1158cdd3d1ebd"
  astute_sha: "6c5b73f93e24cc781c809db9159927655ced5012"
  fuel-library_sha: "5d50055aeca1dd0dc53b43825dc4c8f7780be9dd"
  fuel-ostf_sha: "2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c"
  fuelmain_sha: "a65d453215edb0284a2e4761be7a156bb5627677"

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 7.0 → 8.0