Pacemaker mysql resource shall have failcounts configured

Bug #1572440 reported by Bogdan Dobrelya
Affects            | Status        | Importance | Assigned to     | Milestone
Fuel for OpenStack | Fix Committed | High       | Bogdan Dobrelya |
Mitaka             | Fix Released  | High       | Bogdan Dobrelya |

Bug Description

Without failcounts configured, the resource may fail to start and then be left stopped.

How to reproduce:
* Deploy a cluster
* Make it impossible for the pacemaker mysql clone resource to start: add an `exit 1` to the start action of the OCF RA.
* Run `crm resource cleanup p_mysql-clone && crm resource restart p_mysql-clone`
* Check the transition summary with `crm_simulate -Ls | grep -v "\-INF"`
* Wait for a while, about 5 minutes, and recheck the transition summary.

Expected:
 The start operation must always be present in the transition plan, for example:
Transition Summary:
 * Start p_mysql:0 (n1)
 * Start p_mysql:1 (n2)
 * Start p_mysql:2 (n3)
 * Start p_mysql:3 (n4)
 * Start p_mysql:4 (n5)
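The expectation above can be checked mechanically. The following is a minimal sketch, assuming the summary lines keep the ` * Start <resource>:<n> (<node>)` shape shown here; the function name `has_pending_start` is hypothetical and not part of Pacemaker or Fuel.

```shell
#!/bin/sh
# Sketch: succeed if a transition summary read from stdin still schedules
# a Start (or Recover) action for the given resource.
# has_pending_start is a hypothetical helper name.
has_pending_start() {
    resource="$1"
    # Summary lines are assumed to look like " * Start p_mysql:0 (n1)".
    grep -Eq "^[[:space:]]*\*[[:space:]]+(Start|Recover)[[:space:]]+${resource}(:[0-9]+)?[[:space:]]"
}
```

It could be used as `crm_simulate -Ls | has_pending_start p_mysql && echo "start still planned"`, repeated after the waiting period.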

Actual:
 Pacemaker gives up starting the resource and the start operation disappears from the transition plan.

Solution:
 Configure failure modes for the pacemaker resource, for example like we do for the rabbit resource:
meta migration-threshold=10 failure-timeout=30s resource-stickiness=100
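For reference, `migration-threshold` is the number of failures after which Pacemaker stops running the resource on a node, and `failure-timeout` is the time after which recorded failures are treated as expired. A minimal sketch of applying these values by hand with `crm_resource` follows; the actual fix sets them in the fuel-library deployment code, the helper name `failure_meta_commands` is hypothetical, and the resource name is passed in as an assumption. The helper only prints the commands so it is safe to try.

```shell
#!/bin/sh
# Sketch: print the crm_resource commands that would set the failure-handling
# meta attributes from this report on a resource. Nothing is executed here;
# pipe the output to `sh` on a cluster node to apply it.
# failure_meta_commands is a hypothetical helper name.
failure_meta_commands() {
    resource="$1"
    for pair in migration-threshold=10 failure-timeout=30s resource-stickiness=100; do
        name="${pair%%=*}"     # text before the "="
        value="${pair#*=}"     # text after the "="
        printf 'crm_resource --resource %s --meta --set-parameter %s --parameter-value %s\n' \
            "$resource" "$name" "$value"
    done
}
```

For example, `failure_meta_commands p_mysql` prints three commands, one per attribute.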

Changed in fuel:
importance: Undecided → High
tags: added: galera pacemaker
description: updated
description: updated
tags: added: tech-debt
tags: added: area-library
Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/newton
Andrey Maximov (maximov) wrote :

How does it affect users?

Bogdan Dobrelya (bogdando) wrote :

The user impact is that manual recovery of the cluster may be needed when Pacemaker gives up starting the DB nodes. So it impacts DB availability.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/314031

Changed in fuel:
status: Triaged → In Progress
Bogdan Dobrelya (bogdando) wrote :

How to verify:
The MySQL resource definition should look like this:

primitive p_mysqld ocf:fuel:mysql-wss \
        params config="/etc/mysql/my.cnf" socket="/var/run/mysqld/mysqld.sock" test_conf="/etc/mysql/user.cnf" \
        meta failure-timeout=30s migration-threshold=10 resource-stickiness=100 \
... snip ...
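This check can be scripted. The following is a minimal sketch, assuming the definition is printed with unquoted `name=value` pairs as in the excerpt above; the function name `has_failure_meta` is hypothetical.

```shell
#!/bin/sh
# Sketch: succeed if the resource definition read from stdin contains all
# three meta attributes with the values from this report.
# has_failure_meta is a hypothetical helper name.
has_failure_meta() {
    config="$(cat)"
    for attr in failure-timeout=30s migration-threshold=10 resource-stickiness=100; do
        # Match the pair as a whole word so that e.g. "=10" does not match "=100".
        printf '%s\n' "$config" |
            grep -Eq "(^|[[:space:]])${attr}([[:space:]]|\$)" ||
            { echo "missing: $attr" >&2; return 1; }
    done
}
```

On a controller node it could be run as `crm configure show p_mysqld | has_failure_meta && echo OK`.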

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/314031
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=f55a2b98e0db2d5a54c09bbb806d53ef4b3cb794
Submitter: Jenkins
Branch: master

commit f55a2b98e0db2d5a54c09bbb806d53ef4b3cb794
Author: Bogdan Dobrelya <email address hidden>
Date: Mon May 9 11:53:35 2016 +0200

    Do not stop DB/MQ by a Pacemaker quorum

    Also configure fail modes for the DB resource
    the same as for the MQ one.

    Closes-bug: #1577689
    Closes-bug: #1572440

    Change-Id: I3e1ecf67b8bc205920ecd3157b3a422ecd9c6564
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/317934

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/317934
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=8c595c20fa155fab95d390939b50495503a0b028
Submitter: Jenkins
Branch: stable/mitaka

commit 8c595c20fa155fab95d390939b50495503a0b028
Author: Bogdan Dobrelya <email address hidden>
Date: Mon May 9 11:53:35 2016 +0200

    Do not stop DB/MQ by a Pacemaker quorum

    Also configure fail modes for the DB resource
    the same as for the MQ one.

    Closes-bug: #1577689
    Closes-bug: #1572440

    Change-Id: I3e1ecf67b8bc205920ecd3157b3a422ecd9c6564
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit f55a2b98e0db2d5a54c09bbb806d53ef4b3cb794)

Maksym Strukov (unbelll) wrote :

1. Deploy an HA cluster.
2. Edit /usr/lib/ocf/resource.d/fuel/mysql-wss and add `exit 1` to the start section.
3. Run `crm resource cleanup clone_p_mysqld && crm resource restart clone_p_mysqld`
4. Check the transition summary with `crm_simulate -Ls | grep -v "\-INF"`
You will get something like:
Transition Summary:
* Recover p_mysqld:0 (Started node-1.test.domain.local)
* Start p_mysqld:1 (node-3.test.domain.local)
* Start p_mysqld:2 (node-2.test.domain.local)

5. Repeat the last step every few minutes; periodically the following should appear:
* Start p_mysqld:2 (node-1.test.domain.local)

Verified as fixed in 9.0-mos-485
