Fuel for OpenStack

Pacemaker does not start MySQL in the right order after simultaneous MySQL killing on all controller nodes

Bug #1652008 reported by Aleksei Shishkin on 2016-12-22

This bug affects 2 people

	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Fix Committed	High	Vladimir Kuklin	Fuel for OpenStack 11.0
Mitaka	Fix Released	High	Vladimir Kuklin	Fuel for OpenStack 9.2
Newton	Fix Released	High	Sergii Golovatiuk	Fuel for OpenStack 10.1
Ocata	Fix Committed	High	Vladimir Kuklin	Fuel for OpenStack 11.0

Bug Description

Detailed bug description:
After simultaneous mysqld process killing on all controller nodes, Pacemaker detects that only one MySQL database is in Stopped state and tries to start in and join it to the Galera cluster (where no other MySQL instances really exist). Only after resource start timeout Pacemaker detects that all nodes are in stopped state and tries to start them in proper order.

Steps to reproduce:
From Fuel master run next commands:
fuel node | grep controller | awk '{print $10}' > controllers.list
for i in $(cat controllers.list); do ssh $i 'pkill -KILL -f mysqld' ; done

Expected results:
Pacemaker detects that MySQL database on all controller nodes are in stopped state.
Pacemaker starts MySQL database on all controller nodes in proper order and join it to Galera cluster.

Actual result:
Pacemaker detects that MySQL database is in stopped state only on one controller node.
Pacemaker tries to start MySQL database only on one controller node.
Start operation fails after resource start timeout because no other nodes available in the Galera cluster.
Pacemaker detects that MySQL database on all controller nodes are in stopped state.
Pacemaker starts MySQL database on all controller nodes in proper order and join it to Galera cluster.

Reproducibility:
Stably

Workaround:
No workaround exists

Impact:
MySQL cluster recovery takes longer than expected (by the value of the resource start timeout)

Description of the environment:
Versions of components: Mirantis OpenStack 9.x from the Master branch (drop week 44)

Tags:

Sergii Golovatiuk (sgolovatiuk) on 2016-12-22

Changed in fuel:
importance:	Undecided → High

Sergii Golovatiuk (sgolovatiuk) on 2016-12-22

Changed in fuel:
status:	New → Confirmed

Revision history for this message

Vladimir Kuklin (vkuklin) wrote on 2016-12-22:

Aleksey, so which behaviour is unexpected here? Pacemaker has cadence of monitoring and notices the first failure, tries to start things, fails and then assembles it on the second attempt. This is expected behaviour of how Pacemaker handles failures. We may want to play with timeouts to change the probability distribution here.

Still, this is an expected behaviour for Pacemaker and I do not see any bug here, unless we provide more detailed criteria on recovery time.

If the bug still persists, the only candidate for regression is

https://github.com/openstack/fuel-library/commit/dda74618cd03aa2720afe8b0980947bf735dd6b2

But this behaviour is configurable, e.g. we can set binary parameter for mysql resource back to mysqld_safe and see whether the issue is gone.

Summary

1. We need strict criteria for recovery time that is violated
2. Please test changing the resource binary in pacemaker configuration to mysqld_safe and report whether it helps

Changed in fuel:
status:	Confirmed → Incomplete

Sergii Golovatiuk (sgolovatiuk) on 2016-12-22

Changed in fuel:
status:	Incomplete → Confirmed

Revision history for this message

Aleksei Shishkin (ashishkin) wrote on 2016-12-29:

Download full text (3.4 KiB)

Vladimir, I have tested changing of the resource binary in pacemaker configuration to mysqld_safe and back to mysqld - there is no sensible difference in cluster recovery time.

My concern is that we have to wait for p_mysqld resource start timeout before mysql recovery will be started. It would be great if we can improve the OCF script to start cluster recovery immediately without waiting for start timeout of the MySQL instance which was not supposed to be started (because no other MySQL instances exist in Galera cluster to join it). From my point of view, current behaviour is not very logical (and therefore it caused the criticism from the customer side).
Also I know some cases when p_mysqld resource start timeout was increased by customer up to 20 minutes due to big MySQL database size (usually caused by Zabbix database) to successfully complete replication of entire database during MySQL cluster recovery. In this case restoring of the database can take quite a significant time.

Please let me know if it is possible to improve the OCF script and fix mentioned behavior.

Also I have gathered outputs from «pcs status» to clearly show you how this behavior looks like:

Thu Dec 29 11:27:27 UTC 2016
——> MySQL killed on all nodes

Thu Dec 29 11:27:37 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
Started: [ node-5.test.domain.local node-6.test.domain.local node-7.test.domain.local ]

Thu Dec 29 11:28:16 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
Started: [ node-6.test.domain.local node-7.test.domain.local ]
Stopped: [ node-5.test.domain.local ]

Thu Dec 29 11:28:17 UTC 2016
-----> MySQL started only on node-5

Thu Dec 29 11:28:24 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
     p_mysqld (ocf::fuel:mysql-wss): FAILED node-7.test.domain.local
     Started: [ node-6.test.domain.local ]
     Stopped: [ node-5.test.domain.local ]

Thu Dec 29 11:28:39 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
     p_mysqld (ocf::fuel:mysql-wss): FAILED node-7.test.domain.local
     p_mysqld (ocf::fuel:mysql-wss): FAILED node-6.test.domain.local
     Stopped: [ node-5.test.domain.local ]

Thu Dec 29 11:28:55 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
     p_mysqld (ocf::fuel:mysql-wss): FAILED node-7.test.domain.local
     Started: [ node-6.test.domain.local ]
     Stopped: [ node-5.test.domain.local ]

Thu Dec 29 11:29:11 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
      Clone Set: clone_p_mysqld [p_mysqld]
     Started: [ node-6.test.domain.local node-7.test.domain.local ]
     Stopped: [ node-5.test.domain.local ]

Thu Dec 29 11:33:03 UTC 2016
------> MySQL stopped on node-5.test.domain.local by Pacemaker

Clone Set: clone_p_mysqld [p_mysqld]
Stopped: [ node-5.test.domain.local node-6.test.domain.local node-7.test.domain.local

Thu Dec 29 11:33:13 UTC 2016
———> Pacemaker is starting MySQL one by one

Thu Dec 29 11:33:35 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
      Clone Set: clone_p_mysqld [p_mysqld]
     Started: [ node-5.test.domain.local ]
     Stopped: [ node-6.test.domain.local node-7.test.domain.local]

Thu Dec 29 11:34:06 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
Clone Set: clone_p_mysqld [p_mys...

Vladimir, I have tested changing of the resource binary in pacemaker configuration to mysqld_safe and back to mysqld - there is no sensible difference in cluster recovery time.

Please let me know if it is possible to improve the OCF script and fix mentioned behavior.

Also I have gathered outputs from «pcs status» to clearly show you how this behavior looks like:

Thu Dec 29 11:27:27 UTC 2016
——> MySQL killed on all nodes

Thu Dec 29 11:27:37 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
     Started: [ node-5.test.domain.local node-6.test.domain.local node-7.test.domain.local ]

Thu Dec 29 11:28:16 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
     Started: [ node-6.test.domain.local node-7.test.domain.local ]
     Stopped: [ node-5.test.domain.local ]

Thu Dec 29 11:28:17 UTC 2016
-----> MySQL started only on node-5

Thu Dec 29 11:28:24 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
     p_mysqld   (ocf::fuel:mysql-wss):  FAILED node-7.test.domain.local
     Started: [ node-6.test.domain.local ]
     Stopped: [ node-5.test.domain.local ]

Thu Dec 29 11:28:39 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
     p_mysqld   (ocf::fuel:mysql-wss):  FAILED node-7.test.domain.local
     p_mysqld   (ocf::fuel:mysql-wss):  FAILED node-6.test.domain.local
     Stopped: [ node-5.test.domain.local ]

Thu Dec 29 11:28:55 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
     p_mysqld   (ocf::fuel:mysql-wss):  FAILED node-7.test.domain.local
     Started: [ node-6.test.domain.local ]
     Stopped: [ node-5.test.domain.local ]

Thu Dec 29 11:33:03 UTC 2016
------> MySQL stopped on node-5.test.domain.local by Pacemaker

Clone Set: clone_p_mysqld [p_mysqld]
     Stopped: [ node-5.test.domain.local node-6.test.domain.local node-7.test.domain.local

Thu Dec 29 11:33:13 UTC 2016
———> Pacemaker is starting MySQL one by one

Thu Dec 29 11:34:06 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
      Clone Set: clone_p_mysqld [p_mysqld]
     Started: [ node-5.test.domain.local node-7.test.domain.local]
     Stopped: [ node-6.test.domain.local]

Thu Dec 29 11:34:16 UTC 2016
Clone Set: clone_p_mysqld [p_mysqld]
     Started: [ node-5.test.domain.local node-6.test.domain.local node-7.test.domain.local ]

------> MySQL started and everything is OK

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-01-09: Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/417896

Changed in fuel:
status:	Confirmed → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-01-10: Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/418318

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-01-10: Fix proposed to fuel-library (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/418319

OpenStack Infra (hudson-openstack) on 2017-01-11

Changed in fuel:
assignee:	Sergii Golovatiuk (sgolovatiuk) → Vladimir Kuklin (vkuklin)

OpenStack Infra (hudson-openstack) on 2017-01-12

Changed in fuel:
assignee:	Vladimir Kuklin (vkuklin) → Sergii Golovatiuk (sgolovatiuk)

OpenStack Infra (hudson-openstack) on 2017-01-13

Changed in fuel:
assignee:	Sergii Golovatiuk (sgolovatiuk) → Vladimir Kuklin (vkuklin)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-01-16: Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/417896
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=52ed0305afcc78041511f51740a0c5151fc3af47
Submitter: Jenkins
Branch: master

commit 52ed0305afcc78041511f51740a0c5151fc3af47
Author: Sergii Golovatiuk <email address hidden>
Date: Wed Jan 4 21:29:00 2017 +0100

Adding mysqld_safe back to OCF script.

    - mysqld_safe helped to restart mysqld thus the convergence of cluster
      was a few times faster
    - Added logging for more convinient debugging

Closes-Bug: #1652008

Change-Id: I6ee1bf451f7c954aeb7a2974bb980054422454b8
Signed-off-by: Sergii Golovatiuk <email address hidden>

Changed in fuel:
status:	In Progress → Fix Committed

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-01-16: Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/418318
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=6f9426688102b65bed0099d35a44d5aebb92a114
Submitter: Jenkins
Branch: stable/mitaka

commit 6f9426688102b65bed0099d35a44d5aebb92a114
Author: Sergii Golovatiuk <email address hidden>
Date: Wed Jan 4 21:29:00 2017 +0100

Adding mysqld_safe back to OCF script.

    - mysqld_safe helped to restart mysqld thus the convergence of cluster
      was a few times faster
    - Added logging for more convinient debugging

Closes-Bug: #1652008

Change-Id: I6ee1bf451f7c954aeb7a2974bb980054422454b8
Signed-off-by: Sergii Golovatiuk <email address hidden>

Ekaterina Shutova (eshutova) on 2017-01-16

tags:

added: on-verification

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-01-16: Fix merged to fuel-library (stable/newton)

Reviewed: https://review.openstack.org/418319
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=0a14c7399d1dd2e9bcef3b067839e53600517b8b
Submitter: Jenkins
Branch: stable/newton

commit 0a14c7399d1dd2e9bcef3b067839e53600517b8b
Author: Sergii Golovatiuk <email address hidden>
Date: Wed Jan 4 21:29:00 2017 +0100

Adding mysqld_safe back to OCF script.

    - mysqld_safe helped to restart mysqld thus the convergence of cluster
      was a few times faster
    - Added logging for more convinient debugging

Closes-Bug: #1652008

Change-Id: I6ee1bf451f7c954aeb7a2974bb980054422454b8
Signed-off-by: Sergii Golovatiuk <email address hidden>

Revision history for this message

Ekaterina Shutova (eshutova) wrote on 2017-01-24:

Verified on 9.2 snapshot #798.

tags:

removed: on-verification

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-02-27: Fix included in openstack/fuel-library 11.0.0.0rc1

#10

This issue was fixed in the openstack/fuel-library 11.0.0.0rc1 release candidate.

Ekaterina Shutova (eshutova) on 2017-04-07

tags:

added: on-verification

Revision history for this message

Ekaterina Shutova (eshutova) wrote on 2017-04-09:

#11

Verified on 10.1 #1578.

tags:

removed: on-verification

Report a bug

This report contains Public information

Everyone can see this information.

Duplicates of this bug

Bug #1652043

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.