Pacemaker does not start MySQL in the right order after mysqld is killed simultaneously on all controller nodes
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | Fix Committed | High | Vladimir Kuklin |
Mitaka | Fix Released | High | Vladimir Kuklin |
Newton | Fix Released | High | Sergii Golovatiuk |
Ocata | Fix Committed | High | Vladimir Kuklin |
Bug Description
Detailed bug description:
After the mysqld process is killed simultaneously on all controller nodes, Pacemaker detects that only one MySQL instance is in the Stopped state and tries to start it and join it to the Galera cluster (where no other MySQL instances actually exist). Only after the resource start timeout expires does Pacemaker detect that all nodes are in the Stopped state and try to start them in the proper order.
Steps to reproduce:
From the Fuel master, run the following commands:
fuel node | grep controller | awk '{print $10}' > controllers.list
for i in $(cat controllers.list); do ssh $i 'pkill -KILL -f mysqld' ; done
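The loop above kills mysqld on one node at a time, so the kills are separated by the SSH round-trip time. A minimal sketch of a more simultaneous variant follows, assuming the same controllers.list file and passwordless SSH from the Fuel master. The function name `kill_mysqld_everywhere` and the `REMOTE_SHELL` override are hypothetical and exist only so the function can be dry-run:

```shell
#!/bin/sh
# Hypothetical helper: run the kill on every listed node in parallel.
# REMOTE_SHELL defaults to "ssh -n"; set it to "echo" for a dry run.
kill_mysqld_everywhere() {
    list_file=$1
    while read -r node; do
        # Skip blank lines in the list.
        [ -n "$node" ] || continue
        # Start each kill in the background so the nodes are hit
        # at nearly the same time rather than one after another.
        ${REMOTE_SHELL:-ssh -n} "$node" 'pkill -KILL -f mysqld' &
    done < "$list_file"
    # Block until every background kill has returned.
    wait
}
```

It could be called as `kill_mysqld_everywhere controllers.list`.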
Expected results:
Pacemaker detects that the MySQL database is in the stopped state on all controller nodes.
Pacemaker starts the MySQL database on all controller nodes in the proper order and joins them to the Galera cluster.
Actual result:
Pacemaker detects that the MySQL database is in the stopped state on only one controller node.
Pacemaker tries to start the MySQL database on only that controller node.
The start operation fails after the resource start timeout because no other nodes are available in the Galera cluster.
Pacemaker detects that the MySQL database is in the stopped state on all controller nodes.
Pacemaker starts the MySQL database on all controller nodes in the proper order and joins them to the Galera cluster.
Reproducibility:
Consistently reproducible
Workaround:
No workaround exists
Impact:
MySQL cluster recovery takes longer than expected (by the value of the resource start timeout)
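To put a number on the delay, one option is to time how long the Galera cluster takes to return to full size after the kill. The sketch below is only an illustration: it assumes the `mysql` client on a controller can log in without extra options, and the helper names `galera_cluster_size` and `wait_for_cluster` as well as the `SIZE_CMD` override are hypothetical.

```shell
#!/bin/sh
# Prints the current Galera cluster size as seen by the local node.
# Assumption: the mysql client can connect without extra options.
galera_cluster_size() {
    mysql -N -B -e "SHOW STATUS LIKE 'wsrep_cluster_size'" 2>/dev/null |
        awk '{print $2}'
}

# Hypothetical helper: poll once per second until the cluster reaches
# the expected size, then print the elapsed seconds.
# Returns 1 if the time limit (in seconds) is reached first.
# SIZE_CMD can be overridden to exercise the loop without a database.
wait_for_cluster() {
    expected=$1
    limit=$2
    start=$(date +%s)
    while :; do
        size=$(${SIZE_CMD:-galera_cluster_size})
        now=$(date +%s)
        if [ "$size" = "$expected" ]; then
            echo $((now - start))
            return 0
        fi
        if [ $((now - start)) -ge "$limit" ]; then
            return 1
        fi
        sleep 1
    done
}
```

For example, `wait_for_cluster 3 900` run right after the kill would print the recovery time in seconds for a three-controller environment, or fail after 15 minutes. Both numbers are placeholders, not values taken from this report.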
Description of the environment:
Versions of components: Mirantis OpenStack 9.x from the Master branch (drop week 44)
Changed in fuel: | |
importance: | Undecided → High |
Changed in fuel: | |
status: | New → Confirmed |
Changed in fuel: | |
status: | Incomplete → Confirmed |
Changed in fuel: | |
assignee: | Sergii Golovatiuk (sgolovatiuk) → Vladimir Kuklin (vkuklin) |
Changed in fuel: | |
assignee: | Vladimir Kuklin (vkuklin) → Sergii Golovatiuk (sgolovatiuk) |
Changed in fuel: | |
assignee: | Sergii Golovatiuk (sgolovatiuk) → Vladimir Kuklin (vkuklin) |
tags: | added: on-verification |
Aleksey, so which behaviour is unexpected here? Pacemaker monitors resources at a fixed interval: it notices the first failure, tries to start that instance, fails, and then assembles the cluster on the second attempt. This is how Pacemaker is expected to handle failures. We may want to play with the timeouts to change the probability distribution here.
Still, this is expected behaviour for Pacemaker and I do not see any bug here, unless we provide more detailed criteria for recovery time.
If the bug still persists, the only candidate for regression is
https://github.com/openstack/fuel-library/commit/dda74618cd03aa2720afe8b0980947bf735dd6b2
But this behaviour is configurable, e.g. we can set the binary parameter of the mysql resource back to mysqld_safe and see whether the issue goes away.
Summary
1. We need strict recovery time criteria that are being violated
2. Please test changing the resource binary in the Pacemaker configuration to mysqld_safe and report whether it helps
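The test requested in item 2 could be sketched with `crm_resource` as shown below. This is only an illustration under stated assumptions: the resource name passed in must be looked up first (for example with `crm_resource --list`), the value may need to be a full path to mysqld_safe, and the function name `set_mysql_binary` and the `RUN` override are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: set the "binary" parameter of a Pacemaker resource.
# Assumption: the caller passes the real MySQL resource name.
# RUN is empty by default (the command is executed); set RUN=echo
# to print the command instead of running it.
set_mysql_binary() {
    resource=$1
    binary=$2
    ${RUN:-} crm_resource --resource "$resource" \
        --set-parameter binary --parameter-value "$binary"
}
```

A dry run such as `RUN=echo; set_mysql_binary p_mysqld mysqld_safe` prints the command without changing the cluster; `p_mysqld` here is an assumed resource name, not one confirmed by this report.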