mariadb_recovery fails and data loss
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
kolla | Fix Released | Critical | Jeffrey Zhang |
Liberty | Won't Fix | Critical | Jeffrey Zhang |
Mitaka | Won't Fix | Critical | Jeffrey Zhang |
Bug Description
Test:
- The situation to mimic is that the DB nodes/containers have gone down one by one, and finally the last one goes down. During this gradual shutdown, writes to the database take place. The end state is that recovery must be performed while not all nodes are in sync.
Test setup:
- kolla master
- centos source built 20160926
- multinode
Test execution steps (see the command sketch below):
- make sure all nodes are in sync (show global status like 'wsrep%')
- shut down mariadb on 2 nodes (docker stop mariadb)
- create some users and verify that they exist in the DB (openstack user create foo, openstack user list)
- shut down the last mariadb node
- kolla-ansible mariadb_recovery
- check whether the users still exist
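For reference, the steps above correspond roughly to the following commands. This is only a sketch; the credentials file location, the inventory path, and the MariaDB root password handling are assumptions based on a standard kolla-ansible multinode deployment.

    # on each controller: confirm the node reports Synced before starting
    docker exec mariadb mysql -uroot -p -e "SHOW GLOBAL STATUS LIKE 'wsrep%';"

    # on two of the three controllers: stop MariaDB
    docker stop mariadb

    # on the deploy host, while only one DB node is still up: write to the database
    source /etc/kolla/admin-openrc.sh    # assumed location of the admin credentials
    openstack user create foo
    openstack user list

    # stop the last MariaDB node, then run the recovery playbook
    docker stop mariadb
    kolla-ansible -i <inventory> mariadb_recovery

    # verify that the users survived recovery
    openstack user list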
-------
inventory file (note the order of hosts, since it matters):
[control]
eselde02u32.
eselde02u33.
eselde02u34.
-------
Test Case 1.
- shut down nodes 33 and 34; create users; shut down node 32
Result:
- All mariadb containers come back online and report they are in sync.
- The playbook works. log http://
- No data is lost, i.e. the users exist in the database.
Test Case 2.
- shut down nodes 32 and 34; create users; shut down node 33
Result:
- All mariadb containers come back online and report they are in sync.
- The playbook actually fails. log http://
- Data loss is intermittent: roughly 50% of runs lose data.
Test Case 3.
- shut down nodes 32 and 33; create users; shut down node 34
Result:
- All mariadb containers come back online and report they are in sync.
- The playbook actually fails. log http://
- Data loss is intermittent: roughly 50% of runs lose data.
-------
Conclusion:
I have only read through the code briefly, and there is probably more than one way of doing this, so this is purely speculative on my end. It seems that the code always attempts recovery on the first node in the inventory file, but according to the Galera documentation it is imperative that recovery be done on the node with the highest sequence number. This is why test case 1 works, but cases 2 and 3 fail.
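For reference, Galera records the last committed transaction in grastate.dat, and that is what should decide which node bootstraps the recovered cluster. A minimal sketch of the check, assuming the mariadb container's datadir is exposed through the usual kolla "mariadb" docker volume (the exact path is an assumption and may differ per deployment):

    # on each controller: read the saved Galera state
    cat /var/lib/docker/volumes/mariadb/_data/grastate.dat
    # example output:
    #   uuid:  xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    #   seqno: 1234
    # the node with the largest seqno must be bootstrapped first.
    # a seqno of -1 means an unclean shutdown; in that case the position can be
    # recovered with "mysqld --wsrep-recover", which logs
    # "Recovered position: <uuid>:<seqno>".

Presumably the playbook would then need to pick that node for bootstrap instead of always using the first host in the inventory.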
summary: mariadb_recovery fails and intermittent data loss → mariadb_recovery fails and data loss
Changed in kolla:
status: New → Confirmed
importance: Undecided → Critical
milestone: none → newton-rc2
Thanks bjolo, the choice of recovery node is the root cause. I will try to fix this.