kolla-ansible

mariadb_recovery is prone to data loss

Bug #1682153 reported by Sam Yaple on 2017-04-12

This bug affects 4 people

Affects	Status	Importance	Assigned to	Milestone
kolla-ansible	Fix Released	Critical	Tudosoiu Marian	kolla-ansible queens-3

Bug Description

While the mariadb_recovery playbook has never been the shining pinnacle of Galera recovery that it could be, it is in a pretty bad state right now.

In the past the playbook's behaviour was "supply a recovery host, or the first host in the mariadb group (mariadb[0]) will be recovered". It now tries to be a bit smarter by reading grastate.dat, but it does not account for non-graceful shutdowns.

A patch added prior to Newton reads grastate.dat and tries to parse it for the highest seqno. The data-loss scenario is this: you shut down a node gracefully, some time passes, and then your cluster crashes. This isn't uncommon and shouldn't be dismissed. The crashed nodes are left with a seqno of -1 in grastate.dat, so the playbook will choose the old, gracefully shut-down node and stomp the data on the rest of the nodes. This is all done without user interaction and is exceedingly dangerous, since no backup is taken either.
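
To make the failure mode concrete, here is a minimal Python sketch of a naive "pick the highest seqno" selection. It is not the playbook's actual code: the host names, the uuid, the seqno values and the parse_seqno helper are all made up for illustration, and it assumes that a node which did not shut down cleanly is left with seqno -1 in grastate.dat.

```python
import re


def parse_seqno(grastate_text: str) -> int:
    """Return the seqno recorded in the text of a grastate.dat file."""
    match = re.search(r"^seqno:\s*(-?\d+)\s*$", grastate_text, re.MULTILINE)
    if match is None:
        raise ValueError("no seqno line found")
    return int(match.group(1))


# Hypothetical grastate.dat contents; the uuid is a placeholder.
GRASTATE_TEMPLATE = """# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000001
seqno:   {seqno}
"""

# Hypothetical cluster: node1 was shut down gracefully at seqno 1000.
# node2 and node3 kept accepting writes and later crashed, so their
# files still say -1 even though their data is newer than node1's.
cluster = {
    "node1": GRASTATE_TEMPLATE.format(seqno=1000),
    "node2": GRASTATE_TEMPLATE.format(seqno=-1),
    "node3": GRASTATE_TEMPLATE.format(seqno=-1),
}

seqnos = {host: parse_seqno(text) for host, text in cluster.items()}

# Naive selection: the stale node wins because 1000 > -1.
naive_choice = max(seqnos, key=seqnos.get)
print(naive_choice)  # node1
```

Bootstrapping from node1 in this situation would overwrite everything node2 and node3 committed after node1 went down.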

A proper recovery method that can be automated, and works a good portion of the time, is as follows:

* Check if all mariadb nodes are stopped
    * If not stopped, then do not recover
* Check if any mariadb nodes have gvwstate.dat
    * If gvwstate.dat is found, start *only* the nodes with gvwstate.dat, without special options
        * This is not a guaranteed recovery, but it is a safe action (no data loss can occur)
* Check if any mariadb nodes have grastate.dat
    * If grastate.dat exists and has a seqno of -1 on any node, it is not safe to autorecover; abort
    * If no -1 exists on any node, bootstrap the node with the highest seqno
This will cover a good chunk of the failure scenarios, including graceful shutdowns and full cluster outages, without any risk of data loss. If it cannot automatically recover, then the user should be *forced* to supply a bootstrap node on the command line based on whatever criteria they want (sometimes guessing, but guessing should be done by the user, not Kolla-Ansible).
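
The checklist above can be sketched as a small decision function. This is only an illustration under stated assumptions, not what Kolla-Ansible implements: NodeState and plan_recovery are hypothetical names, the per-node facts are assumed to have been gathered already, aborting when no node has a grastate.dat is my assumption (the checklist does not cover that case), and ties on seqno are broken by host name purely to keep the result deterministic.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class NodeState:
    running: bool          # is mariadb still running on this node?
    has_gvwstate: bool     # does gvwstate.dat exist on this node?
    seqno: Optional[int]   # seqno from grastate.dat, None if the file is missing


def plan_recovery(nodes: Dict[str, NodeState]) -> Tuple[str, List[str]]:
    """Return (action, hosts); action is "abort", "start_normally" or "bootstrap"."""
    # Step 1: never touch a cluster that still has a running node.
    if any(node.running for node in nodes.values()):
        return "abort", []

    # Step 2: nodes with gvwstate.dat are started without special options.
    with_gvwstate = sorted(h for h, n in nodes.items() if n.has_gvwstate)
    if with_gvwstate:
        return "start_normally", with_gvwstate

    # Step 3: only bootstrap when every known seqno is valid.
    seqnos = {h: n.seqno for h, n in nodes.items() if n.seqno is not None}
    if not seqnos or any(s == -1 for s in seqnos.values()):
        # Assumption: no grastate.dat anywhere is treated like an unsafe state.
        return "abort", []

    best = max(sorted(seqnos), key=seqnos.get)
    return "bootstrap", [best]


# Graceful shutdown of the whole cluster: the newest node is bootstrapped.
graceful = {
    "node1": NodeState(False, False, 10),
    "node2": NodeState(False, False, 12),
    "node3": NodeState(False, False, 11),
}
print(plan_recovery(graceful))  # ('bootstrap', ['node2'])

# Stale node plus a later crash: refuse to guess, the user must decide.
stale_then_crash = {
    "node1": NodeState(False, False, 1000),
    "node2": NodeState(False, False, -1),
    "node3": NodeState(False, False, -1),
}
print(plan_recovery(stale_then_crash))  # ('abort', [])
```

An "abort" result is the point at which the user would have to name a bootstrap node explicitly.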

Paul Bourke (pauldbourke) on 2017-04-12

Changed in kolla-ansible:
importance:	Undecided → Critical
status:	New → Confirmed

Duong Ha-Quang (duonghq) on 2017-06-06

Changed in kolla-ansible:
milestone:	none → pike-2

Jeffrey Zhang (jeffrey4l) on 2017-06-14

Changed in kolla-ansible:
milestone:	pike-2 → pike-3

Jeffrey Zhang (jeffrey4l) on 2017-07-30

Changed in kolla-ansible:
milestone:	pike-3 → pike-rc1

zongyimin (yanpeifei) wrote on 2017-08-04:

I have also run into this.

Eduardo Gonzalez (egonzalez90) on 2017-09-06

Changed in kolla-ansible:
milestone:	pike-rc1 → pike-rc2
milestone:	pike-rc2 → queens-1

Jeffrey Zhang (jeffrey4l) on 2017-12-11

Changed in kolla-ansible:
milestone:	queens-2 → queens-3

Tudosoiu Marian (mtudosoiu) on 2018-01-04

Changed in kolla-ansible:
assignee:	nobody → Tudosoiu Marian (mtudosoiu)

OpenStack Infra (hudson-openstack) wrote on 2018-01-04: Related fix proposed to kolla-ansible (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/531115

Tudosoiu Marian (mtudosoiu) on 2018-01-04

Changed in kolla-ansible:
status:	Confirmed → Fix Committed

OpenStack Infra (hudson-openstack) wrote on 2018-01-04: Change abandoned on kolla-ansible (master)

Change abandoned by Tudosoiu Marian (marian.tudosoiu@1and1.ro) on branch: master
Review: https://review.openstack.org/531115

OpenStack Infra (hudson-openstack) wrote on 2018-01-04: Related fix proposed to kolla-ansible (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/531122

OpenStack Infra (hudson-openstack) wrote on 2018-01-10: Fix proposed to kolla-ansible (master)

Fix proposed to branch: master
Review: https://review.openstack.org/532509

OpenStack Infra (hudson-openstack) wrote on 2018-01-10:

Fix proposed to branch: master
Review: https://review.openstack.org/532515

OpenStack Infra (hudson-openstack) wrote on 2018-01-10: Change abandoned on kolla-ansible (master)

Change abandoned by Tudosoiu Marian (marian.tudosoiu@1and1.ro) on branch: master
Review: https://review.openstack.org/532509

OpenStack Infra (hudson-openstack) wrote on 2018-01-10:

Change abandoned by Tudosoiu Marian (marian.tudosoiu@1and1.ro) on branch: master
Review: https://review.openstack.org/532515

OpenStack Infra (hudson-openstack) wrote on 2018-01-31: Fix merged to kolla-ansible (master)

Reviewed: https://review.openstack.org/531122
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=cead8ec6235bfd4f8d36efcbab1f3d0a288cbd96
Submitter: Zuul
Branch: master

commit cead8ec6235bfd4f8d36efcbab1f3d0a288cbd96
Author: Marian Tudosoiu <marian.tudosoiu@1and1.ro>
Date: Thu Jan 4 12:32:12 2018 +0200

Rework mariadb recovery tasks

    In recover_cluster.yaml playbook the task to find the highest
    seqno/Global Transaction ID is no longer relying only on grastate.dat
    Instead it now follows the recommendations from galera cluster website
    http://galeracluster.com/documentation-webpages/restartingcluster.html

Closes-Bug: 1682153

Change-Id: I5fc3eaa8baee659576c4c39aef9cfd351c8e9af7

Changed in kolla-ansible:
status:	Fix Committed → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2018-01-31: Fix proposed to kolla-ansible (stable/pike)

#10

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/539632

OpenStack Infra (hudson-openstack) wrote on 2018-02-05: Related fix merged to kolla-ansible (master)

#11

Reviewed: https://review.openstack.org/539628
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=465bc9ee1c9324ba95b223f7b8d11bc0cd376608
Submitter: Zuul
Branch: master

commit 465bc9ee1c9324ba95b223f7b8d11bc0cd376608
Author: Alexandru Bogdan Pica <alexandru.pica@1and1.ro>
Date: Wed Jan 31 20:27:38 2018 +0200

Improve mariadb_recovery

The purpose of this change is to improve upon
https://review.openstack.org/#/c/531122/

- Moved vars inside the defaults/main.yml file
- Made the regex for the lineinfile safer

    Change-Id: Id581c0b36f3d4bd61d3627b8364b79296b967387
    Closes-Bug: 1746567
    Related-Bug: 1682153

OpenStack Infra (hudson-openstack) wrote on 2018-02-05: Fix merged to kolla-ansible (stable/pike)

#12

Reviewed: https://review.openstack.org/539632
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=283544ebd6c991a9adb4e3e68292b5b40ab8f55e
Submitter: Zuul
Branch: stable/pike

commit 283544ebd6c991a9adb4e3e68292b5b40ab8f55e
Author: Marian Tudosoiu <marian.tudosoiu@1and1.ro>
Date: Thu Jan 4 12:32:12 2018 +0200

Rework mariadb recovery tasks

Closes-Bug: 1682153

Change-Id: I5fc3eaa8baee659576c4c39aef9cfd351c8e9af7
(cherry picked from commit cead8ec6235bfd4f8d36efcbab1f3d0a288cbd96)

tags:

added: in-stable-pike

OpenStack Infra (hudson-openstack) wrote on 2018-02-05: Related fix proposed to kolla-ansible (stable/pike)

#13

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/540940

OpenStack Infra (hudson-openstack) wrote on 2018-02-06: Related fix merged to kolla-ansible (stable/pike)

#14

Reviewed: https://review.openstack.org/540940
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=37fa7fcfb06bffffe9b0f0f801ec5d32c470c4d5
Submitter: Zuul
Branch: stable/pike

commit 37fa7fcfb06bffffe9b0f0f801ec5d32c470c4d5
Author: Alexandru Bogdan Pica <alexandru.pica@1and1.ro>
Date: Wed Jan 31 20:27:38 2018 +0200

Improve mariadb_recovery

The purpose of this change is to improve upon
https://review.openstack.org/#/c/531122/

- Moved vars inside the defaults/main.yml file
- Made the regex for the lineinfile safer

    Change-Id: Id581c0b36f3d4bd61d3627b8364b79296b967387
    Closes-Bug: 1746567
    Related-Bug: 1682153
    (cherry picked from commit 465bc9ee1c9324ba95b223f7b8d11bc0cd376608)