MySQL OCF RA may not always recover all of the cluster members

Bug #1573529 reported by Bogdan Dobrelya
This bug affects 1 person
Affects             Status     Importance  Assigned to      Milestone
Fuel for OpenStack  Confirmed  High        Bogdan Dobrelya
Mitaka              Confirmed  High        Bogdan Dobrelya

Bug Description

In a rare corner case, some of the DB cluster members refuse to join and fail with an error like:
[ERROR] WSREP: Local state seqno (14127) is greater than group seqno (14126): states diverged. Aborting to avoid potential data loss. Remove '/var/lib/mysql//grastate.dat' file and restart if you wish to continue. (FATAL)
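To make the failure easier to spot in logs, the two sequence numbers can be pulled out of the message. The following is only a sketch in Python: the regular expression is derived from the single log line quoted above, and the helper name `parse_divergence` is hypothetical, not part of the OCF RA or Galera.

```python
import re

# Pattern built from the "states diverged" message quoted in this report.
# Assumption: other Galera versions may word the message differently.
DIVERGED_RE = re.compile(
    r"Local state seqno \((?P<local>-?\d+)\) is greater than "
    r"group seqno \((?P<group>-?\d+)\)"
)


def parse_divergence(line):
    """Return (local_seqno, group_seqno), or None if the line does not match."""
    match = DIVERGED_RE.search(line)
    if match is None:
        return None
    return int(match.group("local")), int(match.group("group"))


line = (
    "[ERROR] WSREP: Local state seqno (14127) is greater than group seqno "
    "(14126): states diverged. Aborting to avoid potential data loss."
)
local, group = parse_divergence(line)
print(local - group)  # 1: the joining node is one transaction ahead of the group
```

A positive difference means the joining node holds writes the running group has never seen, which is why Galera aborts instead of overwriting them.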

We have to decide how to handle cases where the most frequently seen GTID does not carry the latest SEQNO, while a minority of nodes with other GTIDs may hold the most recent SEQNO.

So we have to do one of the following:
* change how we evaluate the master (use just max(SEQNO) and ignore the most seen GTIDs)
* remove grastate.dat from the OCF RA, as the error message recommends. This carries DATA LOSS risks and is not an option: the OCF RA may end up removing it on 3/5 nodes, and their data would be lost.
* touch nothing and let an admin decide how to recover the resources (nodes stay stopped, so no fully automated recovery). This is not really a fix: leave the behavior as is, document it as a known issue, and require manual recovery steps.

The last option seems to be the only feasible one.
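The difference between the first option and the current "most seen GTID" evaluation can be illustrated with a short Python sketch. It is not the OCF RA's actual code (the RA is a shell script); the function names are hypothetical, and the input values are the node attributes reported in this bug.

```python
from collections import Counter


def seqno(gtid):
    """Extract the integer SEQNO from a 'UUID:SEQNO' string."""
    return int(gtid.rsplit(":", 1)[1])


def most_seen_gtid(gtids):
    """Return the GTID reported by the largest number of nodes."""
    return Counter(gtids.values()).most_common(1)[0][0]


def max_seqno_gtid(gtids):
    """Return the GTID with the highest SEQNO, ignoring how often it is seen."""
    return max(gtids.values(), key=seqno)


UUID = "dc7a6c0c-0889-11e6-8326-478c77479e3b"
# Node attributes as stored in the CIB in this report (n5 is unknown, -1).
gtids = {
    "n1": UUID + ":22692",
    "n2": UUID + ":22692",
    "n3": UUID + ":23121",
    "n4": UUID + ":23785",
    "n5": UUID + ":-1",
}

print(seqno(most_seen_gtid(gtids)))  # 22692
print(seqno(max_seqno_gtid(gtids)))  # 23785
```

With these values the majority GTID is more than a thousand transactions behind the most advanced node, which is the situation that makes the remaining members refuse to join.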

Example output of the crm_mon -fotAW -1 command:
Online: [ n1 n2 n3 n4 n5 ]

 Clone Set: p_mysql-clone [p_mysql]
     Started: [ n1 n2 ]
     Stopped: [ n3 n4 n5 ]

Node Attributes:
* Node n1:
    + gtid : dc7a6c0c-0889-11e6-8326-478c77479e3b:22692
* Node n2:
    + gtid : dc7a6c0c-0889-11e6-8326-478c77479e3b:22692
* Node n3:
    + gtid : dc7a6c0c-0889-11e6-8326-478c77479e3b:23121
* Node n4:
    + gtid : dc7a6c0c-0889-11e6-8326-478c77479e3b:23785
* Node n5:
    + gtid : dc7a6c0c-0889-11e6-8326-478c77479e3b:-1

As you can see, 2/5 nodes have SEQNO 22692, 1/5 has a greater 23121, and 2/5 have
23785. Note that n5's GTID value stored in the CIB (-1) is stale; the actual one can be obtained with:
ssh n5 /usr/bin/mysqld_safe --wsrep-recover
160425 07:46:36 mysqld_safe WSREP: Recovered position dc7a6c0c-0889-11e6-8326-478c77479e3b:23785
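If the recovered position needs to be consumed by a script, it can be extracted from that output. A minimal Python sketch, assuming the output always contains a line shaped like the one above; `parse_recovered_position` is a hypothetical helper, and running the ssh command itself is left out.

```python
import re

# Matches the "Recovered position UUID:SEQNO" line printed by
# mysqld_safe --wsrep-recover, as quoted in this report.
RECOVERED_RE = re.compile(
    r"WSREP: Recovered position (?P<uuid>[0-9a-f-]{36}):(?P<seqno>-?\d+)"
)


def parse_recovered_position(output):
    """Return (uuid, seqno) from the last matching line, or None if absent."""
    matches = RECOVERED_RE.findall(output)
    if not matches:
        return None
    uuid, seqno = matches[-1]
    return uuid, int(seqno)


output = (
    "160425 07:46:36 mysqld_safe WSREP: Recovered position "
    "dc7a6c0c-0889-11e6-8326-478c77479e3b:23785\n"
)
print(parse_recovered_position(output))
# ('dc7a6c0c-0889-11e6-8326-478c77479e3b', 23785)
```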

So the question is how to recover from this state.

Changed in mos:
importance: Undecided → Medium
milestone: none → 10.0
assignee: nobody → Fuel Library Team (fuel-library)
no longer affects: mos
tags: added: area-library galera
Changed in fuel:
importance: Undecided → Medium
milestone: none → 10.0
assignee: nobody → Fuel Library Team (fuel-library)
summary: - MySQL OCF RA may not always recover not all of the cluster members
+ MySQL OCF RA may not always recover all of the cluster members
Dmitry Klenov (dklenov)
Changed in fuel:
status: New → Confirmed
tags: added: area-docs docs
removed: area-library
description: updated
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Fuel Documentation Team (fuel-docs)
Bogdan Dobrelya (bogdando) wrote :

Raising to High due to the UX impact on DB cluster recovery.

Changed in fuel:
importance: Medium → High
description: updated
description: updated
Bogdan Dobrelya (bogdando) wrote :

According to this thread, https://groups.google.com/forum/#!topic/codership-team/Dar30tX8JEc, the OCF RA should simply pick the node with the most recent UUID:SEQNO and start it as the seed node, making the rest rejoin.
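That selection rule could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the RA's implementation: `choose_seed` is a hypothetical name, nodes with an unknown position (-1) are never picked as the seed, and nodes reporting different cluster UUIDs are treated as a case for manual recovery rather than resolved automatically.

```python
def split_gtid(gtid):
    """Split 'UUID:SEQNO' into (uuid, int(seqno))."""
    uuid, _, seqno = gtid.rpartition(":")
    return uuid, int(seqno)


def choose_seed(gtids):
    """Return (seed_node, nodes_to_rejoin) using the highest known SEQNO."""
    positions = {node: split_gtid(gtid) for node, gtid in gtids.items()}
    # A SEQNO of -1 means the position is unknown, so skip such nodes.
    candidates = {n: p for n, p in positions.items() if p[1] >= 0}
    if not candidates:
        raise ValueError("no node has a known position")
    if len({uuid for uuid, _ in candidates.values()}) > 1:
        raise ValueError("nodes report different cluster UUIDs")
    # Iterate in sorted order so that ties resolve to the first node name.
    seed = max(sorted(candidates), key=lambda n: candidates[n][1])
    return seed, sorted(n for n in gtids if n != seed)


UUID = "dc7a6c0c-0889-11e6-8326-478c77479e3b"
# Values from this report, with n5 set to the position that
# mysqld_safe --wsrep-recover printed in the description.
gtids = {
    "n1": UUID + ":22692",
    "n2": UUID + ":22692",
    "n3": UUID + ":23121",
    "n4": UUID + ":23785",
    "n5": UUID + ":23785",
}
print(choose_seed(gtids))  # ('n4', ['n1', 'n2', 'n3', 'n5'])
```

Breaking ties by node name is an arbitrary choice made here only to keep the result deterministic.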

description: updated
Changed in fuel:
assignee: Fuel Documentation Team (fuel-docs) → Fuel Library Team (fuel-library)
tags: added: area-library
removed: area-docs docs
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
Changed in fuel:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/309891

Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/newton
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Fuel Sustaining (fuel-sustaining-team)
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Bogdan Dobrelya (bogdando)