Galera doesn't detect that a node has diverged, same grastate.dat on nodes having different data
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Galera | Confirmed | Wishlist | Unassigned |
Bug Description
Hi
Here's something to scratch your heads on:
Summary:
It is possible to create a situation where disconnected Galera nodes contain different datasets but the same grastate.dat. When such nodes join together as a cluster, they believe they are identical even though they are not, and each node keeps its own, different contents.
Steps to repeat:
- shut down all nodes
- one by one: start node, insert one unique row, shut down node
(the bug probably happens with any combination of an equal amount of DML on each node)
- start cluster
Note: On normal network partitioning Galera correctly chooses a quorum and prevents split brain from happening. In this test we specifically start the nodes one by one, in a state that allows them to diverge. The test is to see what happens when they rejoin.
On all nodes we have the same state:
> select * from test.t;
+---+--
| k | h | t | v |
+---+--
| 1 | cluster130 | 12:38:52 | first row |
+---+--
1 row in set (0.00 sec)
./mysql-galera stop
One node at a time:
./mysql-galera -g gcomm:// start
>insert into t values (2, @@hostname, curtime(), 'second row - inserted in single node mode');
Query OK, 1 row affected (0.00 sec)
>select * from t;
+---+--
| k | h | t | v |
+---+--
| 1 | cluster130 | 12:38:52 | first row |
| 2 | cluster130 | 12:48:42 | second row - inserted in single node mode |
+---+--
2 rows in set (0.00 sec)
./mysql-galera -g gcomm:// stop
(Other nodes having different h and t values.)
On all nodes:
cat mysql/var/
# GALERA saved state, version: 0.8, date: (todo)
uuid: 80f3cf22-
seqno: 3479
cert_index:
(All nodes have same uuid and seqno.)
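To make the problem concrete, here is a minimal sketch of the comparison that goes wrong. It is not Galera's code: `parse_grastate` and `state_id` are hypothetical helpers, and the UUID is a made-up placeholder because the real value is truncated above. The seqno and the per-node values of row 2 are taken from the session output in this report.

```python
# Placeholder UUID (assumption): the real one is truncated in the report.
PLACEHOLDER_UUID = "00000000-0000-0000-0000-000000000001"

GRASTATE = f"""# GALERA saved state, version: 0.8, date: (todo)
uuid: {PLACEHOLDER_UUID}
seqno: 3479
cert_index:
"""


def parse_grastate(text):
    """Parse 'key: value' lines, skipping blank lines and '#' comments."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def state_id(state):
    """The (uuid, seqno) pair is all a node can compare without reading data."""
    return (state["uuid"], int(state["seqno"]))


# Every node saved the same grastate.dat, but row 2 differs per node.
nodes = {
    "cluster130": {"grastate": GRASTATE, "row2": ("cluster130", "12:48:42")},
    "cluster129": {"grastate": GRASTATE, "row2": ("cluster129", "12:51:14")},
    "cluster128": {"grastate": GRASTATE, "row2": ("cluster128", "12:53:59")},
}

ids = {name: state_id(parse_grastate(n["grastate"])) for name, n in nodes.items()}
same_state_id = len(set(ids.values())) == 1
same_data = len({n["row2"] for n in nodes.values()}) == 1

print("state ids identical:", same_state_id)
print("data identical:", same_data)
```

With only the (uuid, seqno) pair to go on, the three nodes are indistinguishable, even though their contents are not.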
Now start all nodes as a cluster again (gcomm://nodename)
>insert into t values (3, @@hostname, curtime(), 'third row - cluster rejoined');
Query OK, 1 row affected (0.01 sec)
>select * from t;
+---+--
| k | h | t | v |
+---+--
| 1 | cluster130 | 12:38:52 | first row |
| 2 | cluster130 | 12:48:42 | second row - inserted in single node mode |
| 3 | cluster130 | 13:02:48 | third row - cluster rejoined |
+---+--
3 rows in set (0.00 sec)
>select * from test.t;
+---+--
| k | h | t | v |
+---+--
| 1 | cluster130 | 12:38:52 | first row |
| 2 | cluster129 | 12:51:14 | second row - inserted in single node mode |
| 3 | cluster130 | 13:02:48 | third row - cluster rejoined |
+---+--
3 rows in set (0.00 sec)
> select * from test.t;
+---+--
| k | h | t | v |
+---+--
| 1 | cluster130 | 12:38:52 | first row |
| 2 | cluster128 | 12:53:59 | second row - inserted in single node mode |
| 3 | cluster130 | 13:02:48 | third row - cluster rejoined |
+---+--
3 rows in set (0.00 sec)
(The cluster has rejoined and new data is replicated, but split brain has occurred and the nodes have diverged.)
Expected result:
Nodes should discover that they have different states and ask for a copy from the first node (cluster130).
Actual result:
Nodes do not discover that their data has diverged, and form a cluster where each node has a different dataset.
Suggested fix:
I have not thought about this too much, but...
It seems to me you need to change the uuid in grastate.dat when:
- one or more nodes leave the cluster
- seqno is increased
Another way of saying the same thing is that you need to change the uuid when the cluster configuration changes, except that nodes joining a cluster are OK. The new UUID only needs to be saved to grastate.dat when the seqno would be increased.
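The rule I have in mind could be sketched roughly like this. This is only a toy model of the idea, not how Galera works internally: the `NodeState` class, its method names, the placeholder UUID string and the starting seqno of 10 are all made up for illustration.

```python
import uuid


class NodeState:
    """Toy model of the (uuid, seqno) pair a node would save to grastate.dat."""

    def __init__(self, state_uuid, seqno):
        self.uuid = state_uuid
        self.seqno = seqno
        self.pending_new_uuid = False

    def on_membership_change(self, nodes_left):
        # Only nodes leaving matters; nodes joining do not require a new UUID.
        if nodes_left > 0:
            self.pending_new_uuid = True

    def on_commit(self):
        # The new UUID is generated lazily, only when seqno is about to grow.
        if self.pending_new_uuid:
            self.uuid = str(uuid.uuid4())
            self.pending_new_uuid = False
        self.seqno += 1


SHARED_UUID = "shared-history-placeholder"  # made-up value for the sketch

# Three nodes with identical state, each started alone (two peers missing).
nodes = [NodeState(SHARED_UUID, 10) for _ in range(3)]
for node in nodes:
    node.on_membership_change(nodes_left=2)
    node.on_commit()  # the single-node insert

# A node that lost its peers but wrote nothing keeps the shared UUID.
idle = NodeState(SHARED_UUID, 10)
idle.on_membership_change(nodes_left=2)

print("distinct uuids:", len({n.uuid for n in nodes}))
print("seqnos:", [n.seqno for n in nodes])
print("idle node kept uuid:", idle.uuid == SHARED_UUID)
```

In this model the three nodes still end up with equal seqnos, but each has its own UUID, so on rejoin they would no longer look identical.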
Note that this is not a critical bug. This state can only be entered by deliberate user action or user error. But it would be nice if Galera could detect the diverged state and discard the diverged replica.
Changed in galera:
status: New → Confirmed
Changed in galera:
importance: Undecided → Wishlist
More thought on suggested fix:
It is probably desirable to minimize the number of times a UUID change is needed. For instance, if you later implement some incremental SST method, it must be possible for a node to disconnect, rejoin and discover the same uuid, which means it can ask for an incremental SST and not a full dump.
A possible tweak to my proposed solution would be for the selected quorum to keep the same UUID, but the non-quorum partition would need to change UUID (if seqno increases).
Alternatively, a simpler approach may be not to implement automatic detection at all, and instead provide command line options that let the user control whether SST should happen:
--sst-force : Discard the grastate.dat on this node and do SST from the primary component.
--sst-disable : Don't do SST. If grastate.dat does not match with the primary component being joined, report error and shut down, but don't delete my data.
--sst-ondemand or --sst-initial : Do SST if needed when joining cluster.
(Currently the last one is default behavior.)
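As a rough sketch of how the three proposed options would behave when a node joins: the options do not exist today, and `decide_join_action`, the mode strings and the returned action names are all hypothetical, chosen only to illustrate the proposal.

```python
def decide_join_action(mode, local_id, primary_id):
    """Return what a joining node would do under each proposed option.

    local_id and primary_id are (uuid, seqno) pairs; mode is one of
    "force", "disable" or "ondemand" (hypothetical names for the sketch).
    """
    matches = local_id == primary_id
    if mode == "force":
        # --sst-force: discard local grastate.dat and always take a full copy.
        return "sst"
    if mode == "disable":
        # --sst-disable: never copy; on mismatch report an error, keep the data.
        return "join" if matches else "abort"
    if mode == "ondemand":
        # --sst-ondemand / --sst-initial: copy only if the states differ.
        return "join" if matches else "sst"
    raise ValueError(f"unknown SST mode: {mode!r}")


same = ("uuid-a", 3479)       # made-up state ids
other = ("uuid-b", 3479)

for mode in ("force", "disable", "ondemand"):
    print(mode, decide_join_action(mode, same, same),
          decide_join_action(mode, same, other))
```

Note that in the diverged scenario described in this bug the state ids match, so only the force option would actually replace the diverged data.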