Whole Cluster becomes NOT-Primary after ASIO issue on a node
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Galera | Status tracked in 3.x | | |
2.x | Fix Committed | Undecided | Unassigned |
3.x | Fix Committed | Undecided | Unassigned |
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC | Status tracked in 5.6 | | |
5.5 | Fix Released | Medium | Unassigned |
5.6 | Fix Released | Medium | Unassigned |
Bug Description
Hi folks.
We recently had an issue with a customer installation.
The customer has multiple clusters, but the issue was present on one cluster only (at least as far as we know).
The customer started to have problems on one node when all the machines in the cluster began reporting a SCSI issue, as follows:
Sense codes:
Nov 6 22:09:14 Server Administrator: Storage Service EventID: 2095 Unexpected sense. SCSI sense data: Sense key: 6 Sense code: 29 Sense qualifier: CD: Physical Disk 0:1:1 Controller 0, Connector 0
ONLY ONE node actually interpreted this as a network issue, and it then started to disconnect from and rejoin the cluster.
The interruption was slightly longer than the default suspect_timeout.
The customer is obviously working on getting the machines fixed, but meanwhile another issue happened.
The cluster nodes are xdb1-5.
The xdb3 node was the one that normally failed.
On 131108 at 2:04:09, node xdb1 rejoined the cluster and xdb2 became its donor.
While the synchronization was in progress, node xdb5 crashed with the error:
131108 2:04:26 [Note] WSREP: declaring b9f56651-
terminate called after throwing an instance of 'std::out_of_range'
what(): vector:
02:04:26 UTC - mysqld got signal 6 ;
The error is related to the ASIO libraries, so I think it should be associated with the network transfer.
Node xdb5 stopped and MySQL crashed.
At the same time node xdb1 seems to have had an issue; it is not clear to me whether this was because xdb5 crashed or because the same or another network issue occurred:
131108 2:04:26 [ERROR] WSREP: exception caused by message: evs::msg{
35d7689b-
692dff7e-
b9f56651-
131108 2:04:26 [ERROR] WSREP: state after handling message: evs::proto(
current_
0d7bd986-
35d7689b-
692dff7e-
b9f56651-
f9398cbc-
} joined {
} left {
} partitioned {
}),
input_map=
131108 2:04:26 [ERROR] WSREP: exception from gcomm, backend must be restarted:nlself_i != same_view.end(): (FATAL) <------
at gcomm/src/
131108 2:04:26 [Note] WSREP: Received self-leave message.
131108 2:04:26 [Note] WSREP: Flow-control interval: [0, 0]
131108 2:04:26 [Note] WSREP: Received SELF-LEAVE. Closing connection.
131108 2:04:26 [Note] WSREP: Shifting JOINER -> CLOSED (TO: 67975520)
131108 2:04:26 [Note] WSREP: RECV thread exiting 0: Success
After this the other nodes were also affected.
xdb2 seems to have tried to reconnect to the joiner, but without success:
131108 2:04:26 [Note] WSREP: (692dff7e-
131108 2:04:26 [Note] WSREP: (692dff7e-
131108 2:04:28 [Note] WSREP: (692dff7e-
131108 2:04:30 [Note] WSREP: (692dff7e-
131108 2:05:00 [Note] WSREP: evs::proto(
131108 2:05:05 [Warning] WSREP: evs::proto(
evs::proto(
current_
0d7bd986-
35d7689b-
692dff7e-
b9f56651-
f9398cbc-
} joined {
} left {
} partitioned {
}),
The same happened on xdb3-4.
After that, it seems all the nodes received the request to change state to NON-Primary, and at that point production was affected.
Summarizing:
Cluster of 5 nodes.
xdb1 was Joining so active 4 of 5.
xdb2 became Donor so active 3 of 5.
xdb5 crashed so active 2 of 5.
Is it possible that the remaining 2 nodes chose to become NON-Active because they were in a different state and saw this as a split brain?
Why was it not possible for xdb2 to rejoin the cluster, perform an IST sync, and participate in the quorum, given that xdb1 had stopped and xdb2 was then free, before the possible quorum declaration?
We suspect a possible link with the issues below, but I would like to have a better understanding of the sequence of events and the root cause.
The whole set of logs relevant to the issue is attached.
Suspected related Issues:
https:/
https:/
Log file analysis:
First, there were 4 nodes online, xdb2-5. Then node xdb1 started the join process and requested SST from xdb2. Due to network conditions, xdb5 dropped from the group briefly at least once (xdb5 log 2:24:15 and onwards). Due to lp:1232747, nodes xdb1 and xdb5 crashed during group renegotiation, which caused the remaining xdb2-4 to form singleton groups, because previous failed attempts had made the evs install_timeout_count_ counter reach its maximum value. At this point the primary component was lost and could not be re-established because some nodes (probably xdb1, xdb5) from the previous known primary component were not present.
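To make the loss of the primary component concrete, here is a minimal sketch of a majority check. It assumes an unweighted, strict-majority quorum rule evaluated against the last known primary component, and it assumes that component contained all five nodes; `has_quorum` is a hypothetical helper for illustration, not Galera code.

```python
def has_quorum(component, last_primary):
    """Return True if `component` holds a strict majority of `last_primary`.

    Assumption: every node has equal weight (no weighted quorum).
    """
    present = set(component) & set(last_primary)
    return 2 * len(present) > len(last_primary)


# Assumed membership of the last known primary component.
last_primary = {"xdb1", "xdb2", "xdb3", "xdb4", "xdb5"}

# After xdb1 and xdb5 crashed, xdb2-4 each ended up in a singleton group.
singletons = [{"xdb2"}, {"xdb3"}, {"xdb4"}]
singleton_results = [has_quorum(group, last_primary) for group in singletons]

# For comparison: the same three nodes staying in one group.
together_result = has_quorum({"xdb2", "xdb3", "xdb4"}, last_primary)
```

Under this simplified rule each singleton holds 1 of 5 members and cannot be primary, while the three survivors would have held 3 of 5 had they stayed in a single group.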
So, there are at least two issues to be addressed:
* The obvious one, lp:1232747, which should be fixed
* Make the evs install_timeout_count_ counter maximum value higher, or mark other nodes invalid one by one to avoid losing too many nodes from the group at once. One way would be to set the max value to the size of the last known group and mark nodes invalid only if they fail to reach consensus within the install timeout period.
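One possible reading of the counter proposal, sketched under stated assumptions: the counter maximum is the size of the last known group, and each expired install timeout period invalidates at most one node that failed to reach consensus. `InstallTimeoutTracker` and its methods are hypothetical names for illustration and do not correspond to the actual gcomm implementation.

```python
class InstallTimeoutTracker:
    """Hypothetical bookkeeping for install timeouts (not gcomm code)."""

    def __init__(self, last_known_group):
        self.members = list(last_known_group)
        # Assumption: the counter maximum equals the last known group size.
        self.max_count = len(self.members)
        self.count = 0
        self.invalid = set()

    def on_install_timeout(self, nodes_without_consensus):
        """Handle one expired install timeout period.

        Marks at most one node invalid and returns it, or None if no
        still-valid node failed to reach consensus.
        """
        self.count = min(self.count + 1, self.max_count)
        for node in self.members:
            if node in nodes_without_consensus and node not in self.invalid:
                self.invalid.add(node)
                return node
        return None

    def candidates(self):
        """Nodes still considered valid for the next install attempt."""
        return [n for n in self.members if n not in self.invalid]


tracker = InstallTimeoutTracker(["xdb1", "xdb2", "xdb3", "xdb4", "xdb5"])
first = tracker.on_install_timeout({"xdb1", "xdb5"})   # drops xdb1 only
second = tracker.on_install_timeout({"xdb1", "xdb5"})  # then xdb5
third = tracker.on_install_timeout(set())              # nothing to drop
```

Dropping one node per timeout period keeps xdb2-4 together as candidates instead of splitting them into singleton groups.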
It might also make sense to isolate constantly failing nodes from the group for longer periods of time to avoid causing too much turbulence for the group.
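The isolation idea could be sketched as a per-node backoff, where every repeated failure doubles the time a node is kept out of the group. The class name, the doubling policy, and the 5 and 300 second bounds are all assumptions made for illustration; none of them come from Galera.

```python
class IsolationTracker:
    """Hypothetical exponential isolation of repeatedly failing nodes."""

    def __init__(self, base_seconds=5.0, max_seconds=300.0):
        self.base = base_seconds
        self.max = max_seconds
        self.failures = {}
        self.isolated_until = {}

    def record_failure(self, node, now):
        """Register a failure at time `now` and return the isolation period."""
        failures = self.failures.get(node, 0) + 1
        self.failures[node] = failures
        # Double the period on each failure, capped at the maximum.
        period = min(self.base * 2 ** (failures - 1), self.max)
        self.isolated_until[node] = now + period
        return period

    def is_isolated(self, node, now):
        """True while the node should be kept out of the group."""
        return now < self.isolated_until.get(node, 0.0)


iso = IsolationTracker()
p1 = iso.record_failure("xdb3", now=0.0)    # first failure
p2 = iso.record_failure("xdb3", now=10.0)   # second failure, longer period
still_out = iso.is_isolated("xdb3", now=15.0)
back_in = iso.is_isolated("xdb3", now=20.0)
```

A node that flaps repeatedly, as xdb3 did here, would then stay out of the group for progressively longer intervals rather than forcing a renegotiation on each reconnect.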