JOINER partition during SST causes cluster hang
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Galera | New | Undecided | Unassigned |
Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC) | Incomplete | High | Unassigned |
Bug Description
This is PXC:
Version: '5.5.30' socket: '/var/lib/
JOINER: tamarind (XXX.XXX.XXX.206)
DONOR: tarragon (XXX.XXX.XXX.14)
Other cluster nodes:
tabasco (XXX.XXX.XXX.205), tandoori (XXX.XXX.XXX.207)
Timeline of events:
7:21AM - Cluster node started, enters JOINER state, DONOR starts Xtrabackup SST
7:44AM - JOINER gets partitioned from the cluster, cluster hangs (no write activity)
7:55AM - Xtrabackup takes FTWRL on DONOR
7:56AM - DONOR finishes Xtrabackup
7:57AM - Cluster write activity resumes, JOINER tries to become Synced, but gets error:
130613 7:57:28 [Warning] WSREP: Protocol violation. JOIN message sender 0 (tarragon) is not in state transfer (SYNCED). Message ignored.
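When comparing logs across nodes, it can help to pull the sender and state out of this warning mechanically. The following is a minimal sketch, assuming the message keeps the exact wording quoted above; the function name `parse_join_violation` and the returned field names are hypothetical, chosen only for illustration. Note that the sender named in the quoted line is tarragon, which is listed above as the DONOR.

```python
import re

# Pattern matches the warning text exactly as quoted in this report.
# The field names (date, time, index, name, state) are illustrative only.
JOIN_VIOLATION = re.compile(
    r"(?P<date>\d{6})\s+(?P<time>\d{1,2}:\d{2}:\d{2})\s+\[Warning\] WSREP: "
    r"Protocol violation\. JOIN message sender (?P<index>\d+) "
    r"\((?P<name>[^)]+)\) is not in state transfer \((?P<state>[^)]+)\)"
)


def parse_join_violation(line):
    """Return the fields of a JOIN protocol-violation warning, or None."""
    match = JOIN_VIOLATION.search(line)
    if match is None:
        return None
    return {
        "date": match.group("date"),
        "time": match.group("time"),
        "sender_index": int(match.group("index")),
        "sender_name": match.group("name"),
        "sender_state": match.group("state"),
    }


line = (
    "130613 7:57:28 [Warning] WSREP: Protocol violation. JOIN message "
    "sender 0 (tarragon) is not in state transfer (SYNCED). Message ignored."
)
info = parse_join_violation(line)
print(info["sender_name"], info["sender_state"])
```

Running this over each node's error log gives a quick list of which node sent the rejected JOIN message and what state the receiver believed it was in.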
The question is: why did the cluster pause between 7:44 and 7:56? (Logs attached.)
Jay, it looks like the logs from tamarind and tarragon are identical and both belong to a donor node. So we are missing the most important log: the one from the joiner (the node that was partitioned).
Anyway, the immediate reason for the stall is clear: everybody was waiting for the state exchange message from tamarind.
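The waiting condition described here can be illustrated with a toy model. This is only a sketch under the assumption that a configuration change completes once every member of the new view has delivered its state exchange message; it is not Galera's actual implementation, and the function name `pending_state_exchange` is hypothetical.

```python
def pending_state_exchange(view_members, received_from):
    """Return the members whose state exchange message has not arrived yet.

    Toy model only: the exchange is treated as complete when this list is empty.
    """
    return sorted(set(view_members) - set(received_from))


# Node names are taken from this report; which messages had arrived is assumed.
members = ["tabasco", "tamarind", "tandoori", "tarragon"]
received = ["tabasco", "tandoori", "tarragon"]

missing = pending_state_exchange(members, received)
if missing:
    print("state exchange blocked, waiting for:", ", ".join(missing))
else:
    print("state exchange complete")
```

In this model a single unreachable member is enough to keep the whole group waiting, which matches the stall seen while tamarind was partitioned.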