Incompatible SST between 5.5.29 and 5.5.33
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MySQL patches by Codership | New | Undecided | Unassigned |
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC | Status tracked in 5.6 | | |
5.5 | Fix Released | Undecided | Unassigned |
5.6 | Fix Released | Undecided | Unassigned |
Bug Description
We are running a Percona XtraDB Cluster with three nodes and two garbd instances (on 64-bit CentOS 6.3), all running PXC 5.5.29. XtraBackup is used for SST.
Due to some unexplained cluster-wide crashes, we decided to upgrade to the latest version (5.5.33). After we updated one of the PXC nodes, it failed to sync with the cluster.
Hosts: co1.ourdomain.com is the donor; co3.ourdomain.com is the upgraded node (joiner).
1. It seems the new version of PXC (5.5.33) expects by default that SST is streamed in xbstream format, while the donor (still running 5.5.29) sends it as tar. The changelog carries no warning about this, and it completely breaks backward compatibility.
WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:
130921 2:41:07 [Note] WSREP: Prepared SST request: xtrabackup|
130921 2:41:07 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130921 2:41:07 [Note] WSREP: Assign initial position for certification: 407853908, protocol version: 2
130921 2:41:07 [Note] WSREP: Prepared IST receiver, listening at: tcp://10.
130921 2:41:07 [Note] WSREP: Node 0 (co3.ourdomain.com) requested state transfer from '*any*'. Selected 1 (co1.ourdomain.
130921 2:41:07 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 407855552)
130921 2:41:07 [Note] WSREP: Requesting state transfer: success, donor: 1
xb_stream_
2013/09/21 02:41:09 socat[32276] E write(1, 0x16e93e0, 2896): Broken pipe
WSREP_SST: [ERROR] Error while getting data from donor node: exit codes: 1 1 (20130921 02:41:09.742)
WSREP_SST: [ERROR] Cleanup after exit with status:32 (20130921 02:41:09.744)
WSREP_SST: [INFO] Removing the sst_in_progress file (20130921 02:41:09.745)
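A note on the `exit codes: 1 1` line above: judging by the `RC=( ${PIPESTATUS[@]} )` fragment visible in the joiner log under point 2, the SST script appears to record the exit status of every stage of its receive pipeline. Below is a minimal bash sketch of that pattern. The `false` and `true` commands are stand-ins chosen for illustration, not the script's actual socat and unpack stages.

```bash
#!/usr/bin/env bash
# Stand-in pipeline: the first stage fails, the second succeeds.
false | true
# PIPESTATUS must be copied right away; the next command overwrites it.
RC=( "${PIPESTATUS[@]}" )
echo "exit codes: ${RC[*]}"
```

Read this way, `1 1` would mean that both stages of the joiner's pipeline exited non-zero. That would fit a reader aborting on an unexpected stream format and socat then hitting the broken pipe.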
2. After changing the SST settings in my.cnf on the joiner (streamfmt=tar), the first data transfer goes through fine. However, the joiner then waits forever for the donor to send another batch of data (I believe the binlogs generated during the state transfer?).
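For reference, a sketch of the joiner-side change described in point 2. It assumes the option lives in the `[sst]` section of my.cnf, as the PXC xtrabackup SST documentation describes. The section placement is an assumption, since the actual file is not quoted in this report.

```ini
# my.cnf on the joiner (co3.ourdomain.com)
[sst]
# Match the tar stream sent by the 5.5.29 donor instead of the new default
streamfmt=tar
```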
Joiner log:
130921 16:38:35 [Note] WSREP: New cluster view: global state: cffdaa53-
130921 16:38:35 [Warning] WSREP: Gap in state sequence. Need state transfer.
130921 16:38:37 [Note] WSREP: Running: 'wsrep_
WSREP_SST: [INFO] Streaming with tar (20130921 16:38:37.872)
WSREP_SST: [INFO] Using netcat as streamer (20130921 16:38:37.874)
WSREP_SST: [INFO] Evaluating nc -dl 4444 | pv -f -i 10 -N joiner | tar xfi - ; RC=( ${PIPESTATUS[@]} ) (20130921 16:38:37.881)
130921 16:38:38 [Note] WSREP: Prepared SST request: xtrabackup|
130921 16:38:38 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130921 16:38:38 [Note] WSREP: Assign initial position for certification: 410643189, protocol version: 2
130921 16:38:38 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-
at galera/
130921 16:38:38 [Note] WSREP: Node 4 (co3.ourdomain.com) requested state transfer from '*any*'. Selected 0 (co1.ourdomain.
130921 16:38:38 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 410643189)
130921 16:38:38 [Note] WSREP: Requesting state transfer: success, donor: 0
joiner: 96MB 0:00:10 [ 9.6MB/s] [<=> ]
joiner: 198MB 0:00:20 [10.2MB/s] [ <=> ]
joiner: 296MB 0:00:30 [9.74MB/s] [ <=> ]
joiner: 376MB 0:00:40 [8.02MB/s] [ <=> ]
joiner: 381MB 0:00:40 [9.34MB/s] [ <=> ]
WSREP_SST: [INFO] NOTE: Joiner-Recv-gtid took 41 seconds (20130921 16:39:18.699)
130921 16:39:18 [Note] WSREP: 0 (co1.ourdomain.
130921 16:39:18 [Note] WSREP: Member 0 (co1.ourdomain.com) synced with group.
WSREP_SST: [INFO] Proceeding with SST (20130921 16:39:18.706)
WSREP_SST: [INFO] Cleaning the existing datadir (20130921 16:39:18.707)
WSREP_SST: [INFO] Evaluating nc -dl 4444 | pv -f -i 10 -N joiner | tar xfi - ; RC=( ${PIPESTATUS[@]} ) (20130921 16:39:18.710)
joiner: 0B 0:00:10 [ 0B/s ] [<=> ]
joiner: 0B 0:00:10 [ 0B/s ] [<=> ]
joiner: 0B 0:00:10 [ 0B/s ] [<=> ]
joiner: 0B 0:00:10 [ 0B/s ] [<=> ]
// it continues like that forever
Donor log:
130921 16:38:35 [Note] WSREP: Assign initial position for certification: 410643189, protocol version: 2
130921 16:38:38 [Note] WSREP: Node 4 (co3.ourdomain.com) requested state transfer from '*any*'. Selected 0 (co1.ourdomain.
130921 16:38:38 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 410643189)
130921 16:38:38 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130921 16:38:38 [Note] WSREP: Running: 'wsrep_
130921 16:38:38 [Note] WSREP: sst_donor_thread signaled with 0
130921 16:39:17 [Note] WSREP: Provider paused at cffdaa53-
130921 16:39:18 [Note] WSREP: Provider resumed.
130921 16:39:18 [Note] WSREP: 0 (co1.ourdomain.
130921 16:39:18 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 410650116)
130921 16:39:18 [Note] WSREP: Member 0 (co1.ourdomain.com) synced with group.
130921 16:39:18 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 410650116)
130921 16:39:18 [Note] WSREP: Synchronized with group, ready for connections
130921 16:39:18 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
I am not running any firewall or anything else that could prevent packets from reaching their destination. Before the node was updated to the latest PXC, SST was working just fine.
While the joiner was waiting for data, I checked the donor. It was happily working in SYNCED state, with no trace of any SST in progress.
No processes like tar/nc/
3. At the moment the donor changes status(
2013-09-21 16:39:18.709 WARN: Protocol violation. JOIN message sender 0 (co1.ourdomain.com) is not in state transfer (SYNCED). Message ignored.
Another report of the issue from point 1: http://www.percona.com/forums/questions-discussions/percona-xtradb-cluster/11674-upgrade-from-5-5-31-5-5-33-issue