MySQL patches by Codership

Incompatible SST between 5.5.29 and 5.5.33

Bug #1228618 reported by Tomasz Klekot on 2013-09-21

This bug affects 1 person

	Status	Importance	Assigned to	Milestone
MySQL patches by Codership	New	Undecided	Unassigned
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC	Status tracked in 5.6
5.5	Fix Released	Undecided	Unassigned	Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC 5.5.34-23.7.6
5.6	Fix Released	Undecided	Unassigned	Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC 5.6.14-25.1

Bug Description

We are running a Percona XtraDB Cluster with three nodes and two garbd instances(on 64bit CentOS 6.3), all running PXC 5.5.29. XtraBackup is used for SST.
Due to some unexplained cluster-wide crashes, we decided to perform upgrade to the latest version(5.5.33). After updating one of PXC nodes, it did not manage to sync with the cluster.

Domains: co1.ourdomain.com is donor; co3.ourdomain.com is the upgraded node(joiner).

1. Seems like the new version of PXC(5.5.33) is set by default to expect SST to be done in xb_stream, while the donor(still running 5.5.29) sends it in tar. There is no warning about that in changelog and it totally breaks backward compatibility.

WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20130921 02:41:07.101)
130921 2:41:07 [Note] WSREP: Prepared SST request: xtrabackup|10.32.150.244:4444/xtrabackup_sst
130921 2:41:07 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130921 2:41:07 [Note] WSREP: Assign initial position for certification: 407853908, protocol version: 2
130921 2:41:07 [Note] WSREP: Prepared IST receiver, listening at: tcp://10.32.150.244:4568
130921 2:41:07 [Note] WSREP: Node 0 (co3.ourdomain.com) requested state transfer from '*any*'. Selected 1 (co1.ourdomain.com)(SYNCED) as donor.
130921 2:41:07 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 407855552)
130921 2:41:07 [Note] WSREP: Requesting state transfer: success, donor: 1
xb_stream_read_chunk(): wrong chunk magic at offset 0x0.
2013/09/21 02:41:09 socat[32276] E write(1, 0x16e93e0, 2896): Broken pipe
WSREP_SST: [ERROR] Error while getting data from donor node: exit codes: 1 1 (20130921 02:41:09.742)
WSREP_SST: [ERROR] Cleanup after exit with status:32 (20130921 02:41:09.744)
WSREP_SST: [INFO] Removing the sst_in_progress file (20130921 02:41:09.745)

2. After changing SST settings in my.cnf at joiner (streamfmt=tar), the first data transfer goes through fine. However, joiner gets stuck forever while waiting for donor to send another part of data(I believe binlogs generated during state transfer?):

Joiner log:
130921 16:38:35 [Note] WSREP: New cluster view: global state: cffdaa53-542e-11e2-0800-c339f0c57558:410643189, view# 37: Primary, number of nodes: 5, my index: 4, protocol version 2
130921 16:38:35 [Warning] WSREP: Gap in state sequence. Need state transfer.
130921 16:38:37 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'joiner' --address '10.32.150.244' --auth 'wsrep_sst:xxxxxxxxxxxxxxxxx' --datadir '/var/lib/mysql/data/' --defaults-file '/etc/my.cnf' --parent '3727''
WSREP_SST: [INFO] Streaming with tar (20130921 16:38:37.872)
WSREP_SST: [INFO] Using netcat as streamer (20130921 16:38:37.874)
WSREP_SST: [INFO] Evaluating nc -dl 4444 | pv -f -i 10 -N joiner | tar xfi - ; RC=( ${PIPESTATUS[@]} ) (20130921 16:38:37.881)
130921 16:38:38 [Note] WSREP: Prepared SST request: xtrabackup|10.32.150.244:4444/xtrabackup_sst
130921 16:38:38 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130921 16:38:38 [Note] WSREP: Assign initial position for certification: 410643189, protocol version: 2
130921 16:38:38 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (cffdaa53-542e-11e2-0800-c339f0c57558): 1 (Operation not permitted)
         at galera/src/replicator_str.cpp:prepare_for_IST():436. IST will be unavailable.
130921 16:38:38 [Note] WSREP: Node 4 (co3.ourdomain.com) requested state transfer from '*any*'. Selected 0 (co1.ourdomain.com)(SYNCED) as donor.
130921 16:38:38 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 410643189)
130921 16:38:38 [Note] WSREP: Requesting state transfer: success, donor: 0
   joiner: 96MB 0:00:10 [ 9.6MB/s] [<=> ]
   joiner: 198MB 0:00:20 [10.2MB/s] [ <=> ]
   joiner: 296MB 0:00:30 [9.74MB/s] [ <=> ]
   joiner: 376MB 0:00:40 [8.02MB/s] [ <=> ]
   joiner: 381MB 0:00:40 [9.34MB/s] [ <=> ]
WSREP_SST: [INFO] NOTE: Joiner-Recv-gtid took 41 seconds (20130921 16:39:18.699)
130921 16:39:18 [Note] WSREP: 0 (co1.ourdomain.com): State transfer to 4 (co3.ourdomain.com) complete.
130921 16:39:18 [Note] WSREP: Member 0 (co1.ourdomain.com) synced with group.
WSREP_SST: [INFO] Proceeding with SST (20130921 16:39:18.706)
WSREP_SST: [INFO] Cleaning the existing datadir (20130921 16:39:18.707)
WSREP_SST: [INFO] Evaluating nc -dl 4444 | pv -f -i 10 -N joiner | tar xfi - ; RC=( ${PIPESTATUS[@]} ) (20130921 16:39:18.710)
   joiner: 0B 0:00:10 [ 0B/s ] [<=> ]
   joiner: 0B 0:00:10 [ 0B/s ] [<=> ]
   joiner: 0B 0:00:10 [ 0B/s ] [<=> ]
   joiner: 0B 0:00:10 [ 0B/s ] [<=> ]
// it continues like that forever

Donor log:
130921 16:38:35 [Note] WSREP: Assign initial position for certification: 410643189, protocol version: 2
130921 16:38:38 [Note] WSREP: Node 4 (co3.ourdomain.com) requested state transfer from '*any*'. Selected 0 (co1.ourdomain.com)(SYNCED) as donor.
130921 16:38:38 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 410643189)
130921 16:38:38 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130921 16:38:38 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'donor' --address '10.32.150.244:4444/xtrabackup_sst' --auth 'wsrep_sst:xxxxxxxxxxxxxxxxx' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/data/' --defaults-file '/etc/my.cnf' --gtid 'cffdaa53-542e-11e2-0800-c339f0c57558:410643189''
130921 16:38:38 [Note] WSREP: sst_donor_thread signaled with 0
130921 16:39:17 [Note] WSREP: Provider paused at cffdaa53-542e-11e2-0800-c339f0c57558:410650116
130921 16:39:18 [Note] WSREP: Provider resumed.
130921 16:39:18 [Note] WSREP: 0 (co1.ourdomain.com): State transfer to 4 (co3.ourdomain.com) complete.
130921 16:39:18 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 410650116)
130921 16:39:18 [Note] WSREP: Member 0 (co1.ourdomain.com) synced with group.
130921 16:39:18 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 410650116)
130921 16:39:18 [Note] WSREP: Synchronized with group, ready for connections
130921 16:39:18 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

I am not running any kind of firewalls or anything that could prevent packets reaching their destination. Before updating the node to the latest PXC, SST was working just fine.
As the joiner was waiting for data, I checked the donor. It was happily working in SYNCED state without any traces of any SST being in progress.
No processes like tar/nc/socat/xtrabackup were running on it.

3. At the moment the donor changes status(DONOR/DESYNCED->JOINED or JOINED->SYNCED, not really sure), a weird entry appears in log of garbd instances:
2013-09-21 16:39:18.709 WARN: Protocol violation. JOIN message sender 0 (co1.ourdomain.com) is not in state transfer (SYNCED). Message ignored.

Tags:

Revision history for this message

Tomasz Klekot (tomksoft) wrote on 2013-09-21:

Another report of the issue from point 1: http://www.percona.com/forums/questions-discussions/percona-xtradb-cluster/11674-upgrade-from-5-5-31-5-5-33-issue

Revision history for this message

Alex Yurchenko (ayurchen) wrote on 2013-09-21:

"JOIN message sender 0 (co1.ourdomain.com) is not in state transfer (SYNCED)."
- this message can be safely ignored. There can be several JOIN messages from the same sender.

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-09-21:

@Tomasz,

This is due to https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1222122

It has been documented now here http://www.percona.com/doc/percona-xtradb-cluster/errata.html#incompatibilities

as well as here http://www.percona.com/doc/percona-xtradb-cluster/manual/xtrabackup_sst.html

tags:

added: doc

Revision history for this message

Tomasz Klekot (tomksoft) wrote on 2013-09-21:

I can't really find it in the errata or the SST docs. I understand it is two staged now, but it does not clearly say that the SST compatibility is totally broken. Doesn't it mean that it now became impossible to do zero-downtime PXC upgrade?

Revision history for this message

Tomasz Klekot (tomksoft) wrote on 2013-09-21:

Ah, I can see the Errata has just been updated. Thank you!

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-09-23: Re: [DOC] Incompatible SST between 5.5.29 and 5.5.33

Marking this 'fix committed' since it has been documented for now.

Changed in percona-xtradb-cluster:
status:	New → Fix Committed
summary:	- Incompatible SST between 5.5.29 and 5.5.33 + [DOC] Incompatible SST between 5.5.29 and 5.5.33

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-10-16:

I have added the older xtrabackup as xtrabackup-v1 (so you can give as wsrep_sst_donor=xtrabackup-v1).

summary:

- [DOC] Incompatible SST between 5.5.29 and 5.5.33
+ Incompatible SST between 5.5.29 and 5.5.33

Revision history for this message

Shahriyar Rzayev (rzayev-sehriyar) wrote on 2018-01-18:

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1444

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.