Odd flow control behavior of a Donor node

Bug #1237500 reported by Jay Janssen
Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC)
Status: Confirmed
Importance: Medium
Assigned to: Unassigned

Bug Description

I have a two node cluster taking some sysbench update workload on the first node:

mycluster / ip-10-28-26-53 / Galera 2.7(r157)
Wsrep Cluster Node Queue Ops Bytes Flow Conflct PApply Commit
    time P cnf # cmt sta Up Dn Up Dn Up Dn pau snt lcf bfa dst oooe oool wind
16:13:49 P 4 2 Sync T/T 61 0 2k 3 907K 24.0 0.3 0 0 0 719 15 0 2
16:13:51 P 4 2 Sync T/T 0 0 2k 3 1.1M 24.0 0.2 0 0 0 554 6 0 2
16:13:52 P 4 2 Sync T/T 61 0 2k 1 915K 8.0 0.4 0 0 0 595 28 0 4
16:13:53 P 4 2 Sync T/T 20 0 1k 4 777K 32.0 0.2 0 0 0 519 25 0 5
16:13:55 P 4 2 Sync T/T 0 0 2k 2 994K 16.0 0.1 0 0 0 973 3 0 2
16:13:57 P 4 2 Sync T/T 61 0 2k 4 1.1M 32.0 0.2 0 0 0 359 4 0 2
16:13:58 P 4 2 Sync T/T 64 0 2k 1 845K 8.0 0.4 0 0 0 532 8 0 1
16:14:00 P 4 2 Sync T/T 0 0 2k 5 1.2M 40.0 0.2 0 0 0 736 8 0 2

There is some flow control, but the cluster is operating normally.
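
The same thing can also be checked from the node itself via the standard wsrep status counters (illustrative queries only, not output I captured during this run):

show global status like 'wsrep_flow_control_%';   -- the _paused, _sent and _recv counters behind the "pau"/"snt" columns above
show global status like 'wsrep_local_recv_queue'; -- depth of the replication queue on this node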

Now, on the second node I execute this:

node2> set global wsrep_desync=ON; flush tables with read lock; select sleep( 60 ); unlock tables; set global wsrep_desync=OFF; show global status like 'wsrep_local_recv_queue';
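
For readability, here is the same sequence broken out statement by statement (identical SQL; the comments are just my reading of what each step does):

set global wsrep_desync=ON;    -- node desyncs from the cluster, so it no longer sends flow control for its own queue
flush tables with read lock;   -- blocks the applier, so the recv queue starts to build
select sleep( 60 );            -- hold the lock for a minute
unlock tables;                 -- the applier can drain the backlog again
set global wsrep_desync=OFF;   -- request to rejoin Synced once caught up
show global status like 'wsrep_local_recv_queue';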

I can see flow control immediately stop and the queue start to grow on node2:

node1:

16:14:02 P 4 2 Sync T/T 9 0 2k 0 837K 0 0.0 0 0 0 841 17 0 5
16:14:03 P 4 2 Sync T/T 0 0 2k 0 1006K 0 0.0 0 0 0 1k 7 0 2
16:14:05 P 4 2 Sync T/T 0 0 2k 0 1012K 0 0.0 0 0 0 2k 11 0 4
16:14:07 P 4 2 Sync T/T 7 0 2k 0 1.1M 0 0.0 0 0 0 2k 10 0 3
16:14:09 P 4 2 Sync T/T 0 0 2k 0 810K 0 0.0 0 0 0 3k 7 0 2
16:14:10 P 4 2 Sync T/T 0 0 1k 0 655K 0 0.0 0 0 0 3k 41 1 9
16:14:11 P 4 2 Sync T/T 0 0 2k 0 1.0M 0 0.0 0 0 0 3k 26 0 6

node2:

mycluster / ip-10-28-203-10 / Galera 2.7(r157)
Wsrep Cluster Node Queue Ops Bytes Flow Conflct PApply Commit
    time P cnf # cmt sta Up Dn Up Dn Up Dn pau snt lcf bfa dst oooe oool wind
16:13:54 P 4 2 Sync T/T 0 0 0 1k 0 739K 0.2 17 0 0 482 73 1 3
16:13:55 P 4 2 Sync T/T 0 0 0 1k 0 783K 0.1 11 0 0 531 46 1 2
16:13:56 P 4 2 Sync T/T 0 0 0 1k 0 688K 0.1 12 0 0 691 59 1 2
16:13:57 P 4 2 Sync T/T 0 0 0 1k 0 643K 0.1 17 0 0 349 72 1 3
16:13:58 P 4 2 Sync T/T 0 0 0 1k 0 714K 0.1 16 0 0 478 74 1 3
16:13:59 P 4 2 Sync T/T 0 0 0 1k 0 618K 0.2 16 0 0 546 80 1 3
16:14:01 P 4 2 Sync T/T 0 0 0 1k 0 712K 0.2 15 0 0 643 84 1 3
16:14:02 P 4 2 Dono T/T 0 404 0 866 0 459K 0.0 11 0 0 579 75 1 3
16:14:03 P 4 2 Dono T/T 0 2k 0 0 0 0 0.0 0 0 0 579 0 0 0
16:14:04 P 4 2 Dono T/T 0 3k 0 0 0 0 0.0 0 0 0 579 0 0 0
16:14:05 P 4 2 Dono T/T 0 4k 0 0 0 0 0.0 0 0 0 579 0 0 0
16:14:06 P 4 2 Dono T/T 0 6k 0 0 0 0 0.0 0 0 0 579 0 0 0
16:14:07 P 4 2 Dono T/T 0 7k 0 0 0 0 0.0 0 0 0 579 0 0 0
16:14:09 P 4 2 Dono T/T 0 8k 0 0 0 0 0.0 0 0 0 579 0 0 0
16:14:10 P 4 2 Dono T/T 0 10k 0 0 0 0 0.0 0 0 0 579 0 0 0

After the SELECT SLEEP(60) finishes, the tables are unlocked and wsrep_desync is turned back OFF; I immediately see node2 start to apply its backlog while staying in the Donor/Desynced state:

mycluster / ip-10-28-203-10 / Galera 2.7(r157)
Wsrep Cluster Node Queue Ops Bytes Flow Conflct PApply Commit
    time P cnf # cmt sta Up Dn Up Dn Up Dn pau snt lcf bfa dst oooe oool wind
16:15:01 P 4 2 Dono T/T 0 68k 0 0 0 0 0.0 0 0 0 579 0 0 0
16:15:02 P 4 2 Dono T/T 0 67k 0 2k 0 1.1M 0.2 12 0 0 1k 89 0 3
16:15:03 P 4 2 Dono T/T 0 66k 0 2k 0 1.2M 0.2 14 0 0 2k 99 1 4
16:15:05 P 4 2 Dono T/T 0 64k 0 3k 0 1.6M 0.3 16 0 0 3k 94 1 4
16:15:06 P 4 2 Dono T/T 0 63k 0 3k 0 1.6M 0.5 15 0 0 3k 97 1 4
16:15:07 P 4 2 Dono T/T 0 61k 0 3k 0 1.3M 0.4 12 0 0 4k 95 1 4
16:15:08 P 4 2 Dono T/T 0 60k 0 3k 0 1.6M 0.2 16 0 0 4k 100 1 4
16:15:10 P 4 2 Dono T/T 0 58k 0 3k 0 1.6M 0.3 15 0 0 5k 100 1 4
16:15:11 P 4 2 Dono T/T 0 56k 0 3k 0 1.6M 0.2 15 0 0 6k 96 1 3
16:15:12 P 4 2 Dono T/T 0 55k 0 2k 0 1.3M 0.4 15 0 0 6k 96 1 3

But I also see flow control messages being sent now from this node.
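
A way to watch this from the node itself (illustrative queries only; the comments describe what the output above shows, not captured values):

show global status like 'wsrep_local_state_comment'; -- still reports Donor/Desynced while it catches up
show global status like 'wsrep_flow_control_sent';   -- yet this counter keeps climbing
show global status like 'wsrep_local_recv_queue';    -- backlog still in the tens of thousands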

Additionally, when the node is about half-way caught up, I see it change from Donor to Joined:

16:15:27 P 4 2 Dono T/T 0 34k 0 3k 0 1.5M 0.2 15 0 0 6k 94 1 3
16:15:28 P 4 2 Dono T/T 0 33k 0 3k 0 1.6M 0.3 13 0 0 6k 100 1 4
16:15:29 P 4 2 Dono T/T 0 31k 0 3k 0 1.7M 0.2 14 0 0 6k 99 1 4
16:15:30 P 4 2 Join T/T 0 29k 0 3k 0 1.4M 0.4 10 0 0 6k 97 1 4
16:15:32 P 4 2 Join T/T 0 26k 0 4k 0 2.0M 0.3 21 0 0 6k 92 1 3
16:15:33 P 4 2 Join T/T 0 24k 0 3k 0 1.6M 0.2 16 0 0 6k 96 1 4
16:15:34 P 4 2 Join T/T 0 24k 0 2k 0 1018K 0.7 11 0 0 6k 96 1 4
16:15:35 P 4 2 Join T/T 0 22k 0 3k 0 1.6M 0.4 15 0 0 6k 94 1 3
16:15:37 P 4 2 Join T/T 0 20k 0 3k 0 1.5M 0.2 15 0 0 6k 86 1 3

I have not changed any FC settings: gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO;
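
(For reference, if I had changed them, those values would be set through wsrep_provider_options; this is just the syntax, shown with the same defaults I'm running:)

set global wsrep_provider_options = 'gcs.fc_limit = 16; gcs.fc_factor = 1.0';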

And the config is nothing fancy:

[mysqld]
datadir = /var/lib/mysql
log_error = error.log

binlog_format = ROW

innodb_buffer_pool_size = 128M
innodb_log_file_size = 64M
innodb_flush_method = O_DIRECT
innodb_file_per_table
innodb_flush_log_at_trx_commit = 0

wsrep_cluster_name = mycluster
wsrep_node_name = ip-10-28-203-10

wsrep_provider = /usr/lib64/libgalera_smm.so
wsrep_provider_options = ""

wsrep_slave_threads = 4
wsrep_auto_increment_control = ON

wsrep_sst_method = xtrabackup
wsrep_sst_auth = sst:secret

innodb_locks_unsafe_for_binlog = 1
innodb_autoinc_lock_mode = 2

So, my questions (and what I think may be bugs):
1) Why is flow control being sent from node2 before it is back in Synced state?
2) Why does the node flip from Donor/Desynced to the Joined state when it does (about half-way through applying its backlog)?

Jay Janssen (jay-janssen) wrote :

Attaching a file for fixed width viewing.

Interestingly, if I do not set wsrep_desync=OFF right away, I don't see the flow control issue.
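
For comparison, the ordering that does not trigger it looks roughly like this (same statements, but waiting for the queue to drain before re-syncing):

set global wsrep_desync=ON;
flush tables with read lock;
select sleep( 60 );
unlock tables;
-- wait here until wsrep_local_recv_queue drops back toward 0 ...
show global status like 'wsrep_local_recv_queue';
-- ... and only then re-sync:
set global wsrep_desync=OFF;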

Changed in percona-xtradb-cluster:
status: New → Confirmed
importance: Undecided → Medium
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PXC-1082
