Activity log for bug #1237500

Date Who What changed Old value New value Message
2013-10-09 16:21:52 Jay Janssen bug added bug
2013-10-09 16:22:21 Jay Janssen description I have a two node cluster taking some sysbench update workload on the first node: mycluster / ip-10-28-26-53 / Galera 2.7(r157) Wsrep Cluster Node Queue Ops Bytes Flow Conflct PApply Commit time P cnf # cmt sta Up Dn Up Dn Up Dn pau snt lcf bfa dst oooe oool wind 16:13:49 P 4 2 Sync T/T 61 0 2k 3 907K 24.0 0.3 0 0 0 719 15 0 2 16:13:51 P 4 2 Sync T/T 0 0 2k 3 1.1M 24.0 0.2 0 0 0 554 6 0 2 16:13:52 P 4 2 Sync T/T 61 0 2k 1 915K 8.0 0.4 0 0 0 595 28 0 4 16:13:53 P 4 2 Sync T/T 20 0 1k 4 777K 32.0 0.2 0 0 0 519 25 0 5 16:13:55 P 4 2 Sync T/T 0 0 2k 2 994K 16.0 0.1 0 0 0 973 3 0 2 16:13:57 P 4 2 Sync T/T 61 0 2k 4 1.1M 32.0 0.2 0 0 0 359 4 0 2 16:13:58 P 4 2 Sync T/T 64 0 2k 1 845K 8.0 0.4 0 0 0 532 8 0 1 16:14:00 P 4 2 Sync T/T 0 0 2k 5 1.2M 40.0 0.2 0 0 0 736 8 0 2 There is some flow control, but the cluster is operating normally. Now, on the second node I execute this: node2> set global wsrep_desync=ON; flush tables with read lock; select sleep( 60 ); unlock tables; set global wsrep_desync=OFF; show global status like 'wsrep_local_recv_queue'; I can see flow control immediately stop and the queue start to grow on node2: node1: 16:14:02 P 4 2 Sync T/T 9 0 2k 0 837K 0 0.0 0 0 0 841 17 0 5 16:14:03 P 4 2 Sync T/T 0 0 2k 0 1006K 0 0.0 0 0 0 1k 7 0 2 16:14:05 P 4 2 Sync T/T 0 0 2k 0 1012K 0 0.0 0 0 0 2k 11 0 4 16:14:07 P 4 2 Sync T/T 7 0 2k 0 1.1M 0 0.0 0 0 0 2k 10 0 3 16:14:09 P 4 2 Sync T/T 0 0 2k 0 810K 0 0.0 0 0 0 3k 7 0 2 16:14:10 P 4 2 Sync T/T 0 0 1k 0 655K 0 0.0 0 0 0 3k 41 1 9 16:14:11 P 4 2 Sync T/T 0 0 2k 0 1.0M 0 0.0 0 0 0 3k 26 0 6 node2: mycluster / ip-10-28-203-10 / Galera 2.7(r157) Wsrep Cluster Node Queue Ops Bytes Flow Conflct PApply Commit time P cnf # cmt sta Up Dn Up Dn Up Dn pau snt lcf bfa dst oooe oool wind 16:13:54 P 4 2 Sync T/T 0 0 0 1k 0 739K 0.2 17 0 0 482 73 1 3 16:13:55 P 4 2 Sync T/T 0 0 0 1k 0 783K 0.1 11 0 0 531 46 1 2 16:13:56 P 4 2 Sync T/T 0 0 0 1k 0 688K 0.1 12 0 0 691 59 1 2 16:13:57 P 4 2 Sync T/T 0 0 0 1k 0 643K 0.1 17 0 0 349 72 1 3 16:13:58 P 4 2 Sync T/T 0 0 0 1k 0 714K 0.1 16 0 0 478 74 1 3 16:13:59 P 4 2 Sync T/T 0 0 0 1k 0 618K 0.2 16 0 0 546 80 1 3 16:14:01 P 4 2 Sync T/T 0 0 0 1k 0 712K 0.2 15 0 0 643 84 1 3 16:14:02 P 4 2 Dono T/T 0 404 0 866 0 459K 0.0 11 0 0 579 75 1 3 16:14:03 P 4 2 Dono T/T 0 2k 0 0 0 0 0.0 0 0 0 579 0 0 0 16:14:04 P 4 2 Dono T/T 0 3k 0 0 0 0 0.0 0 0 0 579 0 0 0 16:14:05 P 4 2 Dono T/T 0 4k 0 0 0 0 0.0 0 0 0 579 0 0 0 16:14:06 P 4 2 Dono T/T 0 6k 0 0 0 0 0.0 0 0 0 579 0 0 0 16:14:07 P 4 2 Dono T/T 0 7k 0 0 0 0 0.0 0 0 0 579 0 0 0 16:14:09 P 4 2 Dono T/T 0 8k 0 0 0 0 0.0 0 0 0 579 0 0 0 16:14:10 P 4 2 Dono T/T 0 10k 0 0 0 0 0.0 0 0 0 579 0 0 0 After the SELECT SLEEP(60) finishes, the tables are unlocked and wsrep_desync is turned back OFF, I see immediately see node2 start to apply its backlog while staying in the Donor/Desynced state: mycluster / ip-10-28-203-10 / Galera 2.7(r157) Wsrep Cluster Node Queue Ops Bytes Flow Conflct PApply Commit time P cnf # cmt sta Up Dn Up Dn Up Dn pau snt lcf bfa dst oooe oool wind 16:15:01 P 4 2 Dono T/T 0 68k 0 0 0 0 0.0 0 0 0 579 0 0 0 16:15:02 P 4 2 Dono T/T 0 67k 0 2k 0 1.1M 0.2 12 0 0 1k 89 0 3 16:15:03 P 4 2 Dono T/T 0 66k 0 2k 0 1.2M 0.2 14 0 0 2k 99 1 4 16:15:05 P 4 2 Dono T/T 0 64k 0 3k 0 1.6M 0.3 16 0 0 3k 94 1 4 16:15:06 P 4 2 Dono T/T 0 63k 0 3k 0 1.6M 0.5 15 0 0 3k 97 1 4 16:15:07 P 4 2 Dono T/T 0 61k 0 3k 0 1.3M 0.4 12 0 0 4k 95 1 4 16:15:08 P 4 2 Dono T/T 0 60k 0 3k 0 1.6M 0.2 16 0 0 4k 100 1 4 16:15:10 P 4 2 Dono T/T 0 58k 0 3k 0 1.6M 0.3 15 0 0 5k 100 1 4 16:15:11 P 4 2 Dono T/T 0 56k 0 3k 0 1.6M 0.2 15 0 0 6k 96 1 3 16:15:12 P 4 2 Dono T/T 0 55k 0 2k 0 1.3M 0.4 15 0 0 6k 96 1 3 But I also see flow control messages being sent now from this node. Additionally, when the node is about half-way caught up, I see it change from Donor to Joiner: 16:15:27 P 4 2 Dono T/T 0 34k 0 3k 0 1.5M 0.2 15 0 0 6k 94 1 3 16:15:28 P 4 2 Dono T/T 0 33k 0 3k 0 1.6M 0.3 13 0 0 6k 100 1 4 16:15:29 P 4 2 Dono T/T 0 31k 0 3k 0 1.7M 0.2 14 0 0 6k 99 1 4 16:15:30 P 4 2 Join T/T 0 29k 0 3k 0 1.4M 0.4 10 0 0 6k 97 1 4 16:15:32 P 4 2 Join T/T 0 26k 0 4k 0 2.0M 0.3 21 0 0 6k 92 1 3 16:15:33 P 4 2 Join T/T 0 24k 0 3k 0 1.6M 0.2 16 0 0 6k 96 1 4 16:15:34 P 4 2 Join T/T 0 24k 0 2k 0 1018K 0.7 11 0 0 6k 96 1 4 16:15:35 P 4 2 Join T/T 0 22k 0 3k 0 1.6M 0.4 15 0 0 6k 94 1 3 16:15:37 P 4 2 Join T/T 0 20k 0 3k 0 1.5M 0.2 15 0 0 6k 86 1 3 I have not changed any FC settings: gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; And the config is nothing fancy: [mysqld] datadir = /var/lib/mysql log_error = error.log binlog_format = ROW innodb_buffer_pool_size = 128M innodb_log_file_size = 64M innodb_flush_method = O_DIRECT innodb_file_per_table innodb_flush_log_at_trx_commit = 0 wsrep_cluster_name = mycluster wsrep_node_name = ip-10-28-203-10 wsrep_provider = /usr/lib64/libgalera_smm.so wsrep_provider_options = "" wsrep_slave_threads = 4 wsrep_auto_increment_control = ON wsrep_sst_method = xtrabackup wsrep_sst_auth = sst:secret innodb_locks_unsafe_for_binlog = 1 innodb_autoinc_lock_mode = 2 So, my questions (and what I think may be bugs): 1) Why is flow control being sent from node2 before it is back in Synced state? 2) Why is the node flipping from Donor/Desynced back to Joined state when it does? I have a two node cluster taking some sysbench update workload on the first node: mycluster / ip-10-28-26-53 / Galera 2.7(r157) Wsrep Cluster Node Queue Ops Bytes Flow Conflct PApply Commit     time P cnf # cmt sta Up Dn Up Dn Up Dn pau snt lcf bfa dst oooe oool wind 16:13:49 P 4 2 Sync T/T 61 0 2k 3 907K 24.0 0.3 0 0 0 719 15 0 2 16:13:51 P 4 2 Sync T/T 0 0 2k 3 1.1M 24.0 0.2 0 0 0 554 6 0 2 16:13:52 P 4 2 Sync T/T 61 0 2k 1 915K 8.0 0.4 0 0 0 595 28 0 4 16:13:53 P 4 2 Sync T/T 20 0 1k 4 777K 32.0 0.2 0 0 0 519 25 0 5 16:13:55 P 4 2 Sync T/T 0 0 2k 2 994K 16.0 0.1 0 0 0 973 3 0 2 16:13:57 P 4 2 Sync T/T 61 0 2k 4 1.1M 32.0 0.2 0 0 0 359 4 0 2 16:13:58 P 4 2 Sync T/T 64 0 2k 1 845K 8.0 0.4 0 0 0 532 8 0 1 16:14:00 P 4 2 Sync T/T 0 0 2k 5 1.2M 40.0 0.2 0 0 0 736 8 0 2 There is some flow control, but the cluster is operating normally. Now, on the second node I execute this: node2> set global wsrep_desync=ON; flush tables with read lock; select sleep( 60 ); unlock tables; set global wsrep_desync=OFF; show global status like 'wsrep_local_recv_queue'; I can see flow control immediately stop and the queue start to grow on node2: node1: 16:14:02 P 4 2 Sync T/T 9 0 2k 0 837K 0 0.0 0 0 0 841 17 0 5 16:14:03 P 4 2 Sync T/T 0 0 2k 0 1006K 0 0.0 0 0 0 1k 7 0 2 16:14:05 P 4 2 Sync T/T 0 0 2k 0 1012K 0 0.0 0 0 0 2k 11 0 4 16:14:07 P 4 2 Sync T/T 7 0 2k 0 1.1M 0 0.0 0 0 0 2k 10 0 3 16:14:09 P 4 2 Sync T/T 0 0 2k 0 810K 0 0.0 0 0 0 3k 7 0 2 16:14:10 P 4 2 Sync T/T 0 0 1k 0 655K 0 0.0 0 0 0 3k 41 1 9 16:14:11 P 4 2 Sync T/T 0 0 2k 0 1.0M 0 0.0 0 0 0 3k 26 0 6 node2: mycluster / ip-10-28-203-10 / Galera 2.7(r157) Wsrep Cluster Node Queue Ops Bytes Flow Conflct PApply Commit     time P cnf # cmt sta Up Dn Up Dn Up Dn pau snt lcf bfa dst oooe oool wind 16:13:54 P 4 2 Sync T/T 0 0 0 1k 0 739K 0.2 17 0 0 482 73 1 3 16:13:55 P 4 2 Sync T/T 0 0 0 1k 0 783K 0.1 11 0 0 531 46 1 2 16:13:56 P 4 2 Sync T/T 0 0 0 1k 0 688K 0.1 12 0 0 691 59 1 2 16:13:57 P 4 2 Sync T/T 0 0 0 1k 0 643K 0.1 17 0 0 349 72 1 3 16:13:58 P 4 2 Sync T/T 0 0 0 1k 0 714K 0.1 16 0 0 478 74 1 3 16:13:59 P 4 2 Sync T/T 0 0 0 1k 0 618K 0.2 16 0 0 546 80 1 3 16:14:01 P 4 2 Sync T/T 0 0 0 1k 0 712K 0.2 15 0 0 643 84 1 3 16:14:02 P 4 2 Dono T/T 0 404 0 866 0 459K 0.0 11 0 0 579 75 1 3 16:14:03 P 4 2 Dono T/T 0 2k 0 0 0 0 0.0 0 0 0 579 0 0 0 16:14:04 P 4 2 Dono T/T 0 3k 0 0 0 0 0.0 0 0 0 579 0 0 0 16:14:05 P 4 2 Dono T/T 0 4k 0 0 0 0 0.0 0 0 0 579 0 0 0 16:14:06 P 4 2 Dono T/T 0 6k 0 0 0 0 0.0 0 0 0 579 0 0 0 16:14:07 P 4 2 Dono T/T 0 7k 0 0 0 0 0.0 0 0 0 579 0 0 0 16:14:09 P 4 2 Dono T/T 0 8k 0 0 0 0 0.0 0 0 0 579 0 0 0 16:14:10 P 4 2 Dono T/T 0 10k 0 0 0 0 0.0 0 0 0 579 0 0 0 After the SELECT SLEEP(60) finishes, the tables are unlocked and wsrep_desync is turned back OFF, I see immediately see node2 start to apply its backlog while staying in the Donor/Desynced state: mycluster / ip-10-28-203-10 / Galera 2.7(r157) Wsrep Cluster Node Queue Ops Bytes Flow Conflct PApply Commit     time P cnf # cmt sta Up Dn Up Dn Up Dn pau snt lcf bfa dst oooe oool wind 16:15:01 P 4 2 Dono T/T 0 68k 0 0 0 0 0.0 0 0 0 579 0 0 0 16:15:02 P 4 2 Dono T/T 0 67k 0 2k 0 1.1M 0.2 12 0 0 1k 89 0 3 16:15:03 P 4 2 Dono T/T 0 66k 0 2k 0 1.2M 0.2 14 0 0 2k 99 1 4 16:15:05 P 4 2 Dono T/T 0 64k 0 3k 0 1.6M 0.3 16 0 0 3k 94 1 4 16:15:06 P 4 2 Dono T/T 0 63k 0 3k 0 1.6M 0.5 15 0 0 3k 97 1 4 16:15:07 P 4 2 Dono T/T 0 61k 0 3k 0 1.3M 0.4 12 0 0 4k 95 1 4 16:15:08 P 4 2 Dono T/T 0 60k 0 3k 0 1.6M 0.2 16 0 0 4k 100 1 4 16:15:10 P 4 2 Dono T/T 0 58k 0 3k 0 1.6M 0.3 15 0 0 5k 100 1 4 16:15:11 P 4 2 Dono T/T 0 56k 0 3k 0 1.6M 0.2 15 0 0 6k 96 1 3 16:15:12 P 4 2 Dono T/T 0 55k 0 2k 0 1.3M 0.4 15 0 0 6k 96 1 3 But I also see flow control messages being sent now from this node. Additionally, when the node is about half-way caught up, I see it change from Donor to Joiner: 16:15:27 P 4 2 Dono T/T 0 34k 0 3k 0 1.5M 0.2 15 0 0 6k 94 1 3 16:15:28 P 4 2 Dono T/T 0 33k 0 3k 0 1.6M 0.3 13 0 0 6k 100 1 4 16:15:29 P 4 2 Dono T/T 0 31k 0 3k 0 1.7M 0.2 14 0 0 6k 99 1 4 16:15:30 P 4 2 Join T/T 0 29k 0 3k 0 1.4M 0.4 10 0 0 6k 97 1 4 16:15:32 P 4 2 Join T/T 0 26k 0 4k 0 2.0M 0.3 21 0 0 6k 92 1 3 16:15:33 P 4 2 Join T/T 0 24k 0 3k 0 1.6M 0.2 16 0 0 6k 96 1 4 16:15:34 P 4 2 Join T/T 0 24k 0 2k 0 1018K 0.7 11 0 0 6k 96 1 4 16:15:35 P 4 2 Join T/T 0 22k 0 3k 0 1.6M 0.4 15 0 0 6k 94 1 3 16:15:37 P 4 2 Join T/T 0 20k 0 3k 0 1.5M 0.2 15 0 0 6k 86 1 3 I have not changed any FC settings: gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; And the config is nothing fancy: [mysqld] datadir = /var/lib/mysql log_error = error.log binlog_format = ROW innodb_buffer_pool_size = 128M innodb_log_file_size = 64M innodb_flush_method = O_DIRECT innodb_file_per_table innodb_flush_log_at_trx_commit = 0 wsrep_cluster_name = mycluster wsrep_node_name = ip-10-28-203-10 wsrep_provider = /usr/lib64/libgalera_smm.so wsrep_provider_options = "" wsrep_slave_threads = 4 wsrep_auto_increment_control = ON wsrep_sst_method = xtrabackup wsrep_sst_auth = sst:secret innodb_locks_unsafe_for_binlog = 1 innodb_autoinc_lock_mode = 2 So, my questions (and what I think may be bugs): 1) Why is flow control being sent from node2 before it is back in Synced state? 2) Why is the node flipping from Donor/Desynced back to Joined state when it does?
2013-10-09 16:36:20 Jay Janssen attachment added wsrep_desync.txt https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1237500/+attachment/3869341/+files/wsrep_desync.txt
2015-11-17 09:05:40 Krunal Bauskar percona-xtradb-cluster: status New Confirmed
2015-11-17 09:05:43 Krunal Bauskar percona-xtradb-cluster: importance Undecided Medium