Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC

Bug #1522385
Comment #1

Comment 1 for bug 1522385

Revision history for this message

Krunal Bauskar (krunal-bauskar) wrote on 2015-12-17:

Scenario-1:
=========
I tried reproducing the issue and found some interesting observation.
Let me list them down.
1. Started master server. Some important configuration to note.
[mysqld]
log-bin=mysql-bin
server-id=1
log-slave-updates=true
gtid-mode=on
enforce-gtid-consistency=true
-------------
2. Started pxc-node-1 and 2.
[mysqld]
log-bin=mysql-bin
server-id=2
log-slave-updates=true
gtid-mode=on
enforce-gtid-consistency=true
-------------
3. Select one of the pxc-node (node-1 in my case) and make it an async slave
point to master that we started in step-1
------------
4. Execute following workload on master
create table t (i int) engine=innodb;
insert into t values (1), (2), (3), (4);
select * from t;
As expected I gtid has been properly captured to cover the needed actions:
mysql> show global variables like 'gtid_executed';
gtid_executed e20191fe-9fdc-11e5-9ff6-340286b888be:1-5
------------
5. Moving to slave. Check if the async-replication was done.
(Same can be confirmed out even on node-2 which is replicting from node-1 using galera)
mysql> show global variables like 'gtid_executed';
gtid_executed e20191fe-9fdc-11e5-9ff6-340286b888be:4-5
mysql> select * from t;
------
i
------
1
2
3
4
------
4 rows in set (0.00 sec)
------------
6. Execute following workload on node-1.
delete from t where i = 2;
This action too is reflected as expected (on node-2 also)
mysql> show global variables like 'gtid_executed';
gtid_executed 122fdc30-6023-ee1a-7a64-f545ef74c7ac:1,
e20191fe-9fdc-11e5-9ff6-340286b888be:4-5
Note: There are 2 uuid. One is GTID from MySQL other is GTID/UUID from Galera and both has different
seqno to follow. Also, note that by directly modifying slave we have put slave and master out of sync.
------------
7. Run the same delete statement on master. Statement will pass on master and seqno will increment
as expected.
mysql> show global variables like 'gtid_executed';
gtid_executed e20191fe-9fdc-11e5-9ff6-340286b888be:1-6
But when this stmt is replicated on async slave then the record is not found which will cause
slave thrd to generate following error and stop SQL slave thread.
<snippet>
2015-12-11 13:31:32 8881 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.t; Can't find record in 't', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's master log mysql-bin.000001, end_log_pos 1645, Error_code: 1032
2015-12-11 13:31:32 8881 [Warning] Slave: Can't find record in 't' Error_code: 1032
2015-12-11 13:31:32 8881 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000001' position 1441
</snippet>
This is expected as user has maliciously madde them inconsistent.
-------------
Now onwards master and slave are out of sync.
Master ops can continues and seqno increases but doesn't affect async slave which in this case is galera-node-1.
Operation on galera-node-1 are logged using separate uuid which are replicated as expected.
This behavior looks perfectly fine and inline with understanding.
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Scenario-2:
=========
This time we tried running the slave with --slave_skip_errors=1032 that will cause it to ignore the error if encountered.
This has following effect
master:
mysql> show global variables like 'gtid_executed';
gtid_executed 9590be0f-9fe4-11e5-a029-340286b888be:1-6
node-1:
mysql> show global variables like 'gtid_executed';
gtid_executed 6392b39e-601b-ee1a-6579-dd7933d63aeb:1,
9590be0f-9fe4-11e5-a029-340286b888be:4-6
node-2:
mysql> show global variables like 'gtid_executed';
gtid_executed 6392b39e-601b-ee1a-6579-dd7933d63aeb:1,
9590be0f-9fe4-11e5-a029-340286b888be:4-5
Note: Node-2 hasn't seen the dummy trx that node-1 has and so the seqno is still limited to 5.

Scenario-1:
=========
I tried reproducing the issue and found some interesting observation.
Let me list them down.
1. Started master server. Some important configuration to note.
[mysqld]
log-bin=mysql-bin
server-id=1
log-slave-updates=true
gtid-mode=on
enforce-gtid-consistency=true
-------------
2. Started pxc-node-1 and 2.
[mysqld]
log-bin=mysql-bin
server-id=2
log-slave-updates=true
gtid-mode=on
enforce-gtid-consistency=true
-------------
3. Select one of the pxc-node (node-1 in my case) and make it an async slave
point to master that we started in step-1
------------
4. Execute following workload on master
create table t (i int) engine=innodb;
insert into t values (1), (2), (3), (4);
select * from t;
As expected I gtid has been properly captured to cover the needed actions:
mysql> show global variables like 'gtid_executed';
gtid_executed	e20191fe-9fdc-11e5-9ff6-340286b888be:1-5
------------
5. Moving to slave. Check if the async-replication was done.
(Same can be confirmed out even on node-2 which is replicting from node-1 using galera)
mysql> show global variables like 'gtid_executed';
gtid_executed	e20191fe-9fdc-11e5-9ff6-340286b888be:4-5
mysql> select * from t;
------
i
------
1
2
3
4
------
4 rows in set (0.00 sec)
------------
6. Execute following workload on node-1.
delete from t where i = 2;
This action too is reflected as expected (on node-2 also)
mysql> show global variables like 'gtid_executed';
gtid_executed	122fdc30-6023-ee1a-7a64-f545ef74c7ac:1,
e20191fe-9fdc-11e5-9ff6-340286b888be:4-5
Note: There are 2 uuid. One is GTID from MySQL other is GTID/UUID from Galera and both has different
seqno to follow. Also, note that by directly modifying slave we have put slave and master out of sync.
------------
7. Run the same delete statement on master. Statement will pass on master and seqno will increment
as expected.
mysql> show global variables like 'gtid_executed';
gtid_executed	e20191fe-9fdc-11e5-9ff6-340286b888be:1-6
But when this stmt is replicated on async slave then the record is not found which will cause
slave thrd to generate following error and stop SQL slave thread.
<snippet>
2015-12-11 13:31:32 8881 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.t; Can't find record in 't', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's master log mysql-bin.000001, end_log_pos 1645, Error_code: 1032
2015-12-11 13:31:32 8881 [Warning] Slave: Can't find record in 't' Error_code: 1032
2015-12-11 13:31:32 8881 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000001' position 1441
</snippet>
This is expected as user has maliciously madde them inconsistent.
-------------
Now onwards master and slave are out of sync.
Master ops can continues and seqno increases but doesn't affect async slave which in this case is galera-node-1.
Operation on galera-node-1 are logged using separate uuid which are replicated as expected.
This behavior looks perfectly fine and inline with understanding.
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Scenario-2:
=========
This time we tried running the slave with --slave_skip_errors=1032 that will cause it to ignore the error if encountered.
This has following effect
master:
mysql> show global variables like 'gtid_executed';
gtid_executed	9590be0f-9fe4-11e5-a029-340286b888be:1-6
node-1:
mysql> show global variables like 'gtid_executed';
gtid_executed	6392b39e-601b-ee1a-6579-dd7933d63aeb:1,
9590be0f-9fe4-11e5-a029-340286b888be:4-6
node-2:
mysql> show global variables like 'gtid_executed';
gtid_executed	6392b39e-601b-ee1a-6579-dd7933d63aeb:1,
9590be0f-9fe4-11e5-a029-340286b888be:4-5
Note: Node-2 hasn't seen the dummy trx that node-1 has and so the seqno is still limited to 5.