Comment 6 for bug 1735555

Revision history for this message
Vlad Lesin (vlad-lesin) wrote :

There is some progress in understanding the case.

Here https://github.com/vlad-lesin/percona-server/tree/lp-1735555-xa-transaction-lock-5.7.19-logging-and-stable-test is the branch with additional logging and some facilities for debugging. It's based on 5.7.19 and all the following analysis is for 5.7.19 too. There is also stable mtr test suite which uses DEBUG_SYNC and DEBUG system variables. The code is quite dirty, so don't pay attention on cosmetics, it's just for debugging.

The first, the bug description is the same as I see in the error log after I added some diagnostics.

What is happening:

1) When slave thread executes "INSERT INTO t2 VALUES (1, 100000)" the supremum of the page which contains records if t1 for pk values 40000-90000 is S-locked. The lock must be released on commit, but commit for this transaction has not yet executed, so the supremum stays S-locked.

2) After some binlog events "INSERT INTO t1 VALUES (85000, NULL)" is executed and the supremum is tried to be X-locked. But it can't be X-locked because it has been already S-locked on step 1. As there was not commit for the transaction in (1), the supremum stays S-locked and after timeout the current transaction tries to X-lock the supremum again with the same result.

That's why slave thread is stopped.

The question is why the test is unstable. The answer is "because of purge". When some records are deleted from innodb page, they are marked as "deleted", but they still consume the space inside the page. "Purge" process deallocates this space. So, what's going on when the test passes successfully?

1) When "DELETE FROM t1 WHERE t1_pk IN (90000, 80000)" is executed the correspondent events inside of the page are marked as "deleted".

2) When slave thread executes "INSERT INTO t2 VALUES (1, 100000)" the supremum of the page which contains records of t1 for pk values 40000-90000 is S-locked.

3) After some binlog events "INSERT INTO t1 VALUES (85000, NULL)" is executed. As there was not "purge" on the page, there is no enough space in the page for the new record, the record is inserted in another page and the supremum of another space is X-locked successfully for this operation.

What was done to make the test stable(for 5.7.19)?

1) Wait until "DELETE FROM t1 WHERE t1_pk IN (90000, 80000)" is committed and deleted records are unlocked.

2) Wait until "purge" thread frees space from deleted records inside of the page.

3) Continue the test execution.

Despite my test(mysql-test/suite/rpl/t/lp-1735555-sync-points.test in the above branch) is stable for 5.7.19, I can't repeat the bug on 5.7.20.