Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC

RBR error on IST not zeroing grastate

Bug #1180791 reported by Jay Janssen on 2013-05-16

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Galera	Status tracked in 3.x
2.x	Fix Committed	High	Yan Zhang	Galera 25.2.10
3.x	Fix Released	High	Yan Zhang	Galera 25.3.9
Percona XtraDB Cluster	Status tracked in 5.6
5.5	Fix Released	Undecided	Unassigned	Percona XtraDB Cluster 5.5.37-25.10
5.6	Fix Released	Undecided	Unassigned	Percona XtraDB Cluster 5.6.15-25.2

Bug Description

130516 10:02:30 [Note] WSREP: SST received: f9ae5241-be23-11e2-0800-9610321e6dbf:43045
130516 10:02:30 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.30' socket: '/var/lib/mysql/mysql.sock' port: 3306 Percona XtraDB Cluster (GPL), wsrep_23.7.4.r3843
130516 10:02:30 [Note] WSREP: Receiving IST: 24484 writesets, seqnos 43045-67529
130516 10:02:30 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.sbtest1; Can't find record in 'sbtest1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 1193, Error_code: 1032
130516 10:02:30 [Warning] WSREP: RBR event 6 Delete_rows apply warning: 120, 43046
130516 10:02:30 [ERROR] WSREP: receiving IST failed, node restart required: Failed to apply app buffer: seqno: 43046, status: WSREP_FATAL
at galera/src/replicator_smm.cpp:apply_wscoll():52
at galera/src/replicator_smm.cpp:apply_trx_ws():118

I was able to get a node stuck in this state where it continued to retry IST on every restart and got this error. The grastate.dat was not getting zeroed appropriately in this case.
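
For anyone checking whether a node is in this state, here is a minimal sketch of a "was grastate.dat zeroed?" check. It assumes the file is made of `key: value` lines with `uuid` and `seqno` fields, and it assumes that "zeroed" means an all-zero UUID together with seqno -1. The names `SavedState`, `parse_grastate` and `is_zeroed` are hypothetical helpers written for this illustration, not Galera code.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical holder for the two fields of interest (not a Galera type).
struct SavedState {
    std::string uuid;
    long long   seqno     = 0;
    bool        has_uuid  = false;
    bool        has_seqno = false;
};

// Strip leading and trailing spaces, tabs and carriage returns.
static std::string trim(const std::string& s) {
    const std::string ws = " \t\r";
    const std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return std::string();
    const std::size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Parse "key: value" lines; lines starting with '#' are skipped.
// Assumed layout only; unknown keys are ignored.
SavedState parse_grastate(std::istream& in) {
    SavedState st;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;
        const std::size_t colon = line.find(':');
        if (colon == std::string::npos) continue;
        const std::string key = trim(line.substr(0, colon));
        const std::string val = trim(line.substr(colon + 1));
        if (key == "uuid") {
            st.uuid = val;
            st.has_uuid = true;
        } else if (key == "seqno") {
            try {
                st.seqno = std::stoll(val);
                st.has_seqno = true;
            } catch (const std::exception&) {
                // Malformed number: treat the field as missing.
            }
        }
    }
    return st;
}

// Assumption: a zeroed state is an all-zero UUID with seqno -1.
bool is_zeroed(const SavedState& st) {
    return st.has_uuid && st.has_seqno &&
           st.uuid == "00000000-0000-0000-0000-000000000000" &&
           st.seqno == -1;
}

// Small self-check using the UUID and seqno from the log above.
void example() {
    std::istringstream stuck(
        "# GALERA saved state\n"
        "uuid:    f9ae5241-be23-11e2-0800-9610321e6dbf\n"
        "seqno:   43045\n");
    assert(!is_zeroed(parse_grastate(stuck)));

    std::istringstream zeroed(
        "uuid:    00000000-0000-0000-0000-000000000000\n"
        "seqno:   -1\n");
    assert(is_zeroed(parse_grastate(zeroed)));
}
```

In the stuck case described here, the file would still carry the old UUID and position after the failed IST, so a check like this would report it as not zeroed.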

[root@perconadbt mysql]# rpm -qa | grep -i percona
percona-release-0.0-1.x86_64
Percona-XtraDB-Cluster-server-5.5.30-23.7.4.406.rhel6.x86_64
Percona-XtraDB-Cluster-client-5.5.30-23.7.4.406.rhel6.x86_64
percona-xtrabackup-2.0.7-552.rhel6.x86_64
Percona-XtraDB-Cluster-galera-2.5-1.150.rhel6.x86_64
Percona-XtraDB-Cluster-shared-5.5.30-23.7.4.406.rhel6.x86_64

Related branches

lp://staging/~percona-dev/percona-xtradb-cluster/galera-3.x

lp://staging/~percona-dev/percona-xtradb-cluster/galera-25

Alex Yurchenko (ayurchen) wrote on 2013-05-16:

This seems to be a Galera bug: grastate invalidation code does not cover all code paths.

Changed in galera:
assignee:	nobody → Alex Yurchenko (ayurchen)
importance:	Undecided → High
milestone:	none → 24.2.5
status:	New → Confirmed

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-05-16:

Yes, it looks like the exception handler in ReplicatorSMM::recv_IST could call mark_unsafe in addition to gu_abort, or st_.mark_safe could be called only after IST is fully complete.

Changed in percona-xtradb-cluster:
milestone:	none → 5.5.31-25

Raghavendra D Prabhu (raghavendra-prabhu) on 2013-05-29

Changed in percona-xtradb-cluster:
milestone:	5.5.31-25 → 5.5.31-24.8

Raghavendra D Prabhu (raghavendra-prabhu) on 2013-06-19

Changed in percona-xtradb-cluster:
milestone:	5.5.31-23.7.5 → 5.5.31-25

Alex Yurchenko (ayurchen) on 2013-06-28

Changed in galera:
milestone:	23.2.6 → 23.2.7

Raghavendra D Prabhu (raghavendra-prabhu) on 2013-09-16

Changed in percona-xtradb-cluster:
milestone:	5.5.33-23.7.6 → future-5.5

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-12-15:

Tested with:

=== modified file 'galera/src/replicator_str.cpp'
--- galera/src/replicator_str.cpp 2013-11-02 17:21:57 +0000
+++ galera/src/replicator_str.cpp 2013-12-15 10:57:57 +0000
@@ -766,6 +766,7 @@
     {
         log_fatal << "receiving IST failed, node restart required: "
                   << e.what();
+        st_.mark_corrupt();
         gcs_.close();
         gu_abort();
     }

and it zeroed the grastate correctly on IST error.

However, as the error message suggests, other exceptions can reach this handler that a
node restart may fix, network issues for instance.

So it is better to mark the state corrupt closer to where the apply failure happens:

=== modified file 'galera/src/replicator_str.cpp'
--- galera/src/replicator_str.cpp 2013-11-02 17:21:57 +0000
+++ galera/src/replicator_str.cpp 2013-12-15 11:46:20 +0000
@@ -752,7 +752,15 @@
                     // processed on donor, just adjust states here
                     trx->set_state(TrxHandle::S_REPLICATING);
                     trx->set_state(TrxHandle::S_CERTIFYING);
-                    apply_trx(recv_ctx, trx);
+                    try
+                    {
+                        apply_trx(recv_ctx, trx);
+                    }
+                    catch (gu::Exception& e)
+                    {
+                        st_.mark_corrupt();
+                        throw;
+                    }
                 }
             }
             else

Yan Zhang (yan.zhang) wrote on 2014-06-19:

@raghu

I don't understand the second patch. If ```apply_trx``` raises gu::Exception, the exception will be caught by the outer try-catch clause, which marks the state file corrupt immediately (that's your first patch).
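
To make the control flow under discussion concrete, here is a toy model of the two patches. It is only a sketch: `ToyState`, `ApplyError`, `NetworkError`, `apply_one` and `receive` are made-up stand-ins, not Galera types, and the split between "apply" and "network" failures is an assumption taken from the comment above about restart-recoverable errors.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// Toy stand-in for the saved-state object; counts how often it is marked.
struct ToyState {
    int corrupt_marks = 0;
    void mark_corrupt() { ++corrupt_marks; }
};

// Hypothetical error types used only to tell the two failure kinds apart.
struct ApplyError : std::runtime_error {
    using std::runtime_error::runtime_error;
};
struct NetworkError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

enum class Failure { None, Apply, Network };

// Inner step, shaped like the second patch: mark (optionally), then rethrow.
void apply_one(ToyState& st, Failure f, bool inner_marks) {
    try {
        if (f == Failure::Apply) throw ApplyError("failed to apply writeset");
    } catch (const ApplyError&) {
        if (inner_marks) st.mark_corrupt();
        throw;  // propagate to the outer handler
    }
}

// Outer handler, shaped like the first patch when outer_marks is true.
// Returns true if receiving failed (the real code aborts at this point).
bool receive(ToyState& st, Failure f, bool inner_marks, bool outer_marks) {
    try {
        if (f == Failure::Network) throw NetworkError("connection lost");
        apply_one(st, f, inner_marks);
        return false;
    } catch (const std::exception&) {
        if (outer_marks) st.mark_corrupt();
        return true;
    }
}
```

In this model, with both patches enabled an apply failure marks the state twice, so the inner handler adds nothing. With only the inner handler, an apply failure still marks the state while a network failure does not, which is the distinction the second patch appears to be aiming for when it is used instead of the first one.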

Yan Zhang (yan.zhang) wrote on 2014-07-14:

links to: https://github.com/codership/galera/issues/78

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2015-01-26:

Our fix has been reverted in favor of the fix in issue 78 (https://github.com/codership/galera/issues/78), since that one covers more ground.

Shahriyar Rzayev (rzayev-sehriyar) wrote on 2018-01-18:

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1348
