Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC

Bug #1260713
Comment 7 for bug 1260713

Ed Fisher (ed-m) wrote on 2014-07-12:

We ran into this bug (or something similar) in production this morning. It brought the entire cluster down: as you'll see, two nodes shut themselves down and the third got a Signal 11 2-3 seconds later. There were no signs of performance problems on any of the nodes at the time, and the cluster had been stable for quite some time before this.

My only thought here is that Jira could be setting foreign_key_checks=0 before doing some internal maintenance, but that doesn't seem likely. At the time, Jira was configured to point to an haproxy listener that spread writes across all three servers. It no longer is.

Two nodes threw this same error, or very similar:

2014-07-12 07:40:30 13362 [ERROR] Slave SQL: Could not execute Delete_rows event on table jira.AO_E8B6CC_MESSAGE; Cannot delete or update a parent row: a foreign key constraint fails (`jira`.`AO_E8B6CC_MESSAGE_TAG`, CONSTRAINT `fk_ao_e8b6cc_message_tag_message_id` FOREIGN KEY (`MESSAGE_ID`) REFERENCES `AO_E8B6CC_MESSAGE` (`ID`)), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 541, Error_code: 1451
2014-07-12 07:40:30 13362 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 152, 115701779
2014-07-12 07:40:30 13362 [ERROR] WSREP: Failed to apply trx: source: ca04d333-fa74-11e3-94e2-7f5202c0e10c version: 3 local: 0 state: APPLYING flags: 1 conn_id: 24117450 trx_id: 77362570549 seqnos (l: 116300250, g: 115701779, s: 115701751, d: 115701766, ts: 2272383580846541)
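For anyone comparing these lines across nodes, here is a rough sketch of pulling the numeric fields out of the "Failed to apply trx" line so they can be matched against the other output. The function name parse_failed_trx is made up for illustration, and the regex only assumes the field layout shown in the line above. It does not interpret what each seqno field means.

```python
import re

# Log line copied verbatim from the applier failure above.
LINE = ("2014-07-12 07:40:30 13362 [ERROR] WSREP: Failed to apply trx: "
        "source: ca04d333-fa74-11e3-94e2-7f5202c0e10c version: 3 local: 0 "
        "state: APPLYING flags: 1 conn_id: 24117450 trx_id: 77362570549 "
        "seqnos (l: 116300250, g: 115701779, s: 115701751, d: 115701766, "
        "ts: 2272383580846541)")


def parse_failed_trx(line):
    """Return conn_id, trx_id and the seqnos fields as ints, or None if no match.

    Hypothetical helper: it assumes the exact field layout of the line above.
    """
    head = re.search(r"conn_id: (\d+) trx_id: (\d+) seqnos \(([^)]*)\)", line)
    if head is None:
        return None
    fields = {"conn_id": int(head.group(1)), "trx_id": int(head.group(2))}
    # The parenthesised part is a comma-separated list of "name: number" pairs.
    for key, value in re.findall(r"(\w+): (\d+)", head.group(3)):
        fields[key] = int(value)
    return fields


info = parse_failed_trx(LINE)
print(info["conn_id"], info["g"])
```

The g value (115701779) is the same number that appears in the "apply warning: 152, 115701779" line, and conn_id (24117450) is the same as the thread_id in the binary log excerpt further down.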

One gave me:
https://gist.github.com/gleamicus/820f7b195c049720ea71

Another gave me:
https://gist.github.com/gleamicus/4f56bc7938e6b872ebf1

The third node, however, saw things happen like this:
https://gist.github.com/gleamicus/87cccd694312ce1286c5

-----------

Here's the last delete from AO_E8B6CC_MESSAGE_TAG that appears in the binary logs before the crashes:

# at 279619602
#140712 7:40:28 server id 3298096439 end_log_pos 279619650 CRC32 0xf39f5ed5 GTID [commit=yes]
SET @@SESSION.GTID_NEXT= 'b45609b2-0839-ee1c-76b7-60c776c75110:115691438'/*!*/;
# at 279619650
#140712 7:40:28 server id 3298096439 end_log_pos 279619727 CRC32 0x6b3f9354 Query thread_id=24117450 exec_time=0 error_code=0
SET TIMESTAMP=1405168828/*!*/;
SET @@session.sql_mode=539099136/*!*/;
SET @@session.auto_increment_increment=3, @@session.auto_increment_offset=2/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=192/*!*/;
BEGIN
/*!*/;
# at 279619727
#140712 7:40:28 server id 3298096439 end_log_pos 279619795 CRC32 0x2af4292a Table_map: `jira`.`AO_E8B6CC_MESSAGE_TAG` mapped to number 169
# at 279619795
#140712 7:40:28 server id 3298096439 end_log_pos 279619854 CRC32 0x13b86571 Delete_rows: table id 169 flags: STMT_END_F
### DELETE FROM `jira`.`AO_E8B6CC_MESSAGE_TAG`
### WHERE
### @1=70097
### @2=32804
### @3='audit-id-47893'
# at 279619854
#140712 7:40:28 server id 3298096439 end_log_pos 279619885 CRC32 0x73cd7fe6 Xid = 115701629
COMMIT/*!*/;
# at 279619885
#140712 7:40:28 server id 3298096439 end_log_pos 279619933 CRC32 0xb2a3ecec GTID [commit=yes]
SET @@SESSION.GTID_NEXT= 'b45609b2-0839-ee1c-76b7-60c776c75110:115691439'/*!*/;

Table structures:

CREATE TABLE `AO_E8B6CC_MESSAGE_TAG` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `MESSAGE_ID` int(11) DEFAULT NULL,
  `TAG` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`ID`),
  KEY `index_ao_e8b6cc_mes1391090780` (`MESSAGE_ID`),
  CONSTRAINT `fk_ao_e8b6cc_message_tag_message_id` FOREIGN KEY (`MESSAGE_ID`) REFERENCES `AO_E8B6CC_MESSAGE` (`ID`)
) ENGINE=InnoDB

CREATE TABLE `AO_E8B6CC_MESSAGE` (
  `ADDRESS` varchar(255) NOT NULL,
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `PAYLOAD` longtext NOT NULL,
  `PAYLOAD_TYPE` varchar(255) NOT NULL,
  `PRIORITY` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB
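To show the ordering this constraint requires, here is a minimal sketch using an in-memory SQLite database as a stand-in. It is not PXC and does not reproduce the replication failure. The tables are cut down to the columns that matter, the row values come from the binlog excerpt above (assuming @1, @2, @3 map to ID, MESSAGE_ID, TAG in that order), and the ADDRESS value and the delete_message helper are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Simplified versions of the two Jira tables (assumption: extra columns omitted).
conn.execute(
    "CREATE TABLE AO_E8B6CC_MESSAGE (ID INTEGER PRIMARY KEY, ADDRESS TEXT NOT NULL)"
)
conn.execute(
    """CREATE TABLE AO_E8B6CC_MESSAGE_TAG (
           ID INTEGER PRIMARY KEY,
           MESSAGE_ID INTEGER REFERENCES AO_E8B6CC_MESSAGE (ID),
           TAG TEXT)"""
)

# Parent row, then the child row seen in the binlog delete.
conn.execute("INSERT INTO AO_E8B6CC_MESSAGE (ID, ADDRESS) VALUES (32804, 'example')")
conn.execute(
    "INSERT INTO AO_E8B6CC_MESSAGE_TAG VALUES (70097, 32804, 'audit-id-47893')"
)


def delete_message(message_id):
    """Try to delete a parent row; return False if the foreign key blocks it."""
    try:
        conn.execute("DELETE FROM AO_E8B6CC_MESSAGE WHERE ID = ?", (message_id,))
        return True
    except sqlite3.IntegrityError:
        return False


# Parent delete while the tag row still exists: rejected.
parent_first = delete_message(32804)
# Remove the tag row, then retry the parent delete: accepted.
conn.execute("DELETE FROM AO_E8B6CC_MESSAGE_TAG WHERE ID = 70097")
child_first = delete_message(32804)
print(parent_first, child_first)
```

The parent delete only succeeds once the referencing tag row is gone. That matches what the 1451 error implies: when the Delete_rows event on AO_E8B6CC_MESSAGE was applied on the failing nodes, a referencing AO_E8B6CC_MESSAGE_TAG row was still present there.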
