3-node Debian 5.6 cluster crashes/freezes

Bug #1301616 reported by chris fortescue
This bug affects 1 person
Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC) - status tracked in 5.6

Series  Status        Importance  Assigned to
5.5     Invalid       Undecided   Unassigned
5.6     Fix Released  Undecided   Unassigned

Bug Description

Hiya,
3 nodes; new setup; latest packages:

percona-toolkit 2.2.7
percona-xtrabackup 2.1.8-733-1.wheezy
percona-xtradb-cluster-client-5.6 5.6.15-25.5-759.wheezy
percona-xtradb-cluster-common-5.6 5.6.15-25.5-759.wheezy
percona-xtradb-cluster-galera-3.x 213.wheezy
percona-xtradb-cluster-garbd-3.x 213.wheezy
percona-xtradb-cluster-server-5.6 5.6.15-25.5-759.wheezy
percona-xtradb-cluster-test-5.6 5.6.15-25.5-759.wheezy

node1 - started with /etc/init.d/mysql bootstrap-pxc
node2 - mysqld crashed completely and couldn't be restarted
node3 - mysqld likewise crashed and couldn't be restarted
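For reference, the startup sequence above can be sketched with the Debian init scripts shipped with PXC 5.6 (a sketch, assuming the stock packaging; the final status check is just a sanity test):

```shell
# On node1 only: bootstrap a new cluster. bootstrap-pxc starts the node
# as a new primary component regardless of wsrep_cluster_address.
/etc/init.d/mysql bootstrap-pxc

# On node2 and node3: normal start; each joins the cluster and receives
# an SST (state snapshot transfer) from the donor.
/etc/init.d/mysql start

# On any node: verify that all three members have joined.
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"
```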

I've included the log from node3.
I loaded 1 million rows of data into each node (1..3) concurrently without a problem, so the cluster appeared to be working. We also restored an 80G mysql dump, which appeared to work, but now I wonder...

Once I noticed it had hung (I'm testing it out), I resorted to kill -9 on the mysqld process on node1 (the one started with bootstrap-pxc), since node2 and node3 were both down and couldn't be restarted, and I couldn't connect to node1 with the mysql CLI. After restarting node1, then node2/3, everything seems OK, but...

I hope you can fix this, because it is a bad crash and doesn't instill confidence. Let me know if there's anything else I can provide for forensics.

Below are the cnf files from the bootstrap node and node3. Strangely, no log was emitted on the bootstrap node1 or node2, but node3 shows a bad exception being thrown (apparently).

-Chris

>>>> Node1.cnf <<<<

[mysqld]
datadir=/var/lib/mysql
user=mysql
# Path to Galera library
wsrep_provider=/usr/lib/libgalera_smm.so
# Cluster connection URL contains the IPs of node#1, node#2 and node#3

wsrep_cluster_address=gcomm://10.66.2.51,10.66.2.52,10.66.2.53
#wsrep_cluster_address=gcomm://

# In order for Galera to work correctly binlog format should be ROW
binlog_format=ROW
# MyISAM storage engine has only experimental support
default_storage_engine=InnoDB
# This changes how InnoDB autoincrement locks are managed and is a requirement for Galera
innodb_autoinc_lock_mode=2
# Node #1 address
wsrep_node_address=10.66.2.51
# SST method
wsrep_sst_method=xtrabackup-v2
# Cluster name
wsrep_cluster_name=my_clf_cluster
# Authentication for SST method
wsrep_sst_auth="sstuser:s3cret"

>>>> Node3.cnf <<<<

[mysqld]
datadir=/var/lib/mysql
user=mysql
# Path to Galera library
wsrep_provider=/usr/lib/libgalera_smm.so
# Cluster connection URL contains the IPs of node#1, node#2 and node#3
wsrep_cluster_address=gcomm://10.66.2.51,10.66.2.52,10.66.2.53
# In order for Galera to work correctly binlog format should be ROW
binlog_format=ROW
# MyISAM storage engine has only experimental support
default_storage_engine=InnoDB
# This changes how InnoDB autoincrement locks are managed and is a requirement for Galera
innodb_autoinc_lock_mode=2
# Node #3 address
wsrep_node_address=10.66.2.53
# SST method
wsrep_sst_method=xtrabackup-v2
# Cluster name
wsrep_cluster_name=my_clf_cluster
# Authentication for SST method
wsrep_sst_auth="sstuser:s3cret"
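One setup step the configs above imply but don't show: the credentials in wsrep_sst_auth must match a MySQL account on the donor node with the privileges the xtrabackup SST script needs. A minimal sketch, using the user name and password from the config above (the exact grant list is an assumption based on the PXC 5.6 documentation of that era):

```shell
# Create the SST user referenced by wsrep_sst_auth on each node.
mysql -u root -p <<'SQL'
CREATE USER 'sstuser'@'localhost' IDENTIFIED BY 's3cret';
GRANT RELOAD, LOCK TABLES, REPLICATION CLIENT ON *.* TO 'sstuser'@'localhost';
FLUSH PRIVILEGES;
SQL
```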

Revision history for this message
chris fortescue (cfortescu) wrote :

I doubled the memory on the cluster to 2GB per node and it happened again. This time, it brought down all 3 nodes.

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

"bad prefix" - looks like a gcache corruption fixed in Galera 3.5

Revision history for this message
chris fortescue (cfortescu) wrote :

OK, you say it's 'fixed', and I see a standalone package, galera-25.3.5-amd64.deb, that explicitly conflicts with the cluster versions (below). I did an apt-get update, but there was no update. Isn't this fix critical to anyone running a cluster? I must be missing something and humbly ask what it is.

Here's what I have as of 4/23/2014, 8:51am PDT:

ii percona-toolkit 2.2.7 all Advanced MySQL and system command-line tools
ii percona-xtrabackup 2.1.8-733-1.wheezy amd64 Open source backup tool for InnoDB and XtraDB
ii percona-xtradb-cluster-client-5.6 5.6.15-25.5-759.wheezy amd64 Percona Server database client binaries
ii percona-xtradb-cluster-common-5.6 5.6.15-25.5-759.wheezy amd64 Percona Server database common files (e.g. /etc/mysql/my.cnf)
ii percona-xtradb-cluster-galera-3.x 213.wheezy amd64 Galera components of Percona XtraDB Cluster
ii percona-xtradb-cluster-garbd-3.x 213.wheezy amd64 Garbd components of Percona XtraDB Cluster
ii percona-xtradb-cluster-server-5.6 5.6.15-25.5-759.wheezy amd64 Percona Server database server binaries
ii percona-xtradb-cluster-test-5.6 5.6.15-25.5-759.wheezy amd64 Percona Server database test suite

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@Chris,

Its status is Fix Committed, not yet Fix Released.

However, you can get it from TESTING:

http://www.percona.com/downloads/TESTING/Percona-XtraDB-Cluster-galera-56/galera-3.x/215/deb/

It contains all the fixes from https://launchpad.net/percona-xtradb-cluster/+milestone/galera-3.5.
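Applying the updated galera package can be sketched as a rolling upgrade, one node at a time, so the cluster stays up (a sketch; the package filename is hypothetical, taken from the TESTING build number above):

```shell
# Install the updated Galera library on this node (hypothetical filename).
dpkg -i percona-xtradb-cluster-galera-3.x_215.wheezy_amd64.deb

# Restart mysqld so it loads the new libgalera_smm.so.
/etc/init.d/mysql restart

# Wait for this node to report 'Synced' before upgrading the next one.
mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"
```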

Revision history for this message
chris fortescue (cfortescu) wrote :

@raghavendra

That did the trick! Ran an 80G restore against a 3-node cluster.

Thanks a million!

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1661
