Bad network on one node takes a whole cluster down

Bug #1274192 reported by Przemek
Affects                        Status         Importance  Assigned to
Galera (2.x)                   Fix Committed  Undecided   Yan Zhang
Galera (3.x)                   Fix Committed  Undecided   Yan Zhang
Percona XtraDB Cluster (5.5)   Confirmed      Undecided   Unassigned
Percona XtraDB Cluster (5.6)   Fix Released   Undecided   Unassigned

Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC

Bug Description

I've set up a test cluster of 10 VM nodes, all running within the same VM host.
During the tests I was not putting any load on any node. All nodes run Percona XtraDB Cluster 5.5.34-55, wsrep_25.9.r3928, Galera 2.8(r165).
All evs.* variables are at their defaults. Error logs for both tests are attached.
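
(As a side note, the evs.* values actually in effect are part of the provider options string, so they can be double-checked with a query of this kind; shown here only as a hint, the full option string is long and not reproduced:)

mysql> SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options'\G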

When I introduce network latency and packet loss on *just one* of the nodes, it causes serious problems for the whole cluster, ranging from long periods of non-primary state on all nodes, through permanent non-primary state on all nodes, to some nodes being stuck in the false belief that the cluster is complete and operational while it is not.

This is how I introduced the bad network on the first node:

[root@percona1 ~]# tc qdisc change dev eth1 root netem loss 20% delay 150ms 20ms distribution normal
[root@percona1 ~]# tc qdisc show
qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc netem 8003: dev eth1 root refcnt 2 limit 1000 delay 150.0ms 20.0ms loss 20%
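
(For reference: tc qdisc change modifies an already existing netem qdisc, so a rule like this is assumed to have been created beforehand with tc qdisc add, as in the later test on percona10 below; the impairment can be removed again once the test is over. A sketch with the same parameters as above:)

[root@percona1 ~]# tc qdisc add dev eth1 root netem loss 20% delay 150ms 20ms distribution normal
[root@percona1 ~]# tc qdisc del dev eth1 root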

In the first test I simply try to restart another node while the first node has the unstable network:

Trying percona2 node restart at:
140129 16:10:02 [Note] /usr/sbin/mysqld: Normal shutdown

It failed to start again; see its error log.

A while later, view from node6:

LefredPXC / percona6 / Galera 2.8(r165)
Wsrep Cluster Node Queue Ops Bytes Flow Conflct PApply Commit
    time P cnf # cmt sta Up Dn Up Dn Up Dn pau snt lcf bfa dst oooe oool wind
16:10:35 P 38 10 Sync T/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:10:38 P 38 10 Sync T/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:10:42 P 38 10 Sync T/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:10:45 N 615 3 Init F/T 0 0 0 1 0 256 0.0 0 0 0 0 0 0 0
16:10:48 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:10:51 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:10:54 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:10:57 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:00 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:03 N 615 1 Init F/T 0 0 0 2 0 256 0.0 0 0 0 0 0 0 0
16:11:06 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:09 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:12 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:15 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:18 N 615 1 Init F/T 0 0 0 1 0 128 0.0 0 0 0 0 0 0 0
LefredPXC / percona6 / Galera 2.8(r165)
Wsrep Cluster Node Queue Ops Bytes Flow Conflct PApply Commit
    time P cnf # cmt sta Up Dn Up Dn Up Dn pau snt lcf bfa dst oooe oool wind
16:11:21 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:24 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:27 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:30 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:33 N 615 1 Init F/T 0 0 0 1 0 128 0.0 0 0 0 0 0 0 0
16:11:36 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:39 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:42 N 615 9 Init F/T 0 0 0 1 0 616 0.0 0 0 0 0 0 0 0
16:11:45 N 615 9 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:48 N 615 9 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:51 N 615 9 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:54 N 615 9 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:57 N 615 9 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:12:00 N 615 9 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:12:03 N 615 9 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
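
(The tabular view above is from a wsrep status monitoring script polling roughly every 3 seconds; a rough equivalent, as a sketch only and not the exact tool used here, is to repeatedly run a query such as:)

mysql> SHOW STATUS WHERE Variable_name IN
       ('wsrep_cluster_status','wsrep_cluster_conf_id','wsrep_cluster_size',
        'wsrep_local_state_comment','wsrep_ready');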

percona3 mysql> show status like 'ws%';
+----------------------------+-------------------------------------------------------------------------------------------------------------------+
| Variable_name | Value |
+----------------------------+-------------------------------------------------------------------------------------------------------------------+
| wsrep_local_state_uuid | eb4b0cbb-88ea-11e3-bcab-160cab62cdb7 |
| wsrep_protocol_version | 4 |
| wsrep_last_committed | 0 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_received | 11 |
| wsrep_received_bytes | 3900 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_avg | 0.000000 |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0.000000 |
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 0.000000 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 0.000000 |
| wsrep_local_state | 0 |
| wsrep_local_state_comment | Initialized |
| wsrep_cert_index_size | 0 |
| wsrep_causal_reads | 0 |
| wsrep_incoming_addresses | unspecified,unspecified,unspecified,unspecified,unspecified,unspecified,192.168.90.4:3306,unspecified,unspecified |
| wsrep_cluster_conf_id | 18446744073709551615 |
| wsrep_cluster_size | 9 |
| wsrep_cluster_state_uuid | eb4b0cbb-88ea-11e3-bcab-160cab62cdb7 |
| wsrep_cluster_status | non-Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 6 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <email address hidden> |
| wsrep_provider_version | 2.8(r165) |
| wsrep_ready | OFF |
+----------------------------+-------------------------------------------------------------------------------------------------------------------+
40 rows in set (0.01 sec)

After 10 minutes the cluster is still down, all nodes non-Primary:

percona1 :
wsrep_cluster_status non-Primary
Connection to 127.0.0.1 closed.
percona2 :
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
Connection to 127.0.0.1 closed.
percona3 :
wsrep_cluster_status non-Primary
Connection to 127.0.0.1 closed.
percona4 :
wsrep_cluster_status non-Primary
Connection to 127.0.0.1 closed.
percona5 :
wsrep_cluster_status non-Primary
Connection to 127.0.0.1 closed.
percona6 :
wsrep_cluster_status non-Primary
Connection to 127.0.0.1 closed.
percona7 :
wsrep_cluster_status non-Primary
Connection to 127.0.0.1 closed.
percona8 :
wsrep_cluster_status non-Primary
Connection to 127.0.0.1 closed.
percona9 :
wsrep_cluster_status non-Primary
Connection to 127.0.0.1 closed.
percona10 :
wsrep_cluster_status non-Primary
Connection to 127.0.0.1 closed.
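
(Output like the above can be gathered with a simple loop of this kind; a sketch only, assuming SSH access to each node and a local mysql client, and using the host names from this test:)

for h in percona{1..10}; do
  echo "$h :"
  ssh "$h" "mysql -N -e \"SHOW STATUS LIKE 'wsrep_cluster_status'\""
done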

I was able to bring the cluster back up with SET GLOBAL wsrep_provider_options='pc.bootstrap=yes' on the percona3 node (recapped briefly after this first test). When the cluster was Primary again, I could start the percona2 node just fine and it joined the cluster:

16:30:04 N 615 9 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:30:07 P 39 9 Sync T/T 0 0 0 1 0 643 0.0 0 0 0 0 0 0 0
(...)
16:31:01 P 39 9 Sync T/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:31:04 P 40 10 Sync T/T 0 0 0 1 0 707 0.0 0 0 0 0 0 0 0

but just a moment later, without any action or load:

16:31:53 P 40 10 Sync T/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:31:56 P 40 10 Sync T/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:31:59 P 40 10 Sync T/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:02 P 40 10 Sync T/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:05 P 40 10 Sync T/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:08 P 40 10 Sync T/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:11 N 615 3 Init F/T 0 0 0 2 0 512 0.0 0 0 0 0 0 0 0
16:32:14 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:17 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:20 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:23 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:26 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:29 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:32 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:35 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
LefredPXC / percona6 / Galera 2.8(r165)
Wsrep Cluster Node Queue Ops Bytes Flow Conflct PApply Commit
    time P cnf # cmt sta Up Dn Up Dn Up Dn pau snt lcf bfa dst oooe oool wind
16:32:38 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:41 N 615 1 Init F/T 0 0 0 2 0 256 0.0 0 0 0 0 0 0 0
16:32:44 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:47 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:50 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:53 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:56 N 615 1 Init F/T 0 0 0 1 0 128 0.0 0 0 0 0 0 0 0

The cluster stayed down in this state until it recovered by itself a few minutes later:

16:36:36 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:36:39 P 41 10 Sync T/T 0 0 0 1 0 707 0.0 0 0 0 0 0 0 0
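
(To recap the manual recovery used in this first test: on one chosen node, the following was enough to force a new Primary component. The SET GLOBAL line is the one quoted earlier; the status check afterwards is just an assumed way to confirm the node went back to Primary:)

percona3 mysql> SET GLOBAL wsrep_provider_options='pc.bootstrap=yes';
percona3 mysql> SHOW STATUS LIKE 'wsrep_cluster_status';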

In the second test I joined the percona1 node with the bad network already set. After a while the node joined, and the cluster went into an inconsistent state, where some nodes got stuck in Primary state with wsrep_cluster_size = 10, which was not true, as two nodes had gone down and another one was in non-Primary state. SQL queries were hanging anyway on the nodes that claimed Primary state.

Tags: i38760 i40464
Przemek (pmalkowski) wrote :

Logs for second test example

Przemek (pmalkowski)
tags: added: i38760
Przemek (pmalkowski) wrote :

I was also able to reproduce the problem on a smaller, 4-node cluster.
Again, only the 1st node had the bad network. No load traffic at all during the test. New logs are attached as logs3.tgz.
After running for some time, nodes 2, 3 and 4 eventually went down with status like this:

percona2 mysql> show status like 'ws%';
+----------------------------+--------------------------------------+
| Variable_name | Value |
+----------------------------+--------------------------------------+
| wsrep_local_state_uuid | eb4b0cbb-88ea-11e3-bcab-160cab62cdb7 |
| wsrep_protocol_version | 4 |
| wsrep_last_committed | 0 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_received | 44 |
| wsrep_received_bytes | 10425 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_avg | 0.000000 |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0.000000 |
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 0.000000 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 0.000000 |
| wsrep_local_state | 0 |
| wsrep_local_state_comment | Initialized |
| wsrep_cert_index_size | 0 |
| wsrep_causal_reads | 0 |
| wsrep_incoming_addresses | |
| wsrep_cluster_conf_id | 18446744073709551615 |
| wsrep_cluster_size | 0 |
| wsrep_cluster_state_uuid | eb4b0cbb-88ea-11e3-bcab-160cab62cdb7 |
| wsrep_cluster_status | non-Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 18446744073709551615 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <email address hidden> |
| wsrep_provider_v...


Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Looks like in #3, it went down with an exception

  }140129 21:10:16 [ERROR] WSREP: exception from gcomm, backend must be restarted:msg_state == local_state: a4edf89e-891d-11e3-995f-bb38ec056175 node 6b6ff9b8-891d-11e3-80c2-fe8c1e9f83f3 prim state message and local states not consistent: msg node prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,6b6ff9b8-891d-11e3-80c2-fe8c1e9f83f3,136),to_seq=141,weight=1 local state prim=1,un=1,last_seq=2,last_prim=view_id(PRIM,6b6ff9b8-891d-11e3-80c2-fe8c1e9f83f3,136),to_seq=141,weight=1 (FATAL)

This didn't happen with the earlier set of tests, right?

Przemek (pmalkowski) wrote :

Sorry, I forgot to add a comment here.
The cluster behaviour is not always consistent: with the same packet-loss parameters on a single node, the other nodes sometimes get stuck in non-Primary state, but sometimes exit with a gcomm exception.

tags: added: i40464
Muhammad Irfan (muhammad-irfan) wrote :

I reproduced this problem on a 4-node cluster, percona1-4, running PXC 5.5.37 with wsrep 2.10(r175).
I introduced bad network with loss/delay on percona1.

1) After some time the cluster started malfunctioning: all nodes went into wsrep_local_state_comment = Initialized and wsrep_cluster_status = non-Primary.

2) I made percona2 Primary and all the other nodes rejoined the cluster, while percona1 (the one with network issues) kept trying to connect, and eventually the entire cluster went down. wsrep stopped working and I had to issue kill -9 on all nodes to bring the cluster up again.

[root@percona2 ~]# mysql
mysql> show status like 'wsrep%';
+----------------------------+--------------------------------------+
| Variable_name | Value |
+----------------------------+--------------------------------------+
| wsrep_local_state_uuid | 9f581f39-eb03-11e3-8eb4-97664aaec97d |
| wsrep_protocol_version | 4 |
| wsrep_last_committed | 0 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_received | 151 |
| wsrep_received_bytes | 34204 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_avg | 0.000000 |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0.000000 |
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 0.000000 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 0.000000 |
| wsrep_local_state | 0 |
| wsrep_local_state_comment | Initialized |
| wsrep_cert_index_size | 0 |
| wsrep_causal_reads | 0 |
| wsrep_incoming_addresses | |
| wsrep_cluster_conf_id | 18446744073709551615 |
| wsrep_cluster_size | 0 |
| wsrep_cluster_state_uuid | 9f581f39-eb03-11e3-8eb4-97664aaec97d |
| wsrep_cluster_status | non-Primary |
| wsrep_connected | ON ...

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@Przemek, @Muhammad,

Is this reproducible with Galera 3.x? I tried and I am not able to reproduce it as described here.

Przemek (pmalkowski) wrote :

Yes, I can confirm this is still the case for Galera 3.6. This is my recent test on a 9-node cluster:

percona4 mysql> show status like 'wsrep_provider_version';
+------------------------+---------------+
| Variable_name | Value |
+------------------------+---------------+
| wsrep_provider_version | 3.6(r3a949e6) |
+------------------------+---------------+
1 row in set (0.00 sec)

percona4 mysql> show variables like 'vers%';
+-------------------------+---------------------------------------------------------------------------------------------------+
| Variable_name | Value |
+-------------------------+---------------------------------------------------------------------------------------------------+
| version | 5.6.19-67.0-56 |
| version_comment | Percona XtraDB Cluster (GPL), Release rel67.0, Revision 824, WSREP version 25.6, wsrep_25.6.r4111 |
| version_compile_machine | x86_64 |
| version_compile_os | Linux |
+-------------------------+---------------------------------------------------------------------------------------------------+
4 rows in set (0.00 sec)

I introduced a bad network link on one node only, this way:

[root@percona10 ~]# tc qdisc add dev eth1 root netem loss 55% delay 90ms 20ms distribution normal
[root@percona10 ~]# tc qdisc show
qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc netem 8001: dev eth1 root refcnt 2 limit 1000 delay 90.0ms 20.0ms loss 55%

I let the cluster run for about half an hour, and once it went down I stopped mysql on the culprit node, but that did not help the other nodes; some were stuck in this state:

percona7 mysql> show status like 'ws%';
+------------------------------+--------------------------------------+
| Variable_name | Value |
+------------------------------+--------------------------------------+
| wsrep_local_state_uuid | 50aced88-1bb6-11e4-881d-c288fff31cfc |
| wsrep_protocol_version | 6 |
| wsrep_last_committed | 216 |
| wsrep_replicated | 3 |
| wsrep_replicated_bytes | 712 |
| wsrep_repl_keys | 7 |
| wsrep_repl_keys_bytes | 125 |
| wsrep_repl_data_bytes | 395 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 118 |
| wsrep_received_bytes | 51544 |
| wsrep_local_commits | 1 |
| wsrep_local_cert_failures | 0 ...

Przemek (pmalkowski) wrote :

Error log from one of the nodes that failed with:
 }2014-08-05 01:18:17 14375 [ERROR] WSREP: exception from gcomm, backend must be restarted: 20ecb5d3Install message self state does not match, message state: prim=0,un=0,last_seq=1,last_prim=view_id(PRIM,0434364d,151),to_seq=647,weight=1,segment=0, local state: prim=0,un=1,last_seq=1,last_prim=view_id(PRIM,0434364d,151),to_seq=647,weight=1,segment=0 (FATAL)
  at gcomm/src/pc_proto.cpp:handle_install():1107

Yan Zhang (yan.zhang) wrote :

Bad network can make the membership change all the time, but it should not make any of the nodes fail. The patch at https://github.com/codership/galera/issues/92 fixes the bug (and has been merged into trunk).

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@Przemek,

Can you test with PXC 5.6.21 and Galera 3.8 in your setup?

Przemek (pmalkowski) wrote :

I tested with good results on:
percona2 mysql> select @@version,@@version_comment; show status like 'wsrep_provider_version';
+----------------+---------------------------------------------------------------------------------------------------+
| @@version | @@version_comment |
+----------------+---------------------------------------------------------------------------------------------------+
| 5.6.21-69.0-56 | Percona XtraDB Cluster (GPL), Release rel69.0, Revision 910, WSREP version 25.8, wsrep_25.8.r4126 |
+----------------+---------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

+------------------------+---------------+
| Variable_name | Value |
+------------------------+---------------+
| wsrep_provider_version | 3.8(r1dd46ba) |
+------------------------+---------------+
1 row in set (0.00 sec)

First I tested without the new options, i.e. with evs.version=0 and evs.auto_evict not set. When one of the nodes starts having high packet loss or high latency, the cluster still goes into non-Primary state, but after some time it recovers and later goes non-Primary again. So in general the cluster status is flapping, and while the broken/delayed node is in the cluster we can also observe huge commit delays.
However, I was not able to end up with any node hitting a gcomm exception and getting completely stuck like before. The wsrep_evs_delayed counter grows for the bad node; an example:

| wsrep_local_state_comment | Initialized
(...)
| wsrep_incoming_addresses | unspecified,unspecified,unspecified,unspecified,unspecified,192.168.90.11:3306
| wsrep_evs_delayed | fbebe800-59e2-11e4-85ec-7698aa6cc406:tcp://192.168.90.2:4567:255
| wsrep_evs_evict_list |
| wsrep_evs_repl_latency | 1.8112/1.8112/1.8112/0/1
| wsrep_evs_state | GATHER
(...)
| wsrep_cluster_status | non-Primary

So even without using the auto-eviction functionality, there is a much better chance that the cluster will auto-recover after an intermittent network problem.

After I enabled the new evs.version=1 and set evs.auto_evict=25 on all nodes, the cluster still had flapping problems because of the single bad node, but as soon as the wsrep_evs_delayed counter reached 25 for this node, it was evicted properly from the cluster, and no more problems were observed after that. The bad node's uuid appears in the wsrep_evs_evict_list:
| wsrep_evs_evict_list | 572af5eb-5dd2-11e4-8f67-4ed3860f88c4
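
(A minimal sketch of how the evs.version=1 / evs.auto_evict=25 settings above can be applied, assuming they are passed through wsrep_provider_options in my.cnf on every node; any provider options already set there would need to stay in the same string, and changing evs.version most likely needs a node restart to take effect:)

[mysqld]
wsrep_provider_options="evs.version=1;evs.auto_evict=25"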

In the error log on the bad node we can see:

2014-10-27 13:15:04 19941 [Note] WSREP: (572af5eb, 'tcp://0.0.0.0:4567') address 'tcp://192.168.90.2:4567' pointing to uuid 572af5eb is blacklisted, skipping
(...)
2014-10-27 13:15:06 19941 [Warning] WSREP: handshake with a292793c tcp://192.168.90.4:4567 failed: 'evicted'
2014-10-27 13:15:06 19941 [Warning] WSREP: handling gmcast protocol message failed: this node has been evicted out of the cluster, gcomm backend restart is required (FATAL)
         at gcomm/src/gmcast_proto.cpp:handle_failed():208
201...

Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PXC-1599
