Bad network on one node takes a whole cluster down
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Galera | Status tracked in 3.x | | |
Galera 2.x | Fix Committed | Undecided | Yan Zhang |
Galera 3.x | Fix Committed | Undecided | Yan Zhang |
Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC) | Status tracked in 5.6 | | |
PXC 5.5 | Confirmed | Undecided | Unassigned |
PXC 5.6 | Fix Released | Undecided | Unassigned |
Bug Description
I've made a test cluster of 10 VM nodes, all running within the same VM host.
During the tests I was not putting any load on any node. All nodes run on Percona XtraDB Cluster 5.5.34-55, wsrep_25.9.r3928, Galera 2.8(r165).
All evs.* variables are at their defaults. Error logs for both tests are attached.
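For reference, the evs.* values actually in effect can be read back from the wsrep_provider_options string; a minimal check (assuming a local mysql client on the node), for example:

# Dump only the evs.* entries from the semicolon-separated provider options string
mysql -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options'\G" | tr ';' '\n' | grep 'evs\.'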
When I introduce network latency and packet loss on *just one* of the nodes, it causes serious problems for the whole cluster: long periods of non-primary state on all nodes, sometimes a permanent non-primary state on all nodes, and sometimes nodes stuck in the false belief that the cluster is complete and operational when it is not.
This is how I introduced the bad network on the first node:
[root@percona1 ~]# tc qdisc change dev eth1 root netem loss 20% delay 150ms 20ms distribution normal
[root@percona1 ~]# tc qdisc show
qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc netem 8003: dev eth1 root refcnt 2 limit 1000 delay 150.0ms 20.0ms loss 20%
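To reproduce (or to clean up after the test), the same impairment can be added and removed like this - assuming eth1 is the interface carrying the cluster traffic, as in the output above; note the 'change' form used earlier requires a netem qdisc to already be attached:

# Attach the impairment if no netem qdisc is installed on eth1 yet
tc qdisc add dev eth1 root netem loss 20% delay 150ms 20ms distribution normal
# Remove it to restore normal networking on the node
tc qdisc del dev eth1 root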
In the first test I simply try to restart another node while the first node has an unstable network:
Trying percona2 node restart at:
140129 16:10:02 [Note] /usr/sbin/mysqld: Normal shutdown
It failed to start again - see its error log.
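Nothing unusual was done for the restart - just an ordinary service restart on the node, e.g. (assuming the stock init script shipped with this PXC 5.5 install):

service mysql restart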
A while later, the view from node6:
LefredPXC / percona6 / Galera 2.8(r165)
Wsrep Cluster Node Queue Ops Bytes Flow Conflct PApply Commit
time P cnf # cmt sta Up Dn Up Dn Up Dn pau snt lcf bfa dst oooe oool wind
16:10:35 P 38 10 Sync T/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:10:38 P 38 10 Sync T/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:10:42 P 38 10 Sync T/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:10:45 N 615 3 Init F/T 0 0 0 1 0 256 0.0 0 0 0 0 0 0 0
16:10:48 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:10:51 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:10:54 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:10:57 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:00 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:03 N 615 1 Init F/T 0 0 0 2 0 256 0.0 0 0 0 0 0 0 0
16:11:06 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:09 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:12 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:15 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:18 N 615 1 Init F/T 0 0 0 1 0 128 0.0 0 0 0 0 0 0 0
LefredPXC / percona6 / Galera 2.8(r165)
Wsrep Cluster Node Queue Ops Bytes Flow Conflct PApply Commit
time P cnf # cmt sta Up Dn Up Dn Up Dn pau snt lcf bfa dst oooe oool wind
16:11:21 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:24 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:27 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:30 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:33 N 615 1 Init F/T 0 0 0 1 0 128 0.0 0 0 0 0 0 0 0
16:11:36 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:39 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:42 N 615 9 Init F/T 0 0 0 1 0 616 0.0 0 0 0 0 0 0 0
16:11:45 N 615 9 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:48 N 615 9 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:51 N 615 9 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:54 N 615 9 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:11:57 N 615 9 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:12:00 N 615 9 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:12:03 N 615 9 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
percona3 mysql> show status like 'ws%';
+------
| Variable_name | Value |
+------
| wsrep_local_
| wsrep_protocol_
| wsrep_last_
| wsrep_replicated | 0 |
| wsrep_replicate
| wsrep_received | 11 |
| wsrep_received_
| wsrep_local_commits | 0 |
| wsrep_local_
| wsrep_local_replays | 0 |
| wsrep_local_
| wsrep_local_
| wsrep_local_
| wsrep_local_
| wsrep_flow_
| wsrep_flow_
| wsrep_flow_
| wsrep_cert_
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 0.000000 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 0.000000 |
| wsrep_local_state | 0 |
| wsrep_local_
| wsrep_cert_
| wsrep_causal_reads | 0 |
| wsrep_incoming_
| wsrep_cluster_
| wsrep_cluster_size | 9 |
| wsrep_cluster_
| wsrep_cluster_
| wsrep_connected | ON |
| wsrep_local_
| wsrep_local_index | 6 |
| wsrep_provider_name | Galera |
| wsrep_provider_
| wsrep_provider_
| wsrep_ready | OFF |
+------
40 rows in set (0.01 sec)
After 10 minutes the cluster is still down, all nodes non-primary:
percona1 :
wsrep_cluster_
Connection to 127.0.0.1 closed.
percona2 :
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/
Connection to 127.0.0.1 closed.
percona3 :
wsrep_cluster_
Connection to 127.0.0.1 closed.
percona4 :
wsrep_cluster_
Connection to 127.0.0.1 closed.
percona5 :
wsrep_cluster_
Connection to 127.0.0.1 closed.
percona6 :
wsrep_cluster_
Connection to 127.0.0.1 closed.
percona7 :
wsrep_cluster_
Connection to 127.0.0.1 closed.
percona8 :
wsrep_cluster_
Connection to 127.0.0.1 closed.
percona9 :
wsrep_cluster_
Connection to 127.0.0.1 closed.
percona10 :
wsrep_cluster_
Connection to 127.0.0.1 closed.
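The per-node snapshot above comes from a simple loop over all the nodes, roughly like the following sketch (assuming ssh access to percona1..percona10 and a local mysql client on each node):

# Ask every node for its own view of the cluster; "Primary" is the healthy value
for host in percona{1..10}; do
    echo "$host :"
    ssh "$host" "mysql -h 127.0.0.1 -NBe \"SHOW STATUS LIKE 'wsrep_cluster_status'\"" || echo " (mysqld not reachable)"
done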
I was able to bring the cluster back up with SET GLOBAL wsrep_provider_
16:30:04 N 615 9 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:30:07 P 39 9 Sync T/T 0 0 0 1 0 643 0.0 0 0 0 0 0 0 0
(...)
16:31:01 P 39 9 Sync T/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:31:04 P 40 10 Sync T/T 0 0 0 1 0 707 0.0 0 0 0 0 0 0 0
but just a moment later, without any action or load:
16:31:53 P 40 10 Sync T/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:31:56 P 40 10 Sync T/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:31:59 P 40 10 Sync T/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:02 P 40 10 Sync T/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:05 P 40 10 Sync T/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:08 P 40 10 Sync T/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:11 N 615 3 Init F/T 0 0 0 2 0 512 0.0 0 0 0 0 0 0 0
16:32:14 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:17 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:20 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:23 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:26 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:29 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:32 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:35 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
LefredPXC / percona6 / Galera 2.8(r165)
Wsrep Cluster Node Queue Ops Bytes Flow Conflct PApply Commit
time P cnf # cmt sta Up Dn Up Dn Up Dn pau snt lcf bfa dst oooe oool wind
16:32:38 N 615 3 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:41 N 615 1 Init F/T 0 0 0 2 0 256 0.0 0 0 0 0 0 0 0
16:32:44 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:47 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:50 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:53 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:32:56 N 615 1 Init F/T 0 0 0 1 0 128 0.0 0 0 0 0 0 0 0
The cluster stayed down in this state until it self-recovered a few minutes later:
16:36:36 N 615 1 Init F/T 0 0 0 0 0 0 0.0 0 0 0 0 0 0 0
16:36:39 P 41 10 Sync T/T 0 0 0 1 0 707 0.0 0 0 0 0 0 0 0
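For reference, the usual way to force a stuck non-primary component back to primary is to bootstrap it from one of the surviving nodes - a sketch, assuming the truncated SET GLOBAL wsrep_provider_ command above was the standard pc.bootstrap override:

# Run on exactly one node of the non-primary component; the others will rejoin it
mysql -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=true'"
# Confirm the component is primary again
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status'"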
In the second test I simply joined the percona1 node with the bad network already in place. After a while the node joined, and the cluster went into an inconsistent state: some nodes got stuck reporting Primary state and wsrep_cluster_size = 10, even though this was no longer true, as 2 nodes had gone down and another was in non-primary state. SQL queries hung anyway on the nodes that claimed Primary state.
Logs for the second test are attached.