I tested with good results on:
percona2 mysql> select @@version,@@version_comment; show status like 'wsrep_provider_version';
+----------------+---------------------------------------------------------------------------------------------------+
| @@version      | @@version_comment                                                                                  |
+----------------+---------------------------------------------------------------------------------------------------+
| 5.6.21-69.0-56 | Percona XtraDB Cluster (GPL), Release rel69.0, Revision 910, WSREP version 25.8, wsrep_25.8.r4126 |
+----------------+---------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
+------------------------+---------------+
| Variable_name          | Value         |
+------------------------+---------------+
| wsrep_provider_version | 3.8(r1dd46ba) |
+------------------------+---------------+
1 row in set (0.00 sec)
First, I tested without the new options, i.e. with evs.version=0 and evs.auto_evict not set. When one of the nodes starts suffering high packet loss or high latency, the cluster still goes into non-Primary state, but after some time it recovers, only to go non-Primary again later. So in general the cluster status keeps flapping, and while the broken/delayed node is in the cluster we can observe huge commit delays.
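As a side note, this kind of bad node can be simulated with tc/netem. A minimal sketch; the interface name and the loss/delay values here are only assumptions to adjust for your setup:

# on the node to be degraded (eth0 is a placeholder)
tc qdisc add dev eth0 root netem loss 30% delay 500ms
# restore normal networking afterwards
tc qdisc del dev eth0 root netem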
However, I was not able to end up with any node hitting an exception in gcomm and getting completely stuck like before. The wsrep_evs_delayed counter grows for the bad node; an example:
| wsrep_local_state_comment | Initialized
(...)
| wsrep_incoming_addresses | unspecified,unspecified,unspecified,unspecified,unspecified,192.168.90.11:3306
| wsrep_evs_delayed | fbebe800-59e2-11e4-85ec-7698aa6cc406:tcp://192.168.90.2:4567:255
| wsrep_evs_evict_list |
| wsrep_evs_repl_latency | 1.8112/1.8112/1.8112/0/1
| wsrep_evs_state | GATHER
(...)
| wsrep_cluster_status | non-Primary
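To watch just these counters, a single wildcard query works; a sketch using the same SHOW STATUS approach as above (the wsrep_evs_% pattern matches the four EVS variables shown):

mysql> show global status like 'wsrep_evs_%';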
So even without the auto-eviction functionality, there is a much better chance that a cluster will auto-recover after an intermittent network problem.
After I enabled the new evs.version=1 and set evs.auto_evict=25 on all nodes, the cluster still had flapping problems because of the single bad node, but as soon as the wsrep_evs_delayed counter reached 25 for this node, it was properly evicted from the cluster, and since then no more problems were observed. The bad node's UUID appears in the wsrep_evs_evict_list:
| wsrep_evs_evict_list | 572af5eb-5dd2-11e4-8f67-4ed3860f88c4
and in the error log on the bad node we can see:
2014-10-27 13:15:04 19941 [Note] WSREP: (572af5eb, 'tcp://0.0.0.0:4567') address 'tcp://192.168.90.2:4567' pointing to uuid 572af5eb is blacklisted, skipping
(...)
2014-10-27 13:15:06 19941 [Warning] WSREP: handshake with a292793c tcp://192.168.90.4:4567 failed: 'evicted'
2014-10-27 13:15:06 19941 [Warning] WSREP: handling gmcast protocol message failed: this node has been evicted out of the cluster, gcomm backend restart is required (FATAL)
at gcomm/src/gmcast_proto.cpp:handle_failed():208
2014-10-27 13:15:06 19941 [ERROR] WSREP: exception from gcomm, backend must be restarted: this node has been evicted out of the cluster, gcomm backend restart is required (FATAL)
at gcomm/src/gmcast_proto.cpp:handle_failed():208
2014-10-27 13:15:06 19941 [Note] WSREP: gcomm: terminating thread
2014-10-27 13:15:06 19941 [Note] WSREP: gcomm: joining thread
2014-10-27 13:15:06 19941 [Note] WSREP: gcomm: closing backend
2014-10-27 13:15:06 19941 [Note] WSREP: Forced PC close
2014-10-27 13:15:06 19941 [Note] WSREP: gcomm: closed
(...)
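To confirm the eviction from one of the surviving nodes, the same status variables can be checked again; a sketch:

mysql> show global status like 'wsrep_evs_evict_list';
mysql> show global status like 'wsrep_cluster_size';

The evicted node's UUID should show up in the first, and the cluster size should be reduced accordingly.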
So the eviction function seems to work as expected. I have some comments, though:
* All the nodes should have evs.version=1 and evs.auto_evict set (see the configuration sketch after this list); in my test, when only half of the nodes had them, the bad node was not entirely evicted and the cluster ended up in an endless non-Primary state.
* A normal, clean node restart can increase the wsrep_evs_delayed counter by 1, so beware of setting evs.auto_evict to very low values.
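For reference, a minimal my.cnf sketch of the settings used above, to be applied on every node. Note that wsrep_provider_options is a single semicolon-separated string, so these values have to be merged with any provider options you already use:

[mysqld]
wsrep_provider_options="evs.version=1;evs.auto_evict=25"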