I tested with good results on:
percona2 mysql> select @@version,@@version_comment; show status like 'wsrep_provider_version';
+----------------+---------------------------------------------------------------------------------------------------+
| @@version      | @@version_comment                                                                                  |
+----------------+---------------------------------------------------------------------------------------------------+
| 5.6.21-69.0-56 | Percona XtraDB Cluster (GPL), Release rel69.0, Revision 910, WSREP version 25.8, wsrep_25.8.r4126 |
+----------------+---------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
+------------------------+---------------+
| Variable_name          | Value         |
+------------------------+---------------+
| wsrep_provider_version | 3.8(r1dd46ba) |
+------------------------+---------------+
1 row in set (0.00 sec)
First, I tested without the new options, i.e. with evs.version=0 and evs.auto_evict not set. When one of the nodes starts suffering high packet loss or high latency, the cluster still goes into non-Primary state, but after some time it recovers, only to go non-Primary again later. So in general the cluster status keeps flapping, and while the broken/delayed node is in the cluster we can observe huge commit delays.
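As a side note, this kind of bad node can be simulated with tc/netem. A minimal sketch; the interface name and the loss/delay values here are only assumptions to adjust for your setup:

# on the node to be degraded (eth0 is a placeholder)
tc qdisc add dev eth0 root netem loss 30% delay 500ms
# restore normal networking afterwards
tc qdisc del dev eth0 root netem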
However, I was not able to end up with any node hitting an exception in gcomm and getting completely stuck like before. The wsrep_evs_delayed counter grows for the bad node; an example:
| wsrep_local_state_comment | Initialized
(...)
| wsrep_incoming_addresses | unspecified,unspecified,unspecified,unspecified,unspecified,192.168.90.11:3306
| wsrep_evs_delayed | fbebe800-59e2-11e4-85ec-7698aa6cc406:tcp://192.168.90.2:4567:255
| wsrep_evs_evict_list |
| wsrep_evs_repl_latency | 1.8112/1.8112/1.8112/0/1
| wsrep_evs_state | GATHER
(...)
| wsrep_cluster_status | non-Primary
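To watch just these counters, a single wildcard query works; a sketch using the same SHOW STATUS approach as above (the wsrep_evs_% pattern matches the four EVS variables shown):

mysql> show global status like 'wsrep_evs_%';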
So even without the auto-eviction functionality, there is a much better chance that a cluster will auto-recover after an intermittent network problem.
After I enabled the new evs.version=1 and set evs.auto_evict=25 on all nodes, the cluster still had flapping problems because of the single bad node, but as soon as the wsrep_evs_delayed counter reached 25 for this node, it was properly evicted from the cluster, and since then no more problems were observed. The bad node's UUID appears in the wsrep_evs_evict_list:
| wsrep_evs_evict_list | 572af5eb-5dd2-11e4-8f67-4ed3860f88c4
and in the error log on the bad node we can see:
2014-10-27 13:15:04 19941 [Note] WSREP: (572af5eb, 'tcp://0.0.0.0:4567') address 'tcp://192.168.90.2:4567' pointing to uuid 572af5eb is blacklisted, skipping
(...)
2014-10-27 13:15:06 19941 [Warning] WSREP: handshake with a292793c tcp://192.168.90.4:4567 failed: 'evicted'
2014-10-27 13:15:06 19941 [Warning] WSREP: handling gmcast protocol message failed: this node has been evicted out of the cluster, gcomm backend restart is required (FATAL)
at gcomm/src/gmcast_proto.cpp:handle_failed():208
2014-10-27 13:15:06 19941 [ERROR] WSREP: exception from gcomm, backend must be restarted: this node has been evicted out of the cluster, gcomm backend restart is required (FATAL)
at gcomm/src/gmcast_proto.cpp:handle_failed():208
2014-10-27 13:15:06 19941 [Note] WSREP: gcomm: terminating thread
2014-10-27 13:15:06 19941 [Note] WSREP: gcomm: joining thread
2014-10-27 13:15:06 19941 [Note] WSREP: gcomm: closing backend
2014-10-27 13:15:06 19941 [Note] WSREP: Forced PC close
2014-10-27 13:15:06 19941 [Note] WSREP: gcomm: closed
(...)
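To confirm the eviction from one of the surviving nodes, the same status variables can be checked again; a sketch:

mysql> show global status like 'wsrep_evs_evict_list';
mysql> show global status like 'wsrep_cluster_size';

The evicted node's UUID should show up in the first, and the cluster size should be reduced accordingly.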
So the eviction function seems to work as expected. I have some comments, though:
* All the nodes should have evs.version=1 and evs.auto_evict set (see the configuration sketch after this list); in my test, when only half of the nodes had them, the bad node was not entirely evicted and the cluster ended up in an endless non-Primary state.
* A normal, clean node restart can increase the wsrep_evs_delayed counter by 1, so beware of setting evs.auto_evict to very low values.
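For reference, a minimal my.cnf sketch of the settings used above, to be applied on every node. Note that wsrep_provider_options is a single semicolon-separated string, so these values have to be merged with any provider options you already use:

[mysqld]
wsrep_provider_options="evs.version=1;evs.auto_evict=25"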