In case when the gvwstate.dat ends up empty, the PC recovery procedure claims the restore from disk is successful, and later fails with confusing error. Why the file got truncated is unclear, but it was observed in some cases.
First of all - when the file is empty or not complete - it should not return "WSREP: restore pc from disk successfully" message, there should be some initial check.
Secondly, IMHO when the gvwstate.dat is wrong, pc.recovery procedure should be aborted, and normal joining should be attempted.
This is example err log sequence when the file is present, but empty (PXC 5.7.16):
2017-03-02T09:40:40.231158Z 0 [Note] WSREP: restore pc from disk successfully
2017-03-02T09:40:40.231275Z 0 [Note] WSREP: GMCast version 0
2017-03-02T09:40:40.231608Z 0 [Note] WSREP: (00000000, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2017-03-02T09:40:40.231616Z 0 [Note] WSREP: (00000000, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2017-03-02T09:40:40.231687Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 131: invalid UUID: 00000000 (FATAL)
at gcomm/src/pc.cpp:PC():267
2017-03-02T09:40:40.231702Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -131 (State not recoverable)
2017-03-02T09:40:40.231731Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1407: Failed to open channel 'galera_cluster' at 'gcomm://172.17.0.4,172.17.0.5,172.17.0.2,172.17.0.3': -131 (State not recoverable)
2017-03-02T09:40:40.231737Z 0 [ERROR] WSREP: gcs connect failed: State not recoverable
2017-03-02T09:40:40.231740Z 0 [ERROR] WSREP: wsrep::connect(gcomm://172.17.0.4,172.17.0.5,172.17.0.2,172.17.0.3) failed: 7
2017-03-02T09:40:40.231743Z 0 [ERROR] Aborting
2017-03-02T09:40:40.231747Z 0 [Note] Forcefully disconnecting 0 remaining clients
2017-03-02T09:40:40.231758Z 0 [Note] WSREP: Service disconnected.
2017-03-02T09:40:41.231829Z 0 [Note] WSREP: Some threads may fail to exit.
2017-03-02T09:40:41.231862Z 0 [Note] Binlog end
2017-03-02T09:40:41.231980Z 0 [Note] /usr/sbin/mysqld: Shutdown complete
Worth noting, that grastate.dat was still OK:
[root@11310f5a0bf9 /]# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 3efe5400-aa6d-11e6-b772-625f9abee4ba
seqno: 750385
safe_to_bootstrap: 0
Error messages are less confusing when the gvwstate.dat is simply gone, also node attempts to connect and is looking for PC to join:
2017-03-02T09:57:35.258700Z 0 [Note] WSREP: restore pc from disk failed
2017-03-02T09:57:35.259193Z 0 [Note] WSREP: GMCast version 0
2017-03-02T09:57:35.259525Z 0 [Note] WSREP: (a922cb8f, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2017-03-02T09:57:35.259532Z 0 [Note] WSREP: (a922cb8f, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2017-03-02T09:57:35.259838Z 0 [Note] WSREP: EVS version 0
2017-03-02T09:57:35.259927Z 0 [Note] WSREP: gcomm: connecting to group 'galera_cluster', peer '172.17.0.4:,172.17.0.5:,172.17.0.2:,172.17.0.3:'
2017-03-02T09:57:35.261257Z 0 [Note] WSREP: (a922cb8f, 'tcp://0.0.0.0:4567') connection established to 7a337039 tcp://172.17.0.3:4567
2017-03-02T09:57:35.261362Z 0 [Note] WSREP: (a922cb8f, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2017-03-02T09:57:35.261391Z 0 [Note] WSREP: (a922cb8f, 'tcp://0.0.0.0:4567') connection established to 597ad126 tcp://172.17.0.2:4567
2017-03-02T09:57:35.387108Z 0 [Note] WSREP: declaring 597ad126 at tcp://172.17.0.2:4567 stable
2017-03-02T09:57:35.387144Z 0 [Note] WSREP: declaring 7a337039 at tcp://172.17.0.3:4567 stable
2017-03-02T09:57:35.387653Z 0 [Warning] WSREP: no nodes coming from prim view, prim not possible
2017-03-02T09:57:35.387690Z 0 [Note] WSREP: view(view_id(NON_PRIM,597ad126,328) memb {
597ad126,1
7a337039,3
a922cb8f,0
} joined {
} left {
} partitioned {
bc8f69cc,0
})
2017-03-02T09:57:38.761226Z 0 [Note] WSREP: (a922cb8f, 'tcp://0.0.0.0:4567') turning message relay requesting off
2017-03-02T09:58:05.766360Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at gcomm/src/pc.cpp:connect():158
2017-03-02T09:58:05.766386Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-03-02T09:58:05.766493Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1407: Failed to open channel 'galera_cluster' at 'gcomm://172.17.0.4,172.17.0.5,172.17.0.2,172.17.0.3': -110 (Connection timed out)
2017-03-02T09:58:05.766502Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-03-02T09:58:05.766507Z 0 [ERROR] WSREP: wsrep::connect(gcomm://172.17.0.4,172.17.0.5,172.17.0.2,172.17.0.3) failed: 7
2017-03-02T09:58:05.766511Z 0 [ERROR] Aborting
Related upstream bug: https:/ /github. com/codership/ galera/ issues/ 354