This was fixed by calling kill_mysql() in wsrep_replication_process() if wsrep->recv() returns with an error.
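A minimal sketch of the shape of the change, assuming the wsrep API signatures used in this tree (wsrep->recv() returning a wsrep_status_t, the applier THD passed as the recv context); the exact in-tree code differs:

static void wsrep_replication_process(THD *thd)
{
  /* recv() blocks, applying replicated write sets, and returns only
     when the provider hits an error (e.g. a broken group connection). */
  wsrep_status_t rcode= wsrep->recv(wsrep, (void *) thd);

  if (rcode != WSREP_OK)
  {
    sql_print_error("wsrep recv thread exiting with status: %d", rcode);
    sql_print_error("starting shutdown");
    /* Initiate normal mysqld shutdown; shows up as signal 15 in the log. */
    kill_mysql();
  }
}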
The fix was tested with a 3-node cluster, where one node was disconnected from vsbes by taking the NIC down (ifdown eth1). There was constant sqlgen load against the cluster during this test.
The node ran in debug mode, and the following log messages show how the disconnect was detected and resulted in shutdown:
:recv_nointr(): Return 113 (No route to host) in header recv
091004 12:47:17 [ERROR] vs_remote_backend.cpp:handle_up():27: VSRBackend::handle_up(): Transport failed
091004 12:47:17 [ERROR] gcs_vs.cpp:conn_run():359: poll error: 'broken backend connection', thread exiting
091004 12:47:17 [Note] DEBUG: gcs_core.c:core_msg_recv():407: returning -107: Transport endpoint is not connected
091004 12:47:17 [Note] DEBUG: gcs.c:gcs_recv_thread():471: gcs_core_recv returned -107: Transport endpoint is not connected
091004 12:47:17 [Note] gcs.c:gcs_recv_thread():573: RECV thread exiting -107: Transport endpoint is not connected
091004 12:47:17 [ERROR] mm_galera.c:mm_galera_recv():1257: gcs_recv() returned 0 (Success)
091004 12:47:17 [ERROR] wsrep recv thread exiting with status: 5
091004 12:47:17 [ERROR] starting shutdown
091004 12:47:17 [Note] Got signal 15 to shutdown mysqld
091004 12:47:17 [Note] /home/galera/mysql-5.1.38-2894/mysql/libexec/mysqld: Normal shutdown
091004 12:47:17 [Note] Before Lock_thread_count
091004 12:47:17 [Note] After lock_thread_count
091004 12:47:17 [Warning] WSREP rollback thread wakes for signal
091004 12:47:17 [Note] Event Scheduler: Purging the queue. 0 events
091004 12:47:17 [Warning] WSREP rollback thread has empty abort queue
091004 12:47:17 [Note] WSREP: rollbacker thread exiting
091004 12:47:17 [Note] wsrep closing connection to cluster
091004 12:47:17 [Note] DEBUG: vs_remote_backend.cpp:leave():150: VSRBackend::leave(): (3,0,1)
091004 12:47:17 [Note] DEBUG: gcs_vs.cpp:gcs_vs_destroy():412: received: 32782, copied: 2020
091004 12:47:17 [Note] DEBUG: gcs_vs.cpp:gcs_vs_destroy():417: gcs_vs_close(): return 0
091004 12:47:17 [Note] DEBUG: gcs.c:gcs_close():655: recv_thread() joined.
091004 12:47:17 [ERROR] mm_galera.c:mm_galera_pre_commit():1680: gcs failed for: 176029, len: 1592, rcode: -4
091004 12:47:17 [ERROR] mm_galera.c:mm_galera_pre_commit():1680: gcs failed for: 176060, len: 1212, rcode: -4
091004 12:47:17 [ERROR] mm_galera.c:mm_galera_pre_commit():1680: gcs failed for: 176065, len: 448, rcode: -4
091004 12:47:17 [ERROR] WSREP connection failure
091004 12:47:17 [Note] gcs.c:gcs_close():682: Closing slave action queue.
091004 12:47:17 [ERROR] WSREP connection failure
091004 12:47:17 [ERROR] WSREP connection failure
091004 12:47:17 [Warning] MySQL is closing a connection that has an active InnoDB transaction. 1 row modifications will roll back.
091004 12:47:17 [ERROR] WSREP connection failure
091004 12:47:17 [Note] mm_galera.c:mm_galera_disconnect():405: Closed GCS connection
091004 12:47:17 [Warning] MySQL is closing a connection that has an active InnoDB transaction. 4 row modifications will roll back.
091004 12:47:17 [Warning] MySQL is closing a connection that has an active InnoDB transaction. 1 row modifications will roll back.
091004 12:47:17 [Warning] MySQL is closing a connection that has an active InnoDB transaction. 3 row modifications will roll back.
091004 12:47:22 [Note] wsdb.c:wsdb_close():66: closing wsdb
091004 12:47:22 [Note] local.c:local_close():206: mem usage for trx hash: 80080
091004 12:47:22 [Note] local.c:local_close():209: mem usage for local cache: 21135376
091004 12:47:22 [Note] wsdb.c:wsdb_close():72: closing certification module
091004 12:47:22 [Note] certification.c:wsdb_cert_close():593: mem usage for table_idex: 8000
091004 12:47:22 [Note] certification.c:wsdb_cert_close():595: mem usage for key_idex: 2106948
091004 12:47:22 [Note] certification.c:wsdb_cert_close():599: active seqno list len: 221, size: 15931
091004 12:47:22 [Note] certification.c:wsdb_cert_close():605: purged 0 entries from table index
091004 12:47:22 [Note] certification.c:wsdb_cert_close():611: purged 316 entries from key index
091004 12:47:22 InnoDB: Starting shutdown...
091004 12:47:24 InnoDB: Shutdown completed; log sequence number 0 631205916
091004 12:47:24 [Note] /home/galera/mysql-5.1.38-2894/mysql/libexec/mysqld: Shutdown complete
Note the 15-minute delay in detecting the disconnect:
091004 12:31:42 [Note] DEBUG: mm_galera.c:mm_galera_recv():1265: worker: 0 with seqno: (-1 - 32514) type: GCS_ACT_COMMIT_CUT recvd
:recv_nointr(): Return 113 (No route to host) in header recv
091004 12:47:17 [ERROR] vs_remote_backend.cpp:handle_up():27: VSRBackend::handle_up(): Transport failed
The long failure detection delay is obviously an issue of its own and will be tracked as a separate issue.