Galera

terminate called after throwing an instance of 'std::out_of_range'

Bug #1232747 reported by jolan on 2013-09-29

This bug affects 1 person

	Status	Importance	Assigned to	Milestone
Galera	Status tracked in 3.x
2.x	Fix Released	High	Teemu Ollakka	Galera 25.2.9
3.x	Fix Released	High	Teemu Ollakka	Galera 25.3.3
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC	Status tracked in 5.6
5.5	Fix Released	Undecided	Unassigned	Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC 5.5.37-25.10
5.6	Fix Released	Undecided	Unassigned	Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC 5.6.15-25.3

Bug Description

Hi,

All three nodes of our Percona Cluster received exceptions in the last few hours which caused all nodes on the cluster to shut down.

dbw0 was running percona-xtradb-cluster-server-5.5 5.5.33-23.7.6-495.raring
dbc0 was running percona-xtradb-cluster-server-5.5 5.5.33-23.7.6-496.raring
dbe0 was running percona-xtradb-cluster-server-5.5 5.5.31-23.7.5-438.raring

We were planning on upgrading all nodes to the 496 build of 5.5.33, but this happened before we could schedule a time to do that.

dbw0 received the exception in the summary.
dbc0/dbe0 received wsrep exceptions below.

dbw0:

terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check
04:14:30 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=140
max_threads=153
thread_count=136
connection_count=136
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 343054 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x7dc9ae]
/usr/sbin/mysqld(handle_fatal_signal+0x491)[0x6beed1]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfbd0)[0x7f8f9fcc8bd0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f8f9f2f0037]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f8f9f2f3698]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x11d)[0x7f8f9d585e8d]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5ef76)[0x7f8f9d583f76]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5efa3)[0x7f8f9d583fa3]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5f1de)[0x7f8f9d5841de]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZSt20__throw_out_of_rangePKc+0x5d)[0x7f8f9d5d69ad]
/usr/lib/libgalera_smm.so(+0xd90f6)[0x7f8f9db090f6]
/usr/lib/libgalera_smm.so(_ZN5gcomm3evs5Proto19update_im_safe_seqsERKNS0_15MessageNodeListE+0x121)[0x7f8f9db09b41]
/usr/lib/libgalera_smm.so(_ZN5gcomm3evs5Proto11handle_joinERKNS0_11JoinMessageESt17_Rb_tree_iteratorISt4pairIKNS_4UUIDENS0_4NodeEEE+0xb60)[0x7f8f9db199a0]
/usr/lib/libgalera_smm.so(_ZN5gcomm3evs5Proto10handle_msgERKNS0_7MessageERKNS_8DatagramE+0x387)[0x7f8f9db22407]
/usr/lib/libgalera_smm.so(_ZN5gcomm3evs5Proto9handle_upEPKvRKNS_8DatagramERKNS_11ProtoUpMetaE+0x27b)[0x7f8f9db22dcb]
/usr/lib/libgalera_smm.so(_ZN5gcomm8Protolay7send_upERKNS_8DatagramERKNS_11ProtoUpMetaE+0x36)[0x7f8f9db24746]
/usr/lib/libgalera_smm.so(_ZN5gcomm6GMCast9handle_upEPKvRKNS_8DatagramERKNS_11ProtoUpMetaE+0x22a)[0x7f8f9db37e4a]
/usr/lib/libgalera_smm.so(_ZN5gcomm10Protostack8dispatchEPKvRKNS_8DatagramERKNS_11ProtoUpMetaE+0x58)[0x7f8f9db5ddd8]
/usr/lib/libgalera_smm.so(_ZN5gcomm12AsioProtonet8dispatchERKPKvRKNS_8DatagramERKNS_11ProtoUpMetaE+0x4b)[0x7f8f9db85e8b]
/usr/lib/libgalera_smm.so(_ZN5gcomm13AsioTcpSocket12read_handlerERKN4asio10error_codeEm+0x7a6)[0x7f8f9db688e6]
/usr/lib/libgalera_smm.so(_ZN4asio6detail7read_opINS_19basic_stream_socketINS_2ip3tcpENS_21stream_socket_serviceIS4_EEEEN5boost5arrayINS_14mutable_bufferELm1EEENS8_3_bi6bind_tImNS8_4_mfi3mf2ImN5gcomm13AsioTcpSocketERKNS_10error_codeEmEENSC_5list3INSC_5valueINS8_10shared_ptrISH_EEEEPFNS8_3argILi1EEEvEPFNSR_ILi2EEEvEEEEENSD_IvNSF_IvSH_SK_mEESY_EEEclESK_mi+0x94)[0x7f8f9db771c4]
/usr/lib/libgalera_smm.so(_ZN4asio6detail23reactive_socket_recv_opINS0_17consuming_buffersINS_14mutable_bufferEN5boost5arrayIS3_Lm1EEEEENS0_7read_opINS_19basic_stream_socketINS_2ip3tcpENS_21stream_socket_serviceISB_EEEES6_NS4_3_bi6bind_tImNS4_4_mfi3mf2ImN5gcomm13AsioTcpSocketERKNS_10error_codeEmEENSF_5list3INSF_5valueINS4_10shared_ptrISK_EEEEPFNS4_3argILi1EEEvEPFNSU_ILi2EEEvEEEEENSG_IvNSI_IvSK_SN_mEES11_EEEEE11do_completeEPNS0_15task_io_serviceEPNS0_25task_io_service_operationESL_m+0xdd)[0x7f8f9db7750d]
/usr/lib/libgalera_smm.so(_ZN4asio6detail15task_io_service3runERNS_10error_codeE+0x407)[0x7f8f9db89517]
/usr/lib/libgalera_smm.so(_ZN5gcomm12AsioProtonet10event_loopERKN2gu8datetime6PeriodE+0x1b0)[0x7f8f9db872f0]
/usr/lib/libgalera_smm.so(_ZN9GCommConn3runEv+0x5e)[0x7f8f9db9fe5e]
/usr/lib/libgalera_smm.so(_ZN9GCommConn6run_fnEPv+0x9)[0x7f8f9dba3139]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7f8e)[0x7f8f9fcc0f8e]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f8f9f3b2e1d]
You may download the Percona Server operations manual by visiting
http://www.percona.com/software/percona-server/. You may find information
in the manual which will help you identify the cause of the crash.
130929 04:14:30 mysqld_safe Number of processes running now: 0
130929 04:14:30 mysqld_safe WSREP: not restarting wsrep node automatically
130929 04:14:30 mysqld_safe mysqld from pid file /var/lib/mysql/dbw0.connected.cc.pid ended

dbc0:

130929 4:14:30 [ERROR] WSREP: exception caused by message: evs::msg{version=0,type=4,user_type=255,order=1,seq=3,seq_range=-1,aru_seq=3,flags=4,source=447a086c-2456-11e3-b1d8-bfcbd6dd35cd,source_view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=10620186,node_list=( 447a086c-2456-11e3-b1d8-bfcbd6dd35cd,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=3,im_range=[4,3],}
        e10fe36e-2559-11e3-8f39-661947538c85,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,00000000-0000-0000-0000-000000000000,0),safe_seq=-1,im_range=[-1,-1],}
)
}
state after handling message: evs::proto(evs::proto(e10fe36e-2559-11e3-8f39-661947538c85, GATHER, view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94)), GATHER) {
current_view=view(view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94) memb {
        239e4407-0ce7-11e3-9bee-3be82ca03086,
        447a086c-2456-11e3-b1d8-bfcbd6dd35cd,
        e10fe36e-2559-11e3-8f39-661947538c85,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=-1,safe_seq=-1,node_index=node: {idx=0,range=[6,5],safe_seq=-1} node: {idx=1,range=[0,3],safe_seq=3} node: {idx=2,range=[6,5],safe_seq=-1} },
fifo_seq=7949526,
last_sent=5,
known={
        239e4407-0ce7-11e3-9bee-3be82ca03086,evs::node{operational=1,suspected=0,installed=0,fifo_seq=66278528,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=-1,seq_range=-1,aru_seq=-1,flags=4,source=239e4407-0ce7-11e3-9bee-3be82ca03086,source_view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=66278527,node_list=( 239e4407-0ce7-11e3-9bee-3be82ca03086,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=-1,im_range=[6,5],}
        447a086c-2456-11e3-b1d8-bfcbd6dd35cd,node: {operational=1,suspected=1,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=-1,im_range=[0,-1],}
        e10fe36e-2559-11e3-8f39-661947538c85,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=-1,im_range=[6,5],}
)
},
}
        447a086c-2456-11e3-b1d8-bfcbd6dd35cd,evs::node{operational=1,suspected=0,installed=0,fifo_seq=10620186,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=3,seq_range=-1,aru_seq=3,flags=4,source=447a086c-2456-11e3-b1d8-bfcbd6dd35cd,source_view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=10620186,node_list=( 447a086c-2456-11e3-b1d8-bfcbd6dd35cd,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=3,im_range=[4,3],}
        e10fe36e-2559-11e3-8f39-661947538c85,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,00000000-0000-0000-0000-000000000000,0),safe_seq=-1,im_range=[-1,-1],}
)
},
}
        e10fe36e-2559-11e3-8f39-661947538c85,evs::node{operational=1,suspected=0,installed=0,fifo_seq=-1,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=-1,seq_range=-1,aru_seq=-1,flags=0,source=e10fe36e-2559-11e3-8f39-661947538c85,source_view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=7949526,node_list=( 239e4407-0ce7-11e3-9bee-3be82ca03086,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=-1,im_range=[6,5],}
        447a086c-2456-11e3-b1d8-bfcbd6dd35cd,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=3,im_range=[0,3],}
        e10fe36e-2559-11e3-8f39-661947538c85,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=-1,im_range=[6,5],}
)
},
}
}
}130929 4:14:30 [ERROR] WSREP: exception from gcomm, backend must be restarted:nlself_i != same_view.end(): (FATAL)
         at gcomm/src/evs_proto.cpp:handle_join():3530
130929 4:14:30 [Note] WSREP: Received self-leave message.
130929 4:14:30 [Note] WSREP: Flow-control interval: [0, 0]
130929 4:14:30 [Note] WSREP: Received SELF-LEAVE. Closing connection.
130929 4:14:30 [Note] WSREP: Shifting SYNCED -> CLOSED (TO: 27410502)
130929 4:14:30 [Note] WSREP: RECV thread exiting 0: Success
130929 4:14:30 [Note] WSREP: New cluster view: global state: 73a23f83-0c81-11e3-9fc4-a2a855d3a912:27410502, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 2
130929 4:14:30 [Note] WSREP: Setting wsrep_ready to 0
130929 4:14:30 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130929 4:14:30 [Note] WSREP: applier thread exiting (code:0)
130929 4:14:30 [Note] WSREP: closing applier 6

dbe0:

130928 23:14:30 [Note] WSREP: evs::proto(239e4407-0ce7-11e3-9bee-3be82ca03086, GATHER, view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94)) suspecting node: 447a086c-2456-11e3-b1d8-bfcbd6dd35cd
130928 23:14:30 [ERROR] WSREP: exception caused by message: evs::msg{version=0,type=4,user_type=255,order=1,seq=3,seq_range=-1,aru_seq=3,flags=4,source=447a086c-2456-11e3-b1d8-bfcbd6dd35cd,source_view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=10620186,node_list=( 447a086c-2456-11e3-b1d8-bfcbd6dd35cd,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=3,im_range=[4,3],}
        e10fe36e-2559-11e3-8f39-661947538c85,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,00000000-0000-0000-0000-000000000000,0),safe_seq=-1,im_range=[-1,-1],}
)
}
state after handling message: evs::proto(evs::proto(239e4407-0ce7-11e3-9bee-3be82ca03086, GATHER, view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94)), GATHER) {
current_view=view(view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94) memb {
        239e4407-0ce7-11e3-9bee-3be82ca03086,
        447a086c-2456-11e3-b1d8-bfcbd6dd35cd,
        e10fe36e-2559-11e3-8f39-661947538c85,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=-1,safe_seq=-1,node_index=node: {idx=0,range=[6,5],safe_seq=-1} node: {idx=1,range=[0,3],safe_seq=3} node: {idx=2,range=[6,5],safe_seq=-1} },
fifo_seq=66278530,
last_sent=5,
known={
        239e4407-0ce7-11e3-9bee-3be82ca03086,evs::node{operational=1,suspected=0,installed=0,fifo_seq=-1,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=-1,seq_range=-1,aru_seq=-1,flags=0,source=239e4407-0ce7-11e3-9bee-3be82ca03086,source_view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=66278530,node_list=( 239e4407-0ce7-11e3-9bee-3be82ca03086,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=-1,im_range=[6,5],}
        447a086c-2456-11e3-b1d8-bfcbd6dd35cd,node: {operational=1,suspected=1,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=3,im_range=[0,3],}
        e10fe36e-2559-11e3-8f39-661947538c85,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=-1,im_range=[6,5],}
)
},
}
        447a086c-2456-11e3-b1d8-bfcbd6dd35cd,evs::node{operational=1,suspected=1,installed=0,fifo_seq=10620186,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=3,seq_range=-1,aru_seq=3,flags=4,source=447a086c-2456-11e3-b1d8-bfcbd6dd35cd,source_view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=10620186,node_list=( 447a086c-2456-11e3-b1d8-bfcbd6dd35cd,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=3,im_range=[4,3],}
        e10fe36e-2559-11e3-8f39-661947538c85,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,00000000-0000-0000-0000-000000000000,0),safe_seq=-1,im_range=[-1,-1],}
)
},
}
        e10fe36e-2559-11e3-8f39-661947538c85,evs::node{operational=1,suspected=0,installed=0,fifo_seq=7949526,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=-1,seq_range=-1,aru_seq=-1,flags=4,source=e10fe36e-2559-11e3-8f39-661947538c85,source_view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=7949526,node_list=( 239e4407-0ce7-11e3-9bee-3be82ca03086,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=-1,im_range=[6,5],}
        447a086c-2456-11e3-b1d8-bfcbd6dd35cd,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=3,im_range=[0,3],}
        e10fe36e-2559-11e3-8f39-661947538c85,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=-1,im_range=[6,5],}
)
},
}
}
}130928 23:14:30 [ERROR] WSREP: exception from gcomm, backend must be restarted:nlself_i != same_view.end(): (FATAL)
         at gcomm/src/evs_proto.cpp:handle_join():3530
130928 23:14:30 [Note] WSREP: Received self-leave message.
130928 23:14:30 [Note] WSREP: Flow-control interval: [0, 0]
130928 23:14:30 [Note] WSREP: Received SELF-LEAVE. Closing connection.
130928 23:14:30 [Note] WSREP: Shifting SYNCED -> CLOSED (TO: 27410502)
130928 23:14:30 [Note] WSREP: RECV thread exiting 0: Success
130928 23:14:30 [Note] WSREP: New cluster view: global state: 73a23f83-0c81-11e3-9fc4-a2a855d3a912:27410502, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 2
130928 23:14:30 [Note] WSREP: Setting wsrep_ready to 0
130928 23:14:30 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130928 23:14:30 [Note] WSREP: applier thread exiting (code:0)
130928 23:14:30 [Note] WSREP: closing applier 10

Related branches

lp://staging/~codership/galera/23.2.x (Merged)

lp://staging/galera/2.x

lp://staging/galera

Ready for review for merging into lp://staging/~dbpercona/galera/Bug1348714

David Bennett: Pending requested 2014-07-25

Alex Yurchenko (ayurchen) on 2013-09-29

Changed in galera:
assignee:	nobody → Teemu Ollakka (teemu-ollakka)

Raghavendra D Prabhu (raghavendra-prabhu) on 2013-09-30

Changed in percona-xtradb-cluster:
milestone:	none → 5.5.34-23.7.6

Revision history for this message

Teemu Ollakka (teemu-ollakka) wrote on 2013-09-30:

Hi Jolan,

Could you provide a bit more context about what happened just before the crashes? How long had the nodes been running, had any of the nodes been restarted recently, were there any signs of network issues, etc.?

If there was some activity in the error logs just before the crashes (within one minute or so), it would be interesting to see that too.

jolan (jolan) wrote on 2013-09-30:

dbw0 log (86.1 KiB, text/plain)

jolan (jolan) wrote on 2013-09-30:

dbe0 log (129.4 KiB, text/plain)

jolan (jolan) wrote on 2013-09-30:

dbc0 log (100.4 KiB, text/plain)

jolan (jolan) wrote on 2013-09-30:

I attached the logs. There definitely was some sort of network event that preceded the crash and exceptions.

Also, all 3 nodes quit at the same time. dbe0's timezone was set to US Central instead of UTC so the hours don't match up but the minutes/seconds do.
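
The timestamp mismatch can be checked mechanically. The following is only a minimal sketch: it assumes dbe0 was logging in US Central Daylight Time (UTC-5, the offset in effect in late September 2013) and that dbc0 was logging in UTC. The helper name parse_mysql_ts is made up for illustration; the two timestamps are the ones from the dbc0 and dbe0 excerpts in the bug description.

```python
from datetime import datetime, timedelta, timezone

# Assumption: dbe0 logged in US Central Daylight Time (UTC-5);
# dbc0 logged in UTC.
CDT = timezone(timedelta(hours=-5), "CDT")


def parse_mysql_ts(stamp: str, tz: timezone) -> datetime:
    """Parse a 'YYMMDD H:MM:SS' error-log timestamp recorded in zone tz."""
    return datetime.strptime(stamp, "%y%m%d %H:%M:%S").replace(tzinfo=tz)


dbc0_ts = parse_mysql_ts("130929 4:14:30", timezone.utc)
dbe0_ts = parse_mysql_ts("130928 23:14:30", CDT)

# Aware datetimes compare by absolute instant, so this checks whether
# both nodes logged the exception at the same moment.
print(dbe0_ts.astimezone(timezone.utc).isoformat())  # 2013-09-29T04:14:30+00:00
print(dbc0_ts == dbe0_ts)                            # True
```

Under that assumption, the dbe0 entry at 23:14:30 local time corresponds to 04:14:30 UTC, the same second as the entries on dbw0 and dbc0.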

Our cluster is comprised of 3 WAN nodes which are hosted in separate data centers.

dbw0 - (Dallas, TX Linode) - does reads/writes, 20ms latency to dbc0, 40ms latency to dbe0
dbc0 - (Atlanta, GA Linode) - only does mysqldump backups, 20ms latency to both dbe0/dbw0
dbe0 - (Newark, NJ Linode) - does reads/writes, 20ms latency to dbc0, 40ms latency to dbw0

Cluster was created 5 weeks ago.

dbe0 had been running percona-xtradb-cluster-server-5.5 5.5.31-23.7.5-438.raring for the whole 5 weeks.
dbc0 had been running for 4.5 days (was upgraded and restarted)
dbw0 had been running for 5.5 days (was upgraded and restarted)

We have had network hiccups before but they usually result in one node disappearing for a short time and then re-joining without incident.

There are a couple of "[Warning] WSREP: Quorum: No node with complete state:" warnings in the logs I attached which we haven't seen before.

Teemu Ollakka (teemu-ollakka) on 2013-10-07

Changed in galera:
status:	New → Confirmed

Teemu Ollakka (teemu-ollakka) wrote on 2013-10-07:

Thanks for the logs; they revealed the reason for the crash.

Apparently, if the network partitions at a certain point of group negotiation, one of the partitioned components may form a group with an invalid group ID, which in turn causes the group IDs of the partitioned components to be identical. The crash happens when network connectivity returns and the group tries to remerge.
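
As a rough illustration of how this shows up in the logs above, here is a hypothetical, heavily simplified model of the failed check "nlself_i != same_view.end()". It is not Galera's actual code, and the function name and data layout are assumptions. The data comes from the join message logged on dbc0: the sender 447a086c claims view 94, lists e10fe36e with a nil view ID, and does not list 239e4407 at all. UUIDs are shortened to their first group.

```python
# Hypothetical, simplified model of the failed lookup; NOT Galera's code.
CURRENT_VIEW = ("239e4407", 94)   # view_id(REG,239e4407-...,94)
NIL_VIEW = ("00000000", 0)        # view_id(REG,00000000-...,0)

# node_list carried by the join message from 447a086c, as logged on dbc0.
JOIN_NODE_LIST = {
    "447a086c": CURRENT_VIEW,
    "e10fe36e": NIL_VIEW,
}


def find_self_in_same_view(node_list, view_id, self_uuid):
    """Return this node's entry among the nodes the sender places in view_id."""
    same_view = {u: v for u, v in node_list.items() if v == view_id}
    if self_uuid not in same_view:
        # Stands in for the failed check "nlself_i != same_view.end()".
        raise LookupError(f"{self_uuid} not found in same-view node list")
    return same_view[self_uuid]


# The receivers (e10fe36e on dbc0, 239e4407 on dbe0) share the sender's
# view ID but cannot find themselves in the sender's same-view list.
for receiver in ("e10fe36e", "239e4407"):
    try:
        find_self_in_same_view(JOIN_NODE_LIST, CURRENT_VIEW, receiver)
    except LookupError as exc:
        print("fatal:", exc)
```

In this model the sender and the receivers carry the same view ID, yet the sender's node list no longer agrees with the receivers' membership, which is one plausible reading of why the remerge fails on dbc0 and dbe0.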

Raghavendra D Prabhu (raghavendra-prabhu) on 2013-11-05