contrail-collector crashes immediately after provisioning.
root@server3:/var/crashes# gdb vizd core.contrail-collec.24997.server3.1520989531
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.3) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from vizd...done.
warning: core file may not match specified executable file.
[New LWP 24997]
[New LWP 25026]
[New LWP 25033]
[New LWP 25036]
[New LWP 25031]
[New LWP 25030]
[New LWP 25034]
[New LWP 25035]
[New LWP 25028]
[New LWP 25032]
[New LWP 25027]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/contrail-collector'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 size (this=0x1ae7fd0) at /usr/include/c++/4.8/bits/basic_string.h:716
716 { return _M_rep()->_M_length; }
(gdb) bt
#0 size (this=0x1ae7fd0) at /usr/include/c++/4.8/bits/basic_string.h:716
#1 compare (__str=..., this=0x1ae7fd0) at /usr/include/c++/4.8/bits/basic_string.h:2227
#2 operator< <char, std::char_traits<char>, std::allocator<char> > (__rhs=..., __lhs=<error reading variable: Cannot access memory at address 0x55656372756f734d>)
at /usr/include/c++/4.8/bits/basic_string.h:2571
#3 operator() (this=<optimized out>, __y=..., __x=<error reading variable: Cannot access memory at address 0x55656372756f734d>) at /usr/include/c++/4.8/bits/stl_function.h:235
#4 _M_lower_bound (this=0x1ae7b10, __k="ssm::EvResourceUpdate", __y=<optimized out>, __x=0x1ae7fb0) at /usr/include/c++/4.8/bits/stl_tree.h:1141
#5 std::_Rb_tree<std::string, std::pair<std::string const, void*>, std::_Select1st<std::pair<std::string const, void*> >, std::less<std::string>, std::allocator<std::pair<std::string const, void*> > >::find (this=this@entry=0x1ae7b10, __k="ssm::EvResourceUpdate") at /usr/include/c++/4.8/bits/stl_tree.h:1792
#6 0x00000000007b02ef in find (__x="ssm::EvResourceUpdate", this=this@entry=0x1ae7b10) at /usr/include/c++/4.8/bits/stl_map.h:822
#7 find (x="ssm::EvResourceUpdate", this=this@entry=0x1ae7b10) at /usr/include/boost/ptr_container/ptr_map_adapter.hpp:278
#8 SandeshEventStatistics::Update (this=this@entry=0x1ae7b10, event_name="ssm::EvResourceUpdate", enqueue=enqueue@entry=true, fail=fail@entry=false)
at tools/sandesh/library/cpp/sandesh_statistics.cc:269
#9 0x000000000079a46c in SandeshStateMachine::UpdateEventStats (this=this@entry=0x1ae7820, event=..., enqueue=enqueue@entry=true, fail=fail@entry=false)
at tools/sandesh/library/cpp/sandesh_state_machine.cc:783
#10 0x00000000007a4025 in UpdateEventEnqueue (event=..., this=0x1ae7820) at tools/sandesh/library/cpp/sandesh_state_machine.cc:764
#11 SandeshStateMachine::Enqueue<ssm::EvResourceUpdate> (this=0x1ae7820, event=...) at tools/sandesh/library/cpp/sandesh_state_machine.cc:853
#12 0x000000000079a93a in SandeshStateMachine::ResourceUpdate (this=<optimized out>, rsc=rsc@entry=false) at tools/sandesh/library/cpp/sandesh_state_machine.cc:734
#13 0x00000000005f584e in Collector::RedisUpdate (this=0x1a56770, rsc=rsc@entry=false) at controller/src/analytics/collector.cc:127
#14 0x000000000066e113 in RedisUpdate (rsc=false, this=0x7ffd816cf290) at controller/src/analytics/viz_collector.h:78
#15 OpServerProxy::OpServerImpl::ToOpsConnDown (this=0x1a4edd0) at controller/src/analytics/OpServerProxy.cc:345
#16 0x000000000060b4c6 in operator() (this=0x1a54718) at /usr/include/boost/function/function_template.hpp:767
#17 RedisAsyncConnection::RAC_DisconnectCallbackProcess (this=0x1a54620, c=<optimized out>, status=<optimized out>) at controller/src/analytics/redis_connection.cc:163
#18 0x0000000000609b0d in operator() (a1=-1, a0=<optimized out>, this=0x7ffd816ce430) at /usr/include/boost/function/function_template.hpp:767
#19 RedisAsyncConnection::RAC_DisconnectCallback (c=0x1a552e0, status=-1) at controller/src/analytics/redis_connection.cc:186
#20 0x000000000082461b in __redisAsyncFree (ac=0x1a552e0) at build/third_party/hiredis/src/async.c:261
#21 0x00000000008262f9 in redisBoostClient::handle_read (this=0x1a54a80, ec=...) at build/third_party/hiredis/hiredis-boostasio-adapter/boostasio.cpp:62
#22 0x00000000008269c4 in call<boost::shared_ptr<redisBoostClient>, boost::system::error_code> (b1=<synthetic pointer>, u=..., this=<optimized out>) at /usr/include/boost/bind/mem_fn_template.hpp:156
#23 operator()<boost::shared_ptr<redisBoostClient> > (a1=..., u=..., this=<optimized out>) at /usr/include/boost/bind/mem_fn_template.hpp:171
#24 operator()<boost::_mfi::mf1<void, redisBoostClient, boost::system::error_code>, boost::_bi::list2<const boost::system::error_code&, long unsigned int const&> > (a=<synthetic pointer>, f=...,
this=<optimized out>) at /usr/include/boost/bind/bind.hpp:313
#25 operator()<boost::system::error_code, long unsigned int> (a2=<optimized out>, a1=..., this=<optimized out>) at /usr/include/boost/bind/bind_template.hpp:102
#26 operator() (this=<optimized out>) at /usr/include/boost/asio/detail/bind_handler.hpp:127
#27 asio_handler_invoke<boost::asio::detail::binder2<boost::_bi::bind_t<void, boost::_mfi::mf1<void, redisBoostClient, boost::system::error_code>, boost::_bi::list2<boost::_bi::value<boost::shared_ptr<redisBoostClient> >, boost::arg<1> (*)()> >, boost::system::error_code, unsigned long> > (function=...) at /usr/include/boost/asio/handler_invoke_hook.hpp:64
#28 invoke<boost::asio::detail::binder2<boost::_bi::bind_t<void, boost::_mfi::mf1<void, redisBoostClient, boost::system::error_code>, boost::_bi::list2<boost::_bi::value<boost::shared_ptr<redisBoostClient> >, boost::arg<1> (*)()> >, boost::system::error_code, unsigned long>, boost::_bi::bind_t<void, boost::_mfi::mf1<void, redisBoostClient, boost::system::error_code>, boost::_bi::list2<boost::_bi::value<boost::shared_ptr<redisBoostClient> >, boost::arg<1> (*)()> > > (context=..., function=...) at /usr/include/boost/asio/detail/handler_invoke_helpers.hpp:37
#29 boost::asio::detail::reactive_null_buffers_op<boost::_bi::bind_t<void, boost::_mfi::mf1<void, redisBoostClient, boost::system::error_code>, boost::_bi::list2<boost::_bi::value<boost::shared_ptr<redisBoostClient> >, boost::arg<1> (*)()> > >::do_complete (owner=<optimized out>, base=<optimized out>) at /usr/include/boost/asio/detail/reactive_null_buffers_op.hpp:75
#30 0x00000000006bd6ff in complete (bytes_transferred=0, ec=..., owner=..., this=<optimized out>) at /usr/include/boost/asio/detail/task_io_service_operation.hpp:37
#31 boost::asio::detail::epoll_reactor::descriptor_state::do_complete (owner=0x1a3e170, base=0x1a54ad0, ec=..., bytes_transferred=<optimized out>)
---Type <return> to continue, or q <return> to quit---q
at /usr/include/boost/asio/detail/impl/epoll_reactor.Quit
(gdb)
##################
Setup details:
##################
Multi-node cluster with 3 control + 2 compute nodes.
Contrail image used for installation:
-rw-r--r-- 1 root root 1135603496 Mar 13 19:26 contrail-install-packages_3.2.9.0-72~mitaka_all.deb
Server Manager image used for installation:
-rw-r--r-- 1 root root 197316126 Mar 13 19:27 contrail-server-manager-installer_3.2.9.0-72~ubuntu-14-04mitaka_all.deb
root@servermanager:~/sm_files# server-manager show cluster -d
{
"cluster": [
{
"base_image_id": "",
"email": "",
"id": "test-cluster",
"package_image_id": "",
"parameters": {
"domain": "englab.juniper.net",
"provision": {
"contrail": {
"database": {
"minimum_diskGB": 32
},
"enable_lbaas": true,
"kernel_upgrade": true,
"kernel_version": "3.13.0-142",
"xmpp_auth_enable": "true",
"xmpp_dns_auth_enable": "true"
},
"openstack": {
"ceilometer": {
"mongo": "*****",
"password": "*****"
},
"cinder": {
"password": "*****"
},
"glance": {
"password": "*****"
},
"ha": {
"external_vip": "10.0.0.200",
"external_virtual_router_id": 102,
"internal_vip": "10.10.0.200",
"internal_virtual_router_id": 103
},
"heat": {
"encryption_key": "*****",
"password": "*****"
},
"horizon": {
"password": "*****"
},
"keystone": {
"admin_password": "*****",
"admin_token": "*****",
"version": "v2.0"
},
"mysql": {
"root_password": "*****",
"service_password": "*****"
},
"neutron": {
"password": "*****"
},
"nova": {
"password": "*****"
},
"openstack_manage_amqp": "true",
"swift": {
"password": "*****"
}
}
},
"storage_fsid": "7cfe5380-f590-40a7-ab04-5ead1b14e12a",
"storage_virsh_uuid": "4f399257-b50a-4dcb-bcaf-d272e76df0b7",
"uuid": "b09f9027-608d-49d8-b36a-9a8521efe6b3"
},
"provision_role_sequence": "{'completed': [(u'server3', 'keepalived', '2018_03_14__00_15_04'), (u'server2', 'keepalived', '2018_03_14__00_15_46'), (u'server1', 'keepalived', '2018_03_14__00_25_41'), (u'server2', 'haproxy', '2018_03_14__00_26_02'), (u'server3', 'haproxy', '2018_03_14__00_26_02'), (u'server1', 'haproxy', '2018_03_14__00_26_08'), (u'server3', 'database', '2018_03_14__00_27_11'), (u'server2', 'database', '2018_03_14__00_27_12'), (u'server1', 'database', '2018_03_14__00_28_39'), (u'server3', 'openstack', '2018_03_14__00_45_10'), (u'server2', 'openstack', '2018_03_14__00_45_28'), (u'server1', 'openstack', '2018_03_14__00_52_05'), (u'server1', 'pre_exec_vnc_galera', '2018_03_14__00_54_22'), (u'server3', 'pre_exec_vnc_galera', '2018_03_14__00_55_44'), (u'server2', 'pre_exec_vnc_galera', '2018_03_14__00_57_35'), (u'server1', 'post_exec_vnc_galera', '2018_03_14__00_58_21'), (u'server3', 'post_exec_vnc_galera', '2018_03_14__00_58_59'), (u'server2', 'post_exec_vnc_galera', '2018_03_14__00_59_46'), (u'server3', 'config', '2018_03_14__01_02_26'), (u'server2', 'config', '2018_03_14__01_02_27'), (u'server1', 'config', '2018_03_14__01_02_42'), (u'server2', 'control', '2018_03_14__01_03_39'), (u'server3', 'control', '2018_03_14__01_03_41'), (u'server1', 'control', '2018_03_14__01_04_21'), (u'server2', 'collector', '2018_03_14__01_05_36'), (u'server3', 'collector', '2018_03_14__01_05_42'), (u'server1', 'collector', '2018_03_14__01_06_42'), (u'server2', 'webui', '2018_03_14__01_07_35'), (u'server3', 'webui', '2018_03_14__01_07_38'), (u'server1', 'webui', '2018_03_14__01_10_23'), (u'server3', 'post_provision', '2018_03_14__01_10_59'), (u'server2', 'post_provision', '2018_03_14__01_11_31'), (u'server1', 'post_provision', '2018_03_14__01_11_36'), (u'server5', 'compute', '2018_03_14__01_18_34'), (u'server5', 'post_provision', '2018_03_14__01_18_35'), (u'server4', 'compute', '2018_03_14__01_27_27'), (u'server4', 'post_provision', '2018_03_14__01_27_28')], 'steps': []}",
"provisioned_id": null
}
]
}
root@servermanager:~/sm_files#
Logs and Core file location:
/auto/cs-shared/bugs/1755649
This looks like a race condition. From the core dump:
When the state machine queue handles the resource update, the collector's generator has a NULL vsession, a NULL state_machine_, and a NULL disconnected state.
The race appears to be as follows: the collector first receives the Redis disconnection and enqueues a resource update message on the state_machine_ queue, and only afterwards receives the session removal. The session removal is processed immediately, so by the time the queued message triggers its callback, the state machine has already been removed.
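
The ordering described above can be replayed in isolation. The following is a minimal sketch, not Contrail code and not a proposed patch: FakeStateMachine, DeferredQueue and ReplayDisconnectThenRemove are hypothetical names introduced only for illustration. It assumes the queued callback could hold a weak reference to the state machine and check it before use, which is one possible way to make a late callback harmless.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for the per-session state machine (not Contrail code).
struct FakeStateMachine {
    std::map<std::string, int> event_stats;  // event name -> enqueue count
    void UpdateEventStats(const std::string &name) { ++event_stats[name]; }
};

// Hypothetical deferred work queue: callbacks run later, possibly after
// other events have already been handled.
class DeferredQueue {
public:
    void Enqueue(std::function<void()> fn) {
        assert(fn != nullptr);  // reject empty callbacks
        items_.push_back(std::move(fn));
    }
    // Runs every queued callback and returns how many were run.
    std::size_t Drain() {
        std::size_t count = 0;
        while (!items_.empty()) {
            std::function<void()> fn = std::move(items_.front());
            items_.pop_front();
            fn();
            ++count;
        }
        return count;
    }
private:
    std::deque<std::function<void()> > items_;
};

// Replays: Redis disconnect (update queued), optional immediate session
// removal, then the queue drain. Returns the number of resource updates
// that were applied to a state machine that was still alive.
int ReplayDisconnectThenRemove(bool remove_session_before_drain) {
    std::shared_ptr<FakeStateMachine> sm = std::make_shared<FakeStateMachine>();
    DeferredQueue queue;
    int applied = 0;

    // 1. Redis disconnect: the resource update is queued, not run yet.
    std::weak_ptr<FakeStateMachine> weak_sm = sm;
    queue.Enqueue([weak_sm, &applied]() {
        // Guard: only touch the state machine if it still exists.
        if (std::shared_ptr<FakeStateMachine> live = weak_sm.lock()) {
            live->UpdateEventStats("ssm::EvResourceUpdate");
            ++applied;
        }
    });

    // 2. Session removal is processed immediately, before the queue runs.
    if (remove_session_before_drain) {
        sm.reset();
    }

    // 3. The queued callback finally runs.
    queue.Drain();
    return applied;
}
```

With the removal in step 2, the guarded callback skips the update instead of dereferencing a destroyed object; without the removal, the update is applied once. A raw pointer captured in place of the weak reference would reproduce the use-after-free pattern described in this comment.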