contrail-collector crash immediately after provisioning

Bug #1755649 reported by Madhava Jayamani
This bug affects 2 people
Affects             Status                    Importance  Assigned to   Milestone
Juniper Openstack   Status tracked in Trunk
  R3.2              Fix Committed             High        Zhiqiang Cui
  R4.0              Fix Committed             High        Zhiqiang Cui
  R4.1              Fix Committed             High        Zhiqiang Cui
  R5.0              Fix Committed             High        Zhiqiang Cui
  Trunk             Fix Committed             High        Zhiqiang Cui

Bug Description

contrail-collector crash immediately after provisioning.

root@server3:/var/crashes# gdb vizd core.contrail-collec.24997.server3.1520989531
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.3) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from vizd...done.

warning: core file may not match specified executable file.
[New LWP 24997]
[New LWP 25026]
[New LWP 25033]
[New LWP 25036]
[New LWP 25031]
[New LWP 25030]
[New LWP 25034]
[New LWP 25035]
[New LWP 25028]
[New LWP 25032]
[New LWP 25027]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/contrail-collector'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 size (this=0x1ae7fd0) at /usr/include/c++/4.8/bits/basic_string.h:716
716 { return _M_rep()->_M_length; }
(gdb) bt
#0 size (this=0x1ae7fd0) at /usr/include/c++/4.8/bits/basic_string.h:716
#1 compare (__str=..., this=0x1ae7fd0) at /usr/include/c++/4.8/bits/basic_string.h:2227
#2 operator< <char, std::char_traits<char>, std::allocator<char> > (__rhs=..., __lhs=<error reading variable: Cannot access memory at address 0x55656372756f734d>)
    at /usr/include/c++/4.8/bits/basic_string.h:2571
#3 operator() (this=<optimized out>, __y=..., __x=<error reading variable: Cannot access memory at address 0x55656372756f734d>) at /usr/include/c++/4.8/bits/stl_function.h:235
#4 _M_lower_bound (this=0x1ae7b10, __k="ssm::EvResourceUpdate", __y=<optimized out>, __x=0x1ae7fb0) at /usr/include/c++/4.8/bits/stl_tree.h:1141
#5 std::_Rb_tree<std::string, std::pair<std::string const, void*>, std::_Select1st<std::pair<std::string const, void*> >, std::less<std::string>, std::allocator<std::pair<std::string const, void*> > >::find (this=this@entry=0x1ae7b10, __k="ssm::EvResourceUpdate") at /usr/include/c++/4.8/bits/stl_tree.h:1792
#6 0x00000000007b02ef in find (__x="ssm::EvResourceUpdate", this=this@entry=0x1ae7b10) at /usr/include/c++/4.8/bits/stl_map.h:822
#7 find (x="ssm::EvResourceUpdate", this=this@entry=0x1ae7b10) at /usr/include/boost/ptr_container/ptr_map_adapter.hpp:278
#8 SandeshEventStatistics::Update (this=this@entry=0x1ae7b10, event_name="ssm::EvResourceUpdate", enqueue=enqueue@entry=true, fail=fail@entry=false)
    at tools/sandesh/library/cpp/sandesh_statistics.cc:269
#9 0x000000000079a46c in SandeshStateMachine::UpdateEventStats (this=this@entry=0x1ae7820, event=..., enqueue=enqueue@entry=true, fail=fail@entry=false)
    at tools/sandesh/library/cpp/sandesh_state_machine.cc:783
#10 0x00000000007a4025 in UpdateEventEnqueue (event=..., this=0x1ae7820) at tools/sandesh/library/cpp/sandesh_state_machine.cc:764
#11 SandeshStateMachine::Enqueue<ssm::EvResourceUpdate> (this=0x1ae7820, event=...) at tools/sandesh/library/cpp/sandesh_state_machine.cc:853
#12 0x000000000079a93a in SandeshStateMachine::ResourceUpdate (this=<optimized out>, rsc=rsc@entry=false) at tools/sandesh/library/cpp/sandesh_state_machine.cc:734
#13 0x00000000005f584e in Collector::RedisUpdate (this=0x1a56770, rsc=rsc@entry=false) at controller/src/analytics/collector.cc:127
#14 0x000000000066e113 in RedisUpdate (rsc=false, this=0x7ffd816cf290) at controller/src/analytics/viz_collector.h:78
#15 OpServerProxy::OpServerImpl::ToOpsConnDown (this=0x1a4edd0) at controller/src/analytics/OpServerProxy.cc:345
#16 0x000000000060b4c6 in operator() (this=0x1a54718) at /usr/include/boost/function/function_template.hpp:767
#17 RedisAsyncConnection::RAC_DisconnectCallbackProcess (this=0x1a54620, c=<optimized out>, status=<optimized out>) at controller/src/analytics/redis_connection.cc:163
#18 0x0000000000609b0d in operator() (a1=-1, a0=<optimized out>, this=0x7ffd816ce430) at /usr/include/boost/function/function_template.hpp:767
#19 RedisAsyncConnection::RAC_DisconnectCallback (c=0x1a552e0, status=-1) at controller/src/analytics/redis_connection.cc:186
#20 0x000000000082461b in __redisAsyncFree (ac=0x1a552e0) at build/third_party/hiredis/src/async.c:261
#21 0x00000000008262f9 in redisBoostClient::handle_read (this=0x1a54a80, ec=...) at build/third_party/hiredis/hiredis-boostasio-adapter/boostasio.cpp:62
#22 0x00000000008269c4 in call<boost::shared_ptr<redisBoostClient>, boost::system::error_code> (b1=<synthetic pointer>, u=..., this=<optimized out>) at /usr/include/boost/bind/mem_fn_template.hpp:156
#23 operator()<boost::shared_ptr<redisBoostClient> > (a1=..., u=..., this=<optimized out>) at /usr/include/boost/bind/mem_fn_template.hpp:171
#24 operator()<boost::_mfi::mf1<void, redisBoostClient, boost::system::error_code>, boost::_bi::list2<const boost::system::error_code&, long unsigned int const&> > (a=<synthetic pointer>, f=...,
    this=<optimized out>) at /usr/include/boost/bind/bind.hpp:313
#25 operator()<boost::system::error_code, long unsigned int> (a2=<optimized out>, a1=..., this=<optimized out>) at /usr/include/boost/bind/bind_template.hpp:102
#26 operator() (this=<optimized out>) at /usr/include/boost/asio/detail/bind_handler.hpp:127
#27 asio_handler_invoke<boost::asio::detail::binder2<boost::_bi::bind_t<void, boost::_mfi::mf1<void, redisBoostClient, boost::system::error_code>, boost::_bi::list2<boost::_bi::value<boost::shared_ptr<redisBoostClient> >, boost::arg<1> (*)()> >, boost::system::error_code, unsigned long> > (function=...) at /usr/include/boost/asio/handler_invoke_hook.hpp:64
#28 invoke<boost::asio::detail::binder2<boost::_bi::bind_t<void, boost::_mfi::mf1<void, redisBoostClient, boost::system::error_code>, boost::_bi::list2<boost::_bi::value<boost::shared_ptr<redisBoostClient> >, boost::arg<1> (*)()> >, boost::system::error_code, unsigned long>, boost::_bi::bind_t<void, boost::_mfi::mf1<void, redisBoostClient, boost::system::error_code>, boost::_bi::list2<boost::_bi::value<boost::shared_ptr<redisBoostClient> >, boost::arg<1> (*)()> > > (context=..., function=...) at /usr/include/boost/asio/detail/handler_invoke_helpers.hpp:37
#29 boost::asio::detail::reactive_null_buffers_op<boost::_bi::bind_t<void, boost::_mfi::mf1<void, redisBoostClient, boost::system::error_code>, boost::_bi::list2<boost::_bi::value<boost::shared_ptr<redisBoostClient> >, boost::arg<1> (*)()> > >::do_complete (owner=<optimized out>, base=<optimized out>) at /usr/include/boost/asio/detail/reactive_null_buffers_op.hpp:75
#30 0x00000000006bd6ff in complete (bytes_transferred=0, ec=..., owner=..., this=<optimized out>) at /usr/include/boost/asio/detail/task_io_service_operation.hpp:37
#31 boost::asio::detail::epoll_reactor::descriptor_state::do_complete (owner=0x1a3e170, base=0x1a54ad0, ec=..., bytes_transferred=<optimized out>)
---Type <return> to continue, or q <return> to quit---q
 at /usr/include/boost/asio/detail/impl/epoll_reactor.Quit
(gdb)

##################
Setup details:
##################

Multi-node cluster with 3 control nodes and 2 compute nodes

Contrail image used for installation:
-rw-r--r-- 1 root root 1135603496 Mar 13 19:26 contrail-install-packages_3.2.9.0-72~mitaka_all.deb
Server Manager image used for installation:
-rw-r--r-- 1 root root 197316126 Mar 13 19:27 contrail-server-manager-installer_3.2.9.0-72~ubuntu-14-04mitaka_all.deb

root@servermanager:~/sm_files# server-manager show cluster -d
{
    "cluster": [
        {
            "base_image_id": "",
            "email": "",
            "id": "test-cluster",
            "package_image_id": "",
            "parameters": {
                "domain": "englab.juniper.net",
                "provision": {
                    "contrail": {
                        "database": {
                            "minimum_diskGB": 32
                        },
                        "enable_lbaas": true,
                        "kernel_upgrade": true,
                        "kernel_version": "3.13.0-142",
                        "xmpp_auth_enable": "true",
                        "xmpp_dns_auth_enable": "true"
                    },
                    "openstack": {
                        "ceilometer": {
                            "mongo": "*****",
                            "password": "*****"
                        },
                        "cinder": {
                            "password": "*****"
                        },
                        "glance": {
                            "password": "*****"
                        },
                        "ha": {
                            "external_vip": "10.0.0.200",
                            "external_virtual_router_id": 102,
                            "internal_vip": "10.10.0.200",
                            "internal_virtual_router_id": 103
                        },
                        "heat": {
                            "encryption_key": "*****",
                            "password": "*****"
                        },
                        "horizon": {
                            "password": "*****"
                        },
                        "keystone": {
                            "admin_password": "*****",
                            "admin_token": "*****",
                            "version": "v2.0"
                        },
                        "mysql": {
                            "root_password": "*****",
                            "service_password": "*****"
                        },
                        "neutron": {
                            "password": "*****"
                        },
                        "nova": {
                            "password": "*****"
                        },
                        "openstack_manage_amqp": "true",
                        "swift": {
                            "password": "*****"
                        }
                    }
                },
                "storage_fsid": "7cfe5380-f590-40a7-ab04-5ead1b14e12a",
                "storage_virsh_uuid": "4f399257-b50a-4dcb-bcaf-d272e76df0b7",
                "uuid": "b09f9027-608d-49d8-b36a-9a8521efe6b3"
            },
            "provision_role_sequence": "{'completed': [(u'server3', 'keepalived', '2018_03_14__00_15_04'), (u'server2', 'keepalived', '2018_03_14__00_15_46'), (u'server1', 'keepalived', '2018_03_14__00_25_41'), (u'server2', 'haproxy', '2018_03_14__00_26_02'), (u'server3', 'haproxy', '2018_03_14__00_26_02'), (u'server1', 'haproxy', '2018_03_14__00_26_08'), (u'server3', 'database', '2018_03_14__00_27_11'), (u'server2', 'database', '2018_03_14__00_27_12'), (u'server1', 'database', '2018_03_14__00_28_39'), (u'server3', 'openstack', '2018_03_14__00_45_10'), (u'server2', 'openstack', '2018_03_14__00_45_28'), (u'server1', 'openstack', '2018_03_14__00_52_05'), (u'server1', 'pre_exec_vnc_galera', '2018_03_14__00_54_22'), (u'server3', 'pre_exec_vnc_galera', '2018_03_14__00_55_44'), (u'server2', 'pre_exec_vnc_galera', '2018_03_14__00_57_35'), (u'server1', 'post_exec_vnc_galera', '2018_03_14__00_58_21'), (u'server3', 'post_exec_vnc_galera', '2018_03_14__00_58_59'), (u'server2', 'post_exec_vnc_galera', '2018_03_14__00_59_46'), (u'server3', 'config', '2018_03_14__01_02_26'), (u'server2', 'config', '2018_03_14__01_02_27'), (u'server1', 'config', '2018_03_14__01_02_42'), (u'server2', 'control', '2018_03_14__01_03_39'), (u'server3', 'control', '2018_03_14__01_03_41'), (u'server1', 'control', '2018_03_14__01_04_21'), (u'server2', 'collector', '2018_03_14__01_05_36'), (u'server3', 'collector', '2018_03_14__01_05_42'), (u'server1', 'collector', '2018_03_14__01_06_42'), (u'server2', 'webui', '2018_03_14__01_07_35'), (u'server3', 'webui', '2018_03_14__01_07_38'), (u'server1', 'webui', '2018_03_14__01_10_23'), (u'server3', 'post_provision', '2018_03_14__01_10_59'), (u'server2', 'post_provision', '2018_03_14__01_11_31'), (u'server1', 'post_provision', '2018_03_14__01_11_36'), (u'server5', 'compute', '2018_03_14__01_18_34'), (u'server5', 'post_provision', '2018_03_14__01_18_35'), (u'server4', 'compute', '2018_03_14__01_27_27'), (u'server4', 'post_provision', '2018_03_14__01_27_28')], 
'steps': []}",
            "provisioned_id": null
        }
    ]
}
root@servermanager:~/sm_files#

Logs and Core file location:

/auto/cs-shared/bugs/1755649

description: updated
Sachin Bansal (sbansal)
Changed in juniperopenstack:
assignee: nobody → Sundaresan Rajangam (srajanga)
Jeba Paulaiyan (jebap)
tags: added: analytics

Zhiqiang Cui (zcui) wrote :

We are facing a race condition. According to the core dump, by the time the state machine queue processes the resource update in the collector, the generator's vsession is NULL, state_machine_ is NULL, and the disconnected state is NULL.

The race appears to go as follows: the collector first receives the Redis disconnection and enqueues a resource update message on the state_machine_ queue, and only then receives the session removal. The session removal is processed immediately, so by the time the queued message triggers its callback, the state machine has already been removed.

Zhiqiang Cui (zcui) wrote :

After more investigation, I can confirm the root cause. The current problems are:
1. state_machine_ is a scoped_ptr, and we use its internal member deleted_ to block certain actions after state_machine_ has been deleted. But deleted_ is set to true only when the destructor is called. This is a bug: once state_machine_ has been released, any of its internal members can change.

2. When state_machine_ is destroyed, the actions related to state_machine_ are not all stopped.

We need to find a way to stop all of these actions without relying on deleted_ to block them.
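
Point 1 above can be illustrated with a minimal sketch. It is not the Sandesh code: StateMachineSketch, GeneratorHandle, AliveFlag and TryEnqueue are hypothetical names, and std::shared_ptr / std::atomic are assumed in place of whatever the real code uses. The only idea shown is that a liveness flag has to be stored outside the object it describes, because a member such as deleted_ cannot be read safely once the object has been freed. The sketch does not close the window between the check and the use; that still needs shared ownership or a lock held across both.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for the real state machine; not the Sandesh class.
class StateMachineSketch {
public:
    // The liveness flag is shared state that outlives the object itself.
    using AliveFlag = std::shared_ptr<std::atomic<bool>>;

    StateMachineSketch() : alive_(std::make_shared<std::atomic<bool>>(true)) {}
    // The destructor flips the external flag; holders of the flag can still read it.
    ~StateMachineSketch() { alive_->store(false); }

    AliveFlag alive_flag() const { return alive_; }
    void Enqueue(const std::string &event) { last_event_ = event; }
    const std::string &last_event() const { return last_event_; }

private:
    AliveFlag alive_;
    std::string last_event_;
};

// Hypothetical caller-side handle: it consults the external flag before
// dereferencing the raw pointer, instead of reading a member of the object.
struct GeneratorHandle {
    StateMachineSketch *sm;
    StateMachineSketch::AliveFlag alive;

    bool TryEnqueue(const std::string &event) {
        if (!alive->load()) return false;  // object already destroyed, do not touch sm
        sm->Enqueue(event);
        return true;
    }
};
```

A caller would take a copy of alive_flag() while the state machine is known to be alive and keep that copy next to the raw pointer; after the owner destroys the state machine, TryEnqueue returns false instead of touching freed memory.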

OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.2

Review in progress for https://review.opencontrail.org/41242
Submitter: Zhiqiang Cui (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/41242
Committed: http://github.com/Juniper/contrail-sandesh/commit/bbfbb6a5978f5167702cad8ae7a49c80c6630a35
Submitter: Zuul (<email address hidden>)
Branch: R3.2

commit bbfbb6a5978f5167702cad8ae7a49c80c6630a35
Author: zcui <email address hidden>
Date: Fri Mar 30 08:54:16 2018 -0700

contrail-collector crash immediately after provisioning

root cause:
sandesh_connection owns state_machine_ as a scoped_ptr, while the
generator accesses it through state_machine_.get(). sandesh_connection
handles the connection close message on one core while the generator
handles the Redis message on another core, which leads to a race
condition. We use a mutex for protection, but the mutex is itself a
member of the state_machine_ structure, so it offers no protection
against the state_machine_ destructor.
Solution:
Change scoped_ptr to shared_ptr.

Change-Id: I1756f8dc0cdd1e7b705af9f1650c6bb8b118e212
Partial-Bug: 1755649
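
The ownership change described in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the committed patch: ConnectionSketch, GeneratorSketch and StateMachineSketch are hypothetical names, and std::shared_ptr is used here although the real code may use the Boost equivalent. The point is that a user holding its own shared_ptr copy, rather than a raw pointer obtained from get(), keeps the object alive until it is done with it.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-ins; names are illustrative, not the Sandesh classes.
struct StateMachineSketch {
    // The bool pointer only lets the example observe when destruction happens.
    explicit StateMachineSketch(bool *destroyed) : destroyed_(destroyed) {}
    ~StateMachineSketch() { *destroyed_ = true; }
    void Enqueue(const std::string &event) { last_event = event; }
    std::string last_event;
private:
    bool *destroyed_;
};

// The connection allocates the state machine and holds one reference.
struct ConnectionSketch {
    explicit ConnectionSketch(bool *destroyed)
        : state_machine(std::make_shared<StateMachineSketch>(destroyed)) {}
    std::shared_ptr<StateMachineSketch> state_machine;
};

// The generator holds its own reference instead of a raw pointer from get().
struct GeneratorSketch {
    std::shared_ptr<StateMachineSketch> state_machine;
    void ResourceUpdate() { state_machine->Enqueue("ssm::EvResourceUpdate"); }
};
```

With this layout, destroying the connection only drops one reference; the state machine is freed when the generator releases its copy as well. Shared ownership addresses the lifetime problem only, so concurrent access to the object's members still needs its own synchronization.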

OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.2

Review in progress for https://review.opencontrail.org/42645
Submitter: Zhiqiang Cui (<email address hidden>)

Jim Reilly (jpreilly)
information type: Proprietary → Private
tags: added: att-aic-contrail
information type: Private → Public

OpenContrail Admin (ci-admin-f) wrote :

Review in progress for https://review.opencontrail.org/43507
Submitter: Zhiqiang Cui (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R4.0

Review in progress for https://review.opencontrail.org/43549
Submitter: Zhiqiang Cui (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.2

Review in progress for https://review.opencontrail.org/43507
Submitter: Zhiqiang Cui (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R4.1

Review in progress for https://review.opencontrail.org/43761
Submitter: Zhiqiang Cui (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/43762
Submitter: Zhiqiang Cui (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R5.0

Review in progress for https://review.opencontrail.org/43763
Submitter: Zhiqiang Cui (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/43762
Committed: http://github.com/Juniper/contrail-common/commit/b8a8de2a2ef2db849d96f8bc2cd983e4275b6b53
Submitter: Zuul v3 CI (<email address hidden>)
Branch: master

commit b8a8de2a2ef2db849d96f8bc2cd983e4275b6b53
Author: zcui <email address hidden>
Date: Tue Jun 12 16:18:06 2018 -0700

contrail-collector crash immediately after provisioning

root cause:
Race condition:
state_machine_ is
(1) allocated by sandesh_connection.
(2) used by the generator.
When the problem occurs, the generator receives a resource update
message, enqueues the resource update to state_machine_, and
updates the stats immediately. This update sometimes has to acquire
a mutex, which can make the thread yield the CPU. Call this
thread 1.
At the same time, a connection close is triggered and the destructor
runs. The destructor calls terminate, and all memory related to this
connection is released. Call this thread 2.
When thread 2 finishes and thread 1 resumes, the crash happens.

Solution:
The state machine design should account for this problem, so
state machine destruction is split into two steps:
(1) call terminate to free the memory allocated by its sub-structures.
(2) start a timer to free the state machine itself.
Between step 1 and step 2, deleted_ is used to check whether the
state machine can still be used.
We add a shutdown function to the stats structure to pass this state on.

Change-Id: I15db0a1c1a6999758ed5cd2400d5d3ff8ab85232
Closes-Bug: 1755649
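
The shutdown step for the stats structure can be sketched as below. This is a minimal illustration under assumptions, not the merged change: EventStatsSketch, Update, Shutdown and Count are hypothetical names and signatures, and std::mutex stands in for whatever lock the real code uses. It models only step 1 of the two-step teardown, where owned data is released and later updates are refused while the object itself stays allocated; the timer that frees the object afterwards is not modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-in for an event statistics container; not the Sandesh class.
class EventStatsSketch {
public:
    // Returns false and does nothing once Shutdown() has run.
    bool Update(const std::string &event_name) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (deleted_) return false;
        ++counts_[event_name];
        return true;
    }

    // Step 1 of teardown: release owned data and refuse later updates.
    // The object itself stays allocated until its owner frees it later.
    void Shutdown() {
        std::lock_guard<std::mutex> lock(mutex_);
        deleted_ = true;
        counts_.clear();
    }

    // Number of recorded updates for an event, or 0 if none were recorded.
    std::uint64_t Count(const std::string &event_name) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = counts_.find(event_name);
        return it == counts_.end() ? 0 : it->second;
    }

private:
    mutable std::mutex mutex_;
    bool deleted_ = false;
    std::map<std::string, std::uint64_t> counts_;
};
```

Because the flag is checked under the same mutex that protects the map, an update that arrives after Shutdown() never reaches the cleared container. This is only safe while the object is still allocated, which is what the deferred free in step 2 is meant to guarantee.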

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/43761
Committed: http://github.com/Juniper/contrail-sandesh/commit/3be6ccb47163040e34fce049dd6d7e21e4f9dea9
Submitter: Zuul (<email address hidden>)
Branch: R4.1

commit 3be6ccb47163040e34fce049dd6d7e21e4f9dea9
Author: zcui <email address hidden>
Date: Mon Jun 4 14:04:43 2018 -0700

contrail-collector crash immediately after provisioning

root cause:
Race condition:
state_machine_ is
(1) allocated by sandesh_connection.
(2) used by the generator.
When the problem occurs, the generator receives a resource update
message, enqueues the resource update to state_machine_, and
updates the stats immediately. This update sometimes has to acquire
a mutex, which can make the thread yield the CPU. Call this
thread 1.
At the same time, a connection close is triggered and the destructor
runs. The destructor calls terminate, and all memory related to this
connection is released. Call this thread 2.
When thread 2 finishes and thread 1 resumes, the crash happens.

Solution:
The state machine design should account for this problem, so
state machine destruction is split into two steps:
(1) call terminate to free the memory allocated by its sub-structures.
(2) start a timer to free the state machine itself.
Between step 1 and step 2, deleted_ is used to check whether the
state machine can still be used.
We add a shutdown function to the stats structure to pass this state on.

Closes-Bug: 1755649

Change-Id: I599461f0a37adc21d2b68a5ca20d66ccaf4f6e51

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/43549
Committed: http://github.com/Juniper/contrail-sandesh/commit/8e9f02f85446e92236031c2f7bce2b22fca6e977
Submitter: Zuul (<email address hidden>)
Branch: R4.0

commit 8e9f02f85446e92236031c2f7bce2b22fca6e977
Author: zcui <email address hidden>
Date: Mon Jun 4 14:04:43 2018 -0700

contrail-collector crash immediately after provisioning

root cause:
Race condition:
state_machine_ is
(1) allocated by sandesh_connection.
(2) used by the generator.
When the problem occurs, the generator receives a resource update
message, enqueues the resource update to state_machine_, and
updates the stats immediately. This update sometimes has to acquire
a mutex, which can make the thread yield the CPU. Call this
thread 1.
At the same time, a connection close is triggered and the destructor
runs. The destructor calls terminate, and all memory related to this
connection is released. Call this thread 2.
When thread 2 finishes and thread 1 resumes, the crash happens.

Solution:
The state machine design should account for this problem, so
state machine destruction is split into two steps:
(1) call terminate to free the memory allocated by its sub-structures.
(2) start a timer to free the state machine itself.
Between step 1 and step 2, deleted_ is used to check whether the
state machine can still be used.
We add a shutdown function to the stats structure to pass this state on.

Closes-Bug: 1755649

Change-Id: I599461f0a37adc21d2b68a5ca20d66ccaf4f6e51

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/43507
Committed: http://github.com/Juniper/contrail-sandesh/commit/abd0158c44c4cbec214870a5af5ce0058a2e6407
Submitter: Zuul (<email address hidden>)
Branch: R3.2

commit abd0158c44c4cbec214870a5af5ce0058a2e6407
Author: zcui <email address hidden>
Date: Mon Jun 4 14:04:43 2018 -0700

contrail-collector crash immediately after provisioning

root cause:
Race condition:
state_machine_ is
(1) allocated by sandesh_connection.
(2) used by the generator.
When the problem occurs, the generator receives a resource update
message, enqueues the resource update to state_machine_, and
updates the stats immediately. This update sometimes has to acquire
a mutex, which can make the thread yield the CPU. Call this
thread 1.
At the same time, a connection close is triggered and the destructor
runs. The destructor calls terminate, and all memory related to this
connection is released. Call this thread 2.
When thread 2 finishes and thread 1 resumes, the crash happens.

Solution:
The state machine design should account for this problem, so
state machine destruction is split into two steps:
(1) call terminate to free the memory allocated by its sub-structures.
(2) start a timer to free the state machine itself.
Between step 1 and step 2, deleted_ is used to check whether the
state machine can still be used.
We add a shutdown function to the stats structure to pass this state on.

Closes-Bug: 1755649

Change-Id: I599461f0a37adc21d2b68a5ca20d66ccaf4f6e51

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/43763
Committed: http://github.com/Juniper/contrail-common/commit/76997a86307ff70469eac51910fe18ec544fbfec
Submitter: Zuul v3 CI (<email address hidden>)
Branch: R5.0

commit 76997a86307ff70469eac51910fe18ec544fbfec
Author: zcui <email address hidden>
Date: Tue Jun 12 16:18:06 2018 -0700

contrail-collector crash immediately after provisioning

root cause:
Race condition:
state_machine_ is
(1) allocated by sandesh_connection.
(2) used by the generator.
When the problem occurs, the generator receives a resource update
message, enqueues the resource update to state_machine_, and
updates the stats immediately. This update sometimes has to acquire
a mutex, which can make the thread yield the CPU. Call this
thread 1.
At the same time, a connection close is triggered and the destructor
runs. The destructor calls terminate, and all memory related to this
connection is released. Call this thread 2.
When thread 2 finishes and thread 1 resumes, the crash happens.

Solution:
The state machine design should account for this problem, so
state machine destruction is split into two steps:
(1) call terminate to free the memory allocated by its sub-structures.
(2) start a timer to free the state machine itself.
Between step 1 and step 2, deleted_ is used to check whether the
state machine can still be used.
We add a shutdown function to the stats structure to pass this state on.

Change-Id: I15db0a1c1a6999758ed5cd2400d5d3ff8ab85232
Closes-Bug: 1755649
(cherry picked from commit b8a8de2a2ef2db849d96f8bc2cd983e4275b6b53)
