[Ubuntu havana R1.10 Build 20] Collector Introspect port 8089 is not responding

Bug #1359183 reported by chhandak
This bug affects 2 people
Affects                                       Status          Importance   Assigned to   Milestone
Juniper Openstack (status tracked in Trunk)
  R1.1                                        Fix Committed   High         chhandak
  Trunk                                       Fix Committed   High         chhandak

Bug Description

With Build 20, the op-server introspect port is not responding.
Observed 2 threads in a locked state.

Collector.log
---------------------
e-Nodemgr:0 Name: DatabaseUsageInfo Node: database_usage_stats Bad Stat type struct or tags
2014-08-20 Wed 02:46:45:414.854 PDT nodeg21 [Thread 140668398003968, Pid 10127]: HttpSendXML Response Mismatch . Expected NodeStatusUVE , Got SandeshUVECacheResp
2014-08-20 Wed 02:46:50:755.787 PDT nodeg21 [Thread 140668398003968, Pid 10127]: HttpSendXML Response Mismatch . Expected NodeStatusUVE , Got SandeshUVECacheResp
2014-08-20 Wed 02:46:59:414.097 PDT nodeg21 [Thread 140668398003968, Pid 10127]: HttpSendXML Response Mismatch . Expected NodeStatusUVE , Got SandeshUVECacheResp
2014-08-20 Wed 02:47:01:299.592 PDT nodeg21 [Thread 140668402202368, Pid 10127]: handle_uve_publish Message: DatabaseUsage Source: nodeg33:Database:Contrail-Database-Nodemgr:0 Name: DatabaseUsageInfo Node: database_usage_stats Bad Stat type struct or tags
2014-08-20 Wed 02:47:02:003.598 PDT nodeg21 [Thread 140668398003968, Pid 10127]: handle_uve_publish Message: DatabaseUsage Source: nodec58:Database:Contrail-Database-Nodemgr:0 Name: DatabaseUsageInfo Node: database_usage_stats Bad Stat type struct or tags
2014-08-20 Wed 02:47:02:167.250 PDT nodeg21 [Thread 140668402202368, Pid 10127]: handle_uve_publish Message: DatabaseUsage Source: nodeg21:Database:Contrail-Database-Nodemgr:0 Name: DatabaseUsageInfo Node: database_usage_stats Bad Stat type struct or tags

(gdb) info threads
  Id Target Id Frame
  5 Thread 0x7fefeaa51700 (LWP 10947) "contrail-collec" 0x00007fefeb9a5ee9 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
  4 Thread 0x7fefea650700 (LWP 10948) "contrail-collec" 0x00007fefeb9a5ee9 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
  3 Thread 0x7fefea24f700 (LWP 10949) "contrail-collec" 0x00007fefeb9a5ee9 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
  2 Thread 0x7fefe9e4e700 (LWP 10950) "contrail-collec" 0x00007fefecde789c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
* 1 Thread 0x7fefedfee740 (LWP 10127) "contrail-collec" 0x00007fefecde789c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0 0x00007fefecde789c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007fefecde3065 in _L_lock_858 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007fefecde2eba in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00000000005cfa1e in lock (this=0xa4c560) at build/include/tbb/mutex.h:164
#4 acquire (mutex=..., this=0x7fff14e9e8e0) at build/include/tbb/mutex.h:105
#5 scoped_lock (mutex=..., this=0x7fff14e9e8e0) at build/include/tbb/mutex.h:91
#6 HttpSession::AcceptSession (this=0x1e5e250) at controller/src/http/http_session.cc:245
#7 0x00000000005ce0c7 in HttpServer::AcceptSession (this=<optimized out>, session=<optimized out>) at controller/src/http/http_server.cc:38
#8 0x000000000052abf8 in TcpServer::AcceptHandlerInternal (this=0x1db9190, server=..., error=...) at controller/src/io/tcp_server.cc:316
#9 0x00000000005312f3 in operator() (a1=..., p=<optimized out>, this=0x7fff14e9f0c0, a2=...) at build/include/boost/bind/mem_fn_template.hpp:280
#10 operator()<boost::_mfi::mf2<void, TcpServer, boost::intrusive_ptr<TcpServer>, const boost::system::error_code&>, boost::_bi::list1<const boost::system::error_code&> > (a=<synthetic pointer>, f=..., this=0x7fff14e9f0d0) at build/include/boost/bind/bind.hpp:392
#11 operator()<boost::system::error_code> (a1=..., this=0x7fff14e9f0c0) at build/include/boost/bind/bind_template.hpp:47
#12 operator() (this=0x7fff14e9f0c0) at build/include/boost/asio/detail/bind_handler.hpp:46
#13 asio_handler_invoke<boost::asio::detail::binder1<boost::_bi::bind_t<void, boost::_mfi::mf2<void, TcpServer, boost::intrusive_ptr<TcpServer>, boost::system::error_code const&>, boost::_bi::list3<boost::_bi::value<TcpServer*>, boost::_bi::value<boost::intrusive_ptr<TcpServer> >, boost::arg<1> (*)()> >, boost::system::error_code> > (function=...) at build/include/boost/asio/handler_invoke_hook.hpp:64
#14 boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::binder1<boost::_bi::bind_t<void, boost::_mfi::mf2<void, TcpServer, boost::intrusive_ptr<TcpServer>, boost::system::error_code const&>, boost::_bi::list3<boost::_bi::value<TcpServer*>, boost::_bi::value<boost::intrusive_ptr<TcpServer> >, boost::arg<1> (*)()> >, boost::system::error_code>, boost::_bi::bind_t<void, boost::_mfi::mf2<void, TcpServer, boost::intrusive_ptr<TcpServer>, boost::system::error_code const&>, boost::_bi::list3<boost::_bi::value<TcpServer*>, boost::_bi::value<boost::intrusive_ptr<TcpServer> >, boost::arg<1> (*)()> > > (function=..., context=...)
    at build/include/boost/asio/detail/handler_invoke_helpers.hpp:39
#15 0x00000000005313c4 in boost::asio::detail::reactive_socket_accept_op<boost::asio::basic_socket<boost::asio::ip::tcp, boost::asio::stream_socket_service<boost::asio::ip::tcp> >, boost::asio::ip::tcp, boost::_bi::bind_t<void, boost::_mfi::mf2<void, TcpServer, boost::intrusive_ptr<TcpServer>, boost::system::error_code const&>, boost::_bi::list3<boost::_bi::value<TcpServer*>, boost::_bi::value<boost::intrusive_ptr<TcpServer> >, boost::arg<1> (*)()> > >::do_complete (owner=0x1d9e380,
    base=0x1da95a0) at build/include/boost/asio/detail/reactive_socket_accept_op.hpp:123
#16 0x00000000004b8091 in complete (owner=..., this=0x1db93d0, bytes_transferred=1, ec=...) at build/include/boost/asio/detail/task_io_service_operation.hpp:37
#17 do_run_one (ec=..., private_op_queue=..., this_thread=..., lock=..., this=<optimized out>) at build/include/boost/asio/detail/impl/task_io_service.ipp:366
#18 boost::asio::detail::task_io_service::run (this=0x1d9e380, ec=...) at build/include/boost/asio/detail/impl/task_io_service.ipp:146
#19 0x00000000005239b4 in run (ec=..., this=<optimized out>) at build/include/boost/asio/impl/io_service.ipp:68
#20 EventManager::Run (this=0x1d8aa40) at controller/src/io/event_manager.cc:32
#21 0x000000000041c849 in main (argc=<optimized out>, argv=<optimized out>) at controller/src/analytics/main.cc:365
(gdb) th 2
Ambiguous command "th 2": thbreak, thread.
(gdb) thread 2
[Switching to thread 2 (Thread 0x7fefe9e4e700 (LWP 10950))]
#0 0x00007fefecde789c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0 0x00007fefecde789c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007fefecde3065 in _L_lock_858 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007fefecde2eba in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00000000005d06a2 in lock (this=0xa4c560) at build/include/tbb/mutex.h:164
#4 acquire (mutex=..., this=0x7fefe9e4d2b0) at build/include/tbb/mutex.h:105
#5 scoped_lock (mutex=..., this=0x7fefe9e4d2b0) at build/include/tbb/mutex.h:91
#6 HttpSession::OnSessionEvent (this=0x1dcee30, session=<optimized out>, event=TcpSession::CLOSE) at controller/src/http/http_session.cc:265
#7 0x00000000005389bd in operator() (a1=TcpSession::CLOSE, a0=0x1dcee30, this=0x1dceed8) at build/include/boost/function/function_template.hpp:763
#8 TcpSession::CloseInternal (this=0x1dcee30, callObserver=true) at controller/src/io/tcp_session.cc:225
#9 0x000000000053bbff in TcpSession::Send (this=<optimized out>, data=<optimized out>, size=4, sent=<optimized out>) at controller/src/io/tcp_session.cc:282
#10 0x0000000000572376 in HttpSession::SendSession (s=..., data=0x7fefe9e4d980 "61\r\n", size=4, sent=0x0) at controller/src/http/http_session.h:32
#11 0x0000000000572d26 in HttpSendXML (more=true, len=2494,
    buf=0x7fefe02c8ea0 "<NodeStatusUVE type=\"sandesh\"><data type=\"struct\" identifier=\"1\"><NodeStatus><name type=\"string\" identifier=\"1\" key=\"ObjectCollectorInfo\">nodeg21</name><process_status type=\"list\" identifier=\"4\" aggty"..., context=..., name=<optimized out>)
    at tools/sandesh/library/cpp/sandesh_http.cc:160
#12 SandeshHttp::Response (snh=0x7fefe0301200, context=...) at tools/sandesh/library/cpp/sandesh_http.cc:342
#13 0x000000000056b712 in SandeshUVE::Dispatch (this=0x7fefe0301200, sconn=<optimized out>) at tools/sandesh/library/cpp/sandesh.cc:566
#14 0x00000000004616f8 in Send (more=true, ctx=..., seqno=25, data=..., seq=<optimized out>) at build/production/analytics/analytics_types.cpp:2239
#15 SandeshUVEPerTypeMapImpl<NodeStatusUVE, NodeStatus>::SyncUVE (this=0xa49aa0, seqno=0, ctx=..., more=true) at build/include/sandesh/sandesh_uve.h:151
#16 0x00000000005ad148 in SandeshUVECacheReq::HandleRequest (this=0x7fefe0294040) at tools/sandesh/library/cpp/sandesh_uve.cc:62
#17 0x000000000056575a in Sandesh::ProcessRecv (rsnh=0x7fefe0294040) at tools/sandesh/library/cpp/sandesh.cc:393
#18 0x00000000005716ff in operator() (a0=0x7fefe0294040, this=0x7fefe9e4dc40) at build/include/boost/function/function_template.hpp:763
#19 QueueTaskRunner<SandeshRequest*, WorkQueue<SandeshRequest*> >::RunQueue (this=0x7fefe02362e0) at controller/src/base/queue_task.h:53
#20 0x00000000006b8700 in TaskImpl::execute (this=0x7fefe00c7fc0) at controller/src/base/task.cc:224
#21 0x00007fefed298ece in ?? () from /usr/lib/libtbb_debug.so.2
#22 0x00007fefed28fe0b in ?? () from /usr/lib/libtbb_debug.so.2
#23 0x00007fefed28e6f2 in ?? () from /usr/lib/libtbb_debug.so.2
#24 0x00007fefed2893ce in ?? () from /usr/lib/libtbb_debug.so.2
#25 0x00007fefed289270 in ?? () from /usr/lib/libtbb_debug.so.2
#26 0x00007fefecde0e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#27 0x00007fefeb9a9ccd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#28 0x0000000000000000 in ?? ()
(gdb) detach
Detaching from program: /root/vizd, process 10127

Setup Details
--------------------
host1 = 'root@10.204.217.61'
host2 = 'root@10.204.217.73'
host3 = 'root@10.204.217.98'
host4 = 'root@10.204.217.72'
host5 = 'root@10.204.217.57'

env.roledefs = {

    'all': [host1, host2, host3, host4, host5],
    'cfgm': [host1, host2],
    'openstack': [host2],
    'control': [host2, host3],
    'compute': [host4, host5],
    'collector': [host1],
    'webui': [host1],
    'database': [host1, host2, host3],
    'build': [host_build],

}

env.hostnames = {
    'all': ['nodeg21', 'nodeg33', 'nodec58', 'nodeg32', 'nodeg17']
}

Revision history for this message
chhandak (chhandak) wrote :

Gcore of the collector process, along with other logs, has been copied to the location below on mayamruga (bhushana/bhu@123):
/home/bhushana/Documents/technical/bugs/1359183
bhushana@mayamruga-E71:~/Documents/technical/bugs/1359183$ ls
api-0-zk.log contrail-analytics-api.log contrail-svc-monitor.log supervisord-analytics.log
api.log contrail-analytics-api-stdout.log contrail-webui.log supervisord-config.log
collector.log contrail-analytics-nodemgr-stderr.log contrail-webui-middleware.log supervisord_contrail_database.log
collector.log.1 contrail-analytics-nodemgr-stdout.log core.10127 supervisord-webui.log
collector.log.10 contrail-api-0.log discovery.log svc-monitor.err
collector.log.2 contrail-collector-stdout.log ifmap.log svc-monitor.log
collector.log.3 contrail-config-nodemgr-stderr.log ifmap-server-console.log svc-monitor.tmp
collector.log.4 contrail-config-nodemgr-stdout.log process_statecontrail-analytics.json svc-monitor-zk.log
collector.log.5 contrail-database-nodemgr-stderr.log process_statecontrail-config.json vnc_openstack.err
collector.log.6 contrail-database-nodemgr-stdout.log process_statecontrail-database.json vnc_openstack.err.1
collector.log.7 contrail-discovery-0.log query-engine.log vnc_openstack.err.2
collector.log.8 contrail-query-engine-stdout.log redis-webui.log vnc_openstack.tmp
collector.log.9 contrail-schema.log schema-zk.log
bhushana@mayamruga-E71:~/Documents/technical/bugs/1359183$ pwd

chhandak (chhandak)
Changed in juniperopenstack:
importance: Undecided → High
Revision history for this message
chhandak (chhandak) wrote :
summary: - [Ubuntu havana Build 20] Collector Introspect port 8089 is not
+ [Ubuntu havana R1.10 Build 20] Collector Introspect port 8089 is not
responding
Revision history for this message
Megh Bhatt (meghb) wrote :

Thread 4, executing HttpSession::SendSession(), tries to acquire HttpSession::mutex_ twice: once before calling HttpSession::Send(), and again on the TcpSession::Send() failure path, where CloseInternal() invokes OnSessionEvent(), which attempts to take the same mutex, so the thread deadlocks on itself.
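A minimal sketch of the locking pattern described above, not the actual Contrail sources: the class and method names mirror the backtrace, the bodies are purely illustrative, and std::mutex stands in for the non-recursive tbb::mutex that HttpSession uses.

// Minimal sketch of the re-entrant lock pattern above (illustrative only,
// not the Contrail code). Locking a non-recursive mutex twice from the same
// thread blocks forever, which is the state shown in the backtraces.
#include <cstddef>
#include <iostream>
#include <mutex>

class Session {
public:
    // Mirrors HttpSession::SendSession(): take the session mutex, then call
    // into the transport layer while still holding it.
    void SendSession(const char *data, std::size_t size) {
        std::lock_guard<std::mutex> lock(mutex_);      // first acquisition
        if (!TransportSend(data, size)) {
            // TcpSession::Send() failure path: CloseInternal() fires the
            // session-event observer on this same thread.
            OnSessionEvent();
        }
    }

private:
    bool TransportSend(const char *, std::size_t) { return false; }  // simulate send failure

    // Mirrors HttpSession::OnSessionEvent(): it also wants mutex_. The second
    // acquisition never returns, and any other thread that later needs the
    // lock (e.g. AcceptSession in the main thread) queues up behind it.
    void OnSessionEvent() {
        std::cout << "session CLOSE event" << std::endl;
        std::lock_guard<std::mutex> lock(mutex_);      // second acquisition: deadlock
    }

    std::mutex mutex_;   // non-recursive, like the tbb::mutex in the trace
};

int main() {
    Session s;
    s.SendSession("61\r\n", 4);   // hangs here, mirroring frames #10 and #6
    return 0;
}

The usual remedies are either releasing the lock before calling into the transport layer or deferring the CLOSE handling so the observer does not run while mutex_ is held; whichever approach the actual fix takes is not shown here.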

root@a6s23:~# gdb /var/tmp/vizd.20 /var/tmp/core.10127
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /var/tmp/vizd.20...done.

warning: core file may not match specified executable file.
[New LWP 10947]
[New LWP 10948]
[New LWP 10949]
[New LWP 10950]
[New LWP 10127]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fff14ffe000
Core was generated by `contrail-collector'.
#0 0x00007fefeb9a5ee9 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007fefeb9a5ee9 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fefed276d32 in ?? () from /usr/lib/libtbb_debug.so.2
#2 0x00007fefed27708d in ?? () from /usr/lib/libtbb_debug.so.2
#3 0x00007fefed289d9f in rml::internal::thread_monitor::commit_wait(rml::internal::thread_monitor::cookie&) () from /usr/lib/libtbb_debug.so.2
#4 0x00007fefed28943d in ?? () from /usr/lib/libtbb_debug.so.2
#5 0x00007fefed289270 in ?? () from /usr/lib/libtbb_debug.so.2
#6 0x00007fefecde0e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7 0x00007fefeb9a9ccd in personality () from /lib/x86_64-linux-gnu/libc.so.6
#8 0x0000000000000000 in ?? ()
(gdb) info threads
  Id Target Id Frame
  5 LWP 10127 0x00007fefecde789c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
  4 LWP 10950 0x00007fefecde789c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
  3 LWP 10949 0x00007fefeb9a5ee9 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
  2 LWP 10948 0x00007fefeb9a5ee9 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
* 1 LWP 10947 0x00007fefeb9a5ee9 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) thread apply all bt

Thread 5 (LWP 10127):
#0 0x00007fefecde789c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007fefecde3065 in _L_lock_858 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007fefecde2eba in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00000000005cfa1e in lock (this=0xa4c560) at build/include/tbb/mutex.h:164
#4 acquire (mutex=..., this=0x7fff14e9e8e0) at build/include/tbb/mutex.h:105
#5 scoped_lock (mutex=..., this=0x7fff14e9e8e0) at build/include/tbb/mutex.h:91
#6 HttpSession::AcceptSession (this=0x1e5e250) at controller/src/http/http_session.cc:245
...

Revision history for this message
Megh Bhatt (meghb) wrote : Re: [Bug 1359183] Re: [Ubuntu havana Build 20] Collector Introspect port 8089 is not responding

Hi Chhandak,
Can you please let us know whether this happens on a regular basis while running a particular sanity test? Does it happen when you run contrail-status? We know the issue and want to ascertain whether this is really a blocker or not.

Thanks

Megh

On Aug 20, 2014, at 9:01 AM, chhandak <email address hidden> wrote:

> URL for bug log location
>
> http://mayamruga.englab.juniper.net/bugs/1359183/
>
> ** Changed in: juniperopenstack/r1.1
> Assignee: (unassigned) => Raj Reddy (rajreddy)
>
> ** Summary changed:
>
> - [Ubuntu havana Build 20] Collector Introspect port 8089 is not responding
> + [Ubuntu havana R1.10 Build 20] Collector Introspect port 8089 is not responding
>
> --
> You received this bug notification because you are a member of Contrail
> Systems engineering, which is subscribed to Juniper Openstack.
> https://bugs.launchpad.net/bugs/1359183
>
> Title:
> [Ubuntu havana R1.10 Build 20] Collector Introspect port 8089 is not
> responding
>
> Status in Juniper Openstack distribution:
> New
> Status in Juniper Openstack r1.1 series:
> New
>

Changed in juniperopenstack:
milestone: r1.10-fcs → none
Revision history for this message
Nagabhushana R (bhushana) wrote :

Hit the same issue in the control node as part of bug 1360604; we need the fix for 1.1.

tags: added: blocker
Revision history for this message
Megh Bhatt (meghb) wrote :
information type: Proprietary → Public