R1.10-build-25: Cluster down due to cassandra db service not responding

Bug #1363331 reported by Saravanan Musuvathi
Affects                                       Status    Importance   Assigned to        Milestone
Juniper Openstack (status tracked in Trunk)
  R1.1                                        Invalid   Critical     Deepinder Setia
  Trunk                                       Invalid   Critical     Deepinder Setia

Bug Description

After leaving the setup overnight with traffic, we noticed the following:
1. The nova-conductor log was full of errors and had grown to 40+ GB; disk space on one node was 100% consumed by these logs.
2. Multiple processes were stuck in Init state due to a Discovery connection issue.

Even after moving the logs out of the way, the setup did not recover.

The setup has been left in the failed state for debugging. Please look into why connections to Discovery are failing; you can check nodeg6, nodeg7, etc.

 cfgm:          nodeg6.englab.juniper.net (10.204.217.46)
 cfgm, control: nodeg7.englab.juniper.net (10.204.217.47)
 cfgm, control: nodeg8.englab.juniper.net (10.204.217.48)
 compute:       nodeg9.englab.juniper.net (10.204.217.49)
 compute:       nodeg22.englab.juniper.net (10.204.217.62)

root@nodeg7:~# contrail-status
== Contrail Control ==
supervisor-control: active
contrail-control initializing
contrail-control-nodemgr active
contrail-dns active
contrail-named active

== Contrail Analytics ==
supervisor-analytics: active
contrail-analytics-api initializing
contrail-analytics-nodemgr active
contrail-collector initializing
contrail-query-engine active

== Contrail Config ==
supervisor-config: active
contrail-api:0 initializing
contrail-config-nodemgr active
contrail-discovery:0 active
contrail-schema initializing
contrail-svc-monitor active

Revision history for this message
Saravanan Musuvathi (smusuvathi) wrote :
Changed in juniperopenstack:
milestone: none → r1.10-fcs
tags: added: blocker
information type: Proprietary → Public
summary: - R1.10-build-25: Multiple processes reporting connection to discovery
- down
+ R1.10-build-25: Cluster down due to error in Discovery process
Revision history for this message
Saravanan Musuvathi (smusuvathi) wrote : Re: R1.10-build-25: Cluster down due to error in Discovery process

Checked the setup with Prakash. The status of the discovery process is misleading: it reports active, which is not correct. Also, the following error has to be fixed [variable not defined]:

 'REQUEST_METHOD': 'POST',
 'SCRIPT_NAME': '',
 'SERVER_NAME': 'nodeg6.englab.juniper.net',
 'SERVER_PORT': '9110',
 'SERVER_PROTOCOL': 'HTTP/1.0',
 'SERVER_SOFTWARE': 'gevent/1.0 Python/2.7',
 'bottle.app': <bottle.Bottle object at 0x26763d0>,
 'bottle.request': <LocalRequest: POST http://10.204.217.46:5998/subscribe>,
 'bottle.request.body': <StringIO.StringIO instance at 0x7f16710>,
 'bottle.request.headers': <bottle.WSGIHeaderDict object at 0x26b3f050>,
 'bottle.request.json': {u'client': u'nodeg8:Contrail-Analytics-Nodemgr',
                         u'client-type': u'Contrail-Analytics-Nodemgr',
                         u'instances': 2,
                         u'service': u'Collector'},
 'bottle.request.urlparts': SplitResult(scheme='http', netloc='10.204.217.46:5998', path='/subscribe', query='', fragment=''),
 'bottle.route': <POST '/subscribe' <bound method DiscoveryServer.error_handler of <discovery.disc_server.DiscoveryServer instance at 0x277c3b0>>>,
 'route.handle': <POST '/subscribe' <bound method DiscoveryServer.error_handler of <discovery.disc_server.DiscoveryServer instance at 0x277c3b0>>>,
 'route.url_args': {},
 'wsgi.errors': <open file '<stderr>', mode 'w' at 0x7f73d6623270>,
 'wsgi.input': <StringIO.StringIO instance at 0x7f16710>,
 'wsgi.multiprocess': False,
 'wsgi.multithread': False,
 'wsgi.run_once': False,
 'wsgi.url_scheme': 'http',
 'wsgi.version': (1, 0)} failed with error

10.204.217.46 - - [2014-09-01 03:29:25] "POST /subscribe HTTP/1.1" socket 0 327.711255
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/bottle.py", line 764, in _handle
    return route.call(**args)
  File "/usr/lib/python2.7/dist-packages/bottle.py", line 1575, in wrapper
    rv = callback(*a, **ka)
  File "/usr/lib/python2.7/dist-packages/discovery/disc_server.py", line 332, in error_handler
    self._debug['503'] += 1
NameError: global name 'self' is not defined
10.204.217.46 - - [2014-09-01 03:29:27] "POST /subscribe HTTP/1.1" 500 890 0.003713
Traceback (most recent call last):
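
For context on the NameError above, a minimal sketch of how such an error arises and how it can be avoided (this is not the actual disc_server.py code; DiscoveryServer and the _debug['503'] counter are names taken from the traceback, everything else is illustrative):

import bottle

class DiscoveryServer(object):                  # name taken from the traceback
    def __init__(self):
        self._debug = {'503': 0}                # counter seen at disc_server.py:332
        self._app = bottle.Bottle()
        # Registering the *bound* method keeps `self` in scope inside the handler.
        self._app.error(503)(self.handle_503)

    def handle_503(self, err):
        self._debug['503'] += 1                 # fine: `self` is bound to the instance
        return 'Service Unavailable'

# Buggy variant: a module-level handler that still refers to `self`,
# which is undefined at module scope.
def handle_503_buggy(err):
    self._debug['503'] += 1                     # NameError: global name 'self' is not defined
    return 'Service Unavailable'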

summary: - R1.10-build-25: Cluster down due to error in Discovery process
+ R1.10-build-25: Cluster down due to cassandra db service not responding
Revision history for this message
Saravanan Musuvathi (smusuvathi) wrote :

Sequence of events that led to the problem:

1. On the DB node, nodeg6, the trace below caused the nova logs to grow to 30+ GB and consume all available disk space.
2. This left the cassandra process in a bad state: while its status showed ACTIVE, it was not listening on port 9160.
3. The setup was running with 2 DB nodes; with one DB node down, the read & write consistency quorum was not met.
4. Due to the above, the discovery & API services started seeing the errors listed in the previous post.

Actions to be taken:
a. For R1.10, the number of DB nodes should be odd; in case of an upgrade from previous releases, this should be addressed first.
b. Discovery needs to address the issue in the error code path.
c. We should report the correct state of the DB process (see the port-check sketch after this list).
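
For item (c), a minimal sketch of the kind of check that would have caught this (not the actual contrail-status/nodemgr code; the host, port, and output format are assumptions based on this setup): treat the database as healthy only if its Thrift port actually accepts connections, rather than trusting the process state alone.

import socket

def cassandra_port_open(host='127.0.0.1', port=9160, timeout=2.0):
    """Return True if something is accepting TCP connections on host:port."""
    try:
        sock = socket.create_connection((host, port), timeout)
        sock.close()
        return True
    except socket.error:
        return False

if __name__ == '__main__':
    # On nodeg6 in this bug, the cassandra process was up but this probe
    # would fail, i.e. the state that contrail-status reported as "active".
    state = 'active' if cassandra_port_open() else 'initializing (port 9160 not responding)'
    print('contrail-database: %s' % state)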

2014-08-29 06:25:02.107 30834 TRACE nova.openstack.common.rpc.common ConnectionError: 530: (NOT_ALLOWED - attempt to reuse consumer tag '1', (60, 20), None)
2014-08-29 06:25:02.107 30834 TRACE nova.openstack.common.rpc.common
2014-08-29 06:25:02.108 30834 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 10.204.217.46:5672
2014-08-29 06:25:02.117 30834 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 10.204.217.46:5672
2014-08-29 06:25:02.118 30834 ERROR nova.openstack.common.rpc.common [-] Failed to consume message from queue: 530: (NOT_ALLOWED - attempt to reuse consumer tag '1', (60, 20), None)
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common Traceback (most recent call last):
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/impl_kombu.py", line 577, in ensure
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common return method(*args, **kwargs)
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/impl_kombu.py", line 655, in _consume
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common queues_tail.consume(nowait=False)
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/impl_kombu.py", line 191, in consume
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common self.queue.consume(*args, callback=_callback, **options)
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/kombu/entity.py", line 595, in consume
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common nowait=nowait)
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/amqp/channel.py", line 1769, in basic_consume
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common (60, 21), # Channel.basic_consume_ok
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/amqp/abstract_channel.py", line 69, in wait
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common self.channel_id, allowed_methods)
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common File "/usr/lib/p...


Revision history for this message
Saravanan Musuvathi (smusuvathi) wrote :

Sundar's analysis:
=================
After further debugging, we found the root cause of the discovery issue.

Discovery was not able to read/write data from Cassandra because one of the Cassandra nodes [nodeg6] was not listening on port 9160. This was due to the disk space issue, which was later recovered; however, Cassandra did not recover.
The Cassandra log shows "Service exit with a return value of 143", so it looks like the Cassandra service did not shut down/exit cleanly. Despite this, contrail-status showed the database service as active.

From discovery/disc_cassdb.py:

rd_consistency = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM
wr_consistency = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM
self._disco_cf = pycassa.ColumnFamily(pool, self._disco_cf_name,
                                      read_consistency_level=rd_consistency,
                                      write_consistency_level=wr_consistency)
Since the read and write consistency is set to QUORUM, every read and write to Cassandra fails when Discovery cannot connect to one of the Cassandra nodes; the following code in discovery then raises a disc_exceptions.ServiceUnavailable exception.

    # decorator to catch connectivity error
    def cass_error_handler(func):
        def error_handler(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except (pycassa.pool.AllServersUnavailable,
                    pycassa.pool.MaximumRetryException):
                raise disc_exceptions.ServiceUnavailable()
            except Exception as e:
                raise
        return error_handler
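
As a self-contained illustration of how that decorator plays out under this failure (the exception classes below are stand-ins for pycassa.pool.AllServersUnavailable and disc_exceptions.ServiceUnavailable, and lookup_service is a hypothetical example, not code from disc_cassdb.py):

class AllServersUnavailable(Exception):     # stand-in for pycassa.pool.AllServersUnavailable
    pass

class ServiceUnavailable(Exception):        # stand-in for disc_exceptions.ServiceUnavailable
    pass

def cass_error_handler(func):
    def error_handler(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except AllServersUnavailable:
            raise ServiceUnavailable()
    return error_handler

@cass_error_handler
def lookup_service(service_type):
    # Simulates a QUORUM read with one of two replicas down.
    raise AllServersUnavailable('could not reach 2 of 2 replicas')

try:
    lookup_service('Collector')
except ServiceUnavailable:
    print('discovery answers the /subscribe request with 503 ServiceUnavailable')

This is the path every discovery and API request takes once quorum cannot be met.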

The following conditions must be met for a write/read to succeed when the read and write consistency is set to QUORUM.

From the Pycassa doc:

Write consistency levels make the following guarantees before reporting success to the client:
    QUORUM: Ensure that the write has been written to <ReplicationFactor> / 2 + 1 nodes.
Read consistency levels:
    QUORUM: Will query all storage nodes and return the record with the most recent timestamp once it has at least a majority of replicas reported. Again, the remaining replicas will be checked in the background.

In Discovery, the replication factor is set to the number of database nodes:

sys_mgr.create_keyspace(keyspace_name, SIMPLE_STRATEGY,
                                    {'replication_factor': str(num_dbnodes)})

In Musu's setup, there are two database nodes, so reads and writes to Cassandra failed as soon as one of the Cassandra nodes stopped responding. The API server shared the same fate, since its replication_factor and read/write consistency settings are the same as Discovery's.
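
To make the arithmetic concrete, here is a short standalone sketch (plain Python, no pycassa needed) of the QUORUM rule quoted above, applied to the two configurations discussed in this bug (RF=2 as deployed, RF=3 as recommended):

# Quorum arithmetic from the pycassa/Cassandra rule quoted above:
# a QUORUM read or write needs replication_factor // 2 + 1 live replicas.

def quorum(replication_factor):
    return replication_factor // 2 + 1

for rf in (2, 3):
    need = quorum(rf)
    tolerate = rf - need
    print('RF=%d: QUORUM needs %d live replica(s), tolerates %d node(s) down'
          % (rf, need, tolerate))

# RF=2: QUORUM needs 2 live replica(s), tolerates 0 node(s) down
#   -> with nodeg6 down, every discovery/API read and write fails.
# RF=3: QUORUM needs 2 live replica(s), tolerates 1 node(s) down
#   -> hence the recommendation for an odd number (>= 3) of database nodes.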

Thanks,
Sundar

Revision history for this message
Raj Reddy (rajreddy) wrote :

From 1.10 onwards, it is recommended to use an odd number (3 or more) of database nodes. When upgrading from 1.05 to 1.10, if the Contrail setup does not have the recommended number of database nodes, manual intervention is required.

tags: added: releasenote
tags: removed: blocker
Revision history for this message
Nagabhushana R (bhushana) wrote :

Very old issue; not seen of late.

Revision history for this message
Jeba Paulaiyan (jebap) wrote :

Closing as per comment #6. Please open a new bug if this happens again in recent images.
