Bug #1363331 “R1.10-build-25: Cluster down due to cassandra db s...” : Bugs : Juniper Openstack

Revision history for this message

Saravanan Musuvathi (smusuvathi) wrote on 2014-08-30:

#1

control_introspect.jpg Edit (103.0 KiB, image/jpeg)

Changed in juniperopenstack:
milestone:	none → r1.10-fcs
tags:	added: blocker

Ashish Ranjan (aranjan-n) on 2014-09-01

information type:

Proprietary → Public

Saravanan Musuvathi (smusuvathi) on 2014-09-01

summary:

- R1.10-build-25: Multiple processes reporting connection to discovery
- down
+ R1.10-build-25: Cluster down due to error in Discovery process

Revision history for this message

Saravanan Musuvathi (smusuvathi) wrote on 2014-09-01: Re: R1.10-build-25: Cluster down due to error in Discovery process

#2

Checked the setup with Prakash, the status of discovery process is misleading...it says active, which is not correct.. Also, following error has to be fixed [variable not defined]..

'REQUEST_METHOD': 'POST',
'SCRIPT_NAME': '',
'SERVER_NAME': 'nodeg6.englab.juniper.net',
'SERVER_PORT': '9110',
'SERVER_PROTOCOL': 'HTTP/1.0',
'SERVER_SOFTWARE': 'gevent/1.0 Python/2.7',
'bottle.app': <bottle.Bottle object at 0x26763d0>,
'bottle.request': <LocalRequest: POST http://10.204.217.46:5998/subscribe>,
'bottle.request.body': <StringIO.StringIO instance at 0x7f16710>,
'bottle.request.headers': <bottle.WSGIHeaderDict object at 0x26b3f050>,
'bottle.request.json': {u'client': u'nodeg8:Contrail-Analytics-Nodemgr',
                         u'client-type': u'Contrail-Analytics-Nodemgr',
                         u'instances': 2,
                         u'service': u'Collector'},
'bottle.request.urlparts': SplitResult(scheme='http', netloc='10.204.217.46:5998', path='/subscribe', query='', fragment=''),
'bottle.route': <POST '/subscribe' <bound method DiscoveryServer.error_handler of <discovery.disc_server.DiscoveryServer instance at 0x277c3b0>>>,
'route.handle': <POST '/subscribe' <bound method DiscoveryServer.error_handler of <discovery.disc_server.DiscoveryServer instance at 0x277c3b0>>>,
'route.url_args': {},
'wsgi.errors': <open file '<stderr>', mode 'w' at 0x7f73d6623270>,
'wsgi.input': <StringIO.StringIO instance at 0x7f16710>,
'wsgi.multiprocess': False,
'wsgi.multithread': False,
'wsgi.run_once': False,
'wsgi.url_scheme': 'http',
'wsgi.version': (1, 0)} failed with error

10.204.217.46 - - [2014-09-01 03:29:25] "POST /subscribe HTTP/1.1" socket 0 327.711255
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/bottle.py", line 764, in _handle
    return route.call(**args)
  File "/usr/lib/python2.7/dist-packages/bottle.py", line 1575, in wrapper
    rv = callback(*a, **ka)
  File "/usr/lib/python2.7/dist-packages/discovery/disc_server.py", line 332, in error_handler
    self._debug['503'] += 1
NameError: global name 'self' is not defined
10.204.217.46 - - [2014-09-01 03:29:27] "POST /subscribe HTTP/1.1" 500 890 0.003713
Traceback (most recent call last):

Checked the setup with Prakash, the status of discovery process is misleading...it says active, which is not correct.. Also, following error has to be fixed [variable not defined]..

'REQUEST_METHOD': 'POST',
 'SCRIPT_NAME': '',
 'SERVER_NAME': 'nodeg6.englab.juniper.net',
 'SERVER_PORT': '9110',
 'SERVER_PROTOCOL': 'HTTP/1.0',
 'SERVER_SOFTWARE': 'gevent/1.0 Python/2.7',
 'bottle.app': <bottle.Bottle object at 0x26763d0>,
 'bottle.request': <LocalRequest: POST http://10.204.217.46:5998/subscribe>,
 'bottle.request.body': <StringIO.StringIO instance at 0x7f16710>,
 'bottle.request.headers': <bottle.WSGIHeaderDict object at 0x26b3f050>,
 'bottle.request.json': {u'client': u'nodeg8:Contrail-Analytics-Nodemgr',
                         u'client-type': u'Contrail-Analytics-Nodemgr',
                         u'instances': 2,
                         u'service': u'Collector'},
 'bottle.request.urlparts': SplitResult(scheme='http', netloc='10.204.217.46:5998', path='/subscribe', query='', fragment=''),
 'bottle.route': <POST '/subscribe' <bound method DiscoveryServer.error_handler of <discovery.disc_server.DiscoveryServer instance at 0x277c3b0>>>,
 'route.handle': <POST '/subscribe' <bound method DiscoveryServer.error_handler of <discovery.disc_server.DiscoveryServer instance at 0x277c3b0>>>,
 'route.url_args': {},
 'wsgi.errors': <open file '<stderr>', mode 'w' at 0x7f73d6623270>,
 'wsgi.input': <StringIO.StringIO instance at 0x7f16710>,
 'wsgi.multiprocess': False,
 'wsgi.multithread': False,
 'wsgi.run_once': False,
 'wsgi.url_scheme': 'http',
 'wsgi.version': (1, 0)} failed with error

10.204.217.46 - - [2014-09-01 03:29:25] "POST /subscribe HTTP/1.1" socket 0 327.711255
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/bottle.py", line 764, in _handle
    return route.call(**args)
  File "/usr/lib/python2.7/dist-packages/bottle.py", line 1575, in wrapper
    rv = callback(*a, **ka)
  File "/usr/lib/python2.7/dist-packages/discovery/disc_server.py", line 332, in error_handler
    self._debug['503'] += 1
NameError: global name 'self' is not defined
10.204.217.46 - - [2014-09-01 03:29:27] "POST /subscribe HTTP/1.1" 500 890 0.003713
Traceback (most recent call last):

Saravanan Musuvathi (smusuvathi) on 2014-09-02

summary:

- R1.10-build-25: Cluster down due to error in Discovery process
+ R1.10-build-25: Cluster down due to cassandra db service not responding

Revision history for this message

Saravanan Musuvathi (smusuvathi) wrote on 2014-09-02:

#3

Download full text (4.4 KiB)

Sequence of events that lead to the problem:

1. In db node, nodeg6, due to following trace, nova logs grew to 30+GB and took complete available disk space..
2. This left cassandra process in a bad state, while the status showed ACTIVE, it was not listening to port 9160
3. The setup was running with 2 DB nodes; With one DB node down, read & write consistency quorum was not met ..
3. Due to above issue, discovery & API services started seeing error [listed in previous post]

Actions to be taken:
a. For R1.10, db nodes should be in odd numbers; In case of upgrade from previous releases, this should be addressed first ..
b. Discovery needs to address the issue in error code path
c. We should show correct state of DB process..

2014-08-29 06:25:02.107 30834 TRACE nova.openstack.common.rpc.common ConnectionError: 530: (NOT_ALLOWED - attempt to reuse consumer tag '1', (60, 20), None)
2014-08-29 06:25:02.107 30834 TRACE nova.openstack.common.rpc.common
2014-08-29 06:25:02.108 30834 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 10.204.217.46:5672
2014-08-29 06:25:02.117 30834 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 10.204.217.46:5672
2014-08-29 06:25:02.118 30834 ERROR nova.openstack.common.rpc.common [-] Failed to consume message from queue: 530: (NOT_ALLOWED - attempt to reuse consumer tag '1', (60, 20), None)
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common Traceback (most recent call last):
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/impl_kombu.py", line 577, in ensure
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common return method(*args, **kwargs)
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/impl_kombu.py", line 655, in _consume
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common queues_tail.consume(nowait=False)
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/impl_kombu.py", line 191, in consume
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common self.queue.consume(*args, callback=_callback, **options)
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/kombu/entity.py", line 595, in consume
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common nowait=nowait)
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/amqp/channel.py", line 1769, in basic_consume
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common (60, 21), # Channel.basic_consume_ok
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/amqp/abstract_channel.py", line 69, in wait
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common self.channel_id, allowed_methods)
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common File "/usr/lib/p...

Sequence of events that lead to the problem:

1. In db node, nodeg6, due to following trace, nova logs grew to 30+GB and took complete available disk space..
2. This left cassandra process in a bad state, while the status showed ACTIVE, it was not listening to port 9160
3. The setup was running with 2 DB nodes; With one DB node down, read & write consistency quorum was not met ..
3. Due to above issue, discovery & API services started seeing error [listed in previous post]

Actions to be taken:
a. For R1.10, db nodes should be in odd numbers; In case of upgrade from previous releases, this should be addressed first .. 
b. Discovery needs to address the issue in error code path
c. We should show correct state of DB process..

2014-08-29 06:25:02.107 30834 TRACE nova.openstack.common.rpc.common ConnectionError: 530: (NOT_ALLOWED - attempt to reuse consumer tag '1', (60, 20), None)
2014-08-29 06:25:02.107 30834 TRACE nova.openstack.common.rpc.common
2014-08-29 06:25:02.108 30834 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 10.204.217.46:5672
2014-08-29 06:25:02.117 30834 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 10.204.217.46:5672
2014-08-29 06:25:02.118 30834 ERROR nova.openstack.common.rpc.common [-] Failed to consume message from queue: 530: (NOT_ALLOWED - attempt to reuse consumer tag '1', (60, 20), None)
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common Traceback (most recent call last):
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/impl_kombu.py", line 577, in ensure
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common     return method(*args, **kwargs)
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/impl_kombu.py", line 655, in _consume
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common     queues_tail.consume(nowait=False)
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/impl_kombu.py", line 191, in consume
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common     self.queue.consume(*args, callback=_callback, **options)
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common   File "/usr/lib/python2.7/dist-packages/kombu/entity.py", line 595, in consume
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common     nowait=nowait)
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common   File "/usr/lib/python2.7/dist-packages/amqp/channel.py", line 1769, in basic_consume
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common     (60, 21),  # Channel.basic_consume_ok
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common   File "/usr/lib/python2.7/dist-packages/amqp/abstract_channel.py", line 69, in wait
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common     self.channel_id, allowed_methods)
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common   File "/usr/lib/python2.7/dist-packages/amqp/connection.py", line 237, in _wait_method
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common     self.wait()
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common   File "/usr/lib/python2.7/dist-packages/amqp/abstract_channel.py", line 71, in wait
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common     return self.dispatch_method(method_sig, args, content)
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common   File "/usr/lib/python2.7/dist-packages/amqp/abstract_channel.py", line 88, in dispatch_method
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common     return amqp_method(self, args)
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common   File "/usr/lib/python2.7/dist-packages/amqp/connection.py", line 491, in _close
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common     raise ConnectionError(reply_code, reply_text, (class_id, method_id))
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common ConnectionError: 530: (NOT_ALLOWED - attempt to reuse consumer tag '1', (60, 20), None)
2014-08-29 06:25:02.118 30834 TRACE nova.openstack.common.rpc.common
2014-08-29 06:25:02.119 30834 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 10.204.217.46:5672

Revision history for this message

Saravanan Musuvathi (smusuvathi) wrote on 2014-09-02:

#4

Sundar's analysis:
=================
After further debugging, found the root-cause for the discovery issue.

Discovery was not able to read/write data from Cassandra because one of the Cassandra nodes [nodeg6] was not listening on port 9160. This was due to disk space issue, which was later recovered. However, Cassandra did not recover.
From Cassandra log >> Service exit with a return value of 143. Looks like Cassandra service did not shutdown/exit cleanly. contrail-status showed that database service as active.

From discovery/disc_cassdb.py,
rd_consistency = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM
wr_consistency = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM
self._disco_cf = pycassa.ColumnFamily(pool, self._disco_cf_name,
read_consistency_level = rd_consistency,
write_consistency_level = wr_consistency)
Since the read and write consistency is set to QUORUM, any read and write to Cassandra fails (the following code in discovery raises disc_exceptions.ServiceUnavailable exception) because Discovery was not able to connect to one of the Cassandra nodes.

# decorator to catch connectivity error
    def cass_error_handler(func):
        def error_handler(*args, **kwargs):
            try:
                return func(*args,**kwargs)
            except (pycassa.pool.AllServersUnavailable,
                    pycassa.pool.MaximumRetryException):
                raise disc_exceptions.ServiceUnavailable()
            except Exception as e:
                raise
        return error_handler

The following condition should be met for write/read to be successful when the the read and write consistency is set to QUORUM.

From Pycassa doc:

Write consistency levels make the following guarantees before reporting success to the client:
    QUORUM Ensure that the write has been written to <ReplicationFactor> / 2 + 1 nodes
  Read:
    QUORUM Will query all storage nodes and return the record with the most recent timestamp once it has at least a majority of replicas reported. Again, the remaining replicas will be checked in the background.

In Discovery, the ReplicationFactor is set to number of database nodes.

sys_mgr.create_keyspace(keyspace_name, SIMPLE_STRATEGY,
{'replication_factor': str(num_dbnodes)})

In Musu's setup, there are two database nodes. Therefore, write/read to cassandra failed when one of the cassandra was not responding. Apiserver also shared the same fate as the replication_factor and read/write consistency is same as Discovery.

Thanks,
Sundar

Sundar's analysis:
=================
After further debugging, found the root-cause for the discovery issue.

Discovery was not able to read/write data from Cassandra because one of the Cassandra nodes [nodeg6] was not listening on port 9160. This was due to disk space issue, which was later recovered. However, Cassandra did not recover.
From Cassandra log >> Service exit with a return value of 143. Looks like Cassandra service did not shutdown/exit cleanly. contrail-status showed that database service as active.

From discovery/disc_cassdb.py,
rd_consistency = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM
wr_consistency = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM
self._disco_cf = pycassa.ColumnFamily(pool, self._disco_cf_name,
                                              read_consistency_level = rd_consistency,
                                              write_consistency_level = wr_consistency)
Since the read and write consistency is set to QUORUM, any read and write to Cassandra fails (the following code in discovery raises disc_exceptions.ServiceUnavailable exception) because Discovery was not able to connect to one of the Cassandra nodes.

# decorator to catch connectivity error
    def cass_error_handler(func):
        def error_handler(*args, **kwargs):
            try:
                return func(*args,**kwargs)
            except (pycassa.pool.AllServersUnavailable,
                    pycassa.pool.MaximumRetryException):
                raise disc_exceptions.ServiceUnavailable()
            except Exception as e:
                raise
        return error_handler

The following condition should be met for write/read to be successful when the the read and write consistency is set to QUORUM.

From Pycassa doc:

Write consistency levels make the following guarantees before reporting success to the client:
    QUORUM       Ensure that the write has been written to <ReplicationFactor> / 2 + 1 nodes
  Read:
    QUORUM       Will query all storage nodes and return the record with the most recent timestamp once it has at least a majority of replicas reported. Again, the remaining replicas will be checked in the background.

In Discovery, the ReplicationFactor is set to number of database nodes.

sys_mgr.create_keyspace(keyspace_name, SIMPLE_STRATEGY,
                                    {'replication_factor': str(num_dbnodes)})

In Musu's setup, there are two database nodes. Therefore, write/read to cassandra failed when one of the cassandra was not responding. Apiserver also shared the same fate as the replication_factor and read/write consistency is same as Discovery.

Thanks,
Sundar

Revision history for this message

Raj Reddy (rajreddy) wrote on 2014-09-03:

#5

From 1.10 onwards, it is recommended that we use 3 or more odd-number database nodes. When upgrading from 1.05 to 1.10, if the contrail setup does not have recommended numbers for database nodes, a manual intervention is required.

tags:

added: releasenote

Ashish Ranjan (aranjan-n) on 2014-09-04

tags:

removed: blocker

Revision history for this message

Nagabhushana R (bhushana) wrote on 2016-06-03:

#6

very old issue. Not seen off late.

Revision history for this message

Jeba Paulaiyan (jebap) wrote on 2016-08-15:

#7

Closing as per comment #6. Please open new bug if this happen again in recent images.

	Status	Importance	Assigned to	Milestone
Juniper Openstack	Status tracked in Trunk
R1.1	Invalid	Critical	Deepinder Setia	Juniper Openstack r1.10-fcs
Trunk	Invalid	Critical	Deepinder Setia

Juniper Openstack

R1.10-build-25: Cluster down due to cassandra db service not responding

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches