nova-compute state not updated

Bug #1404241 reported by Pavel Gluschak
This bug affects 1 person
Affects                    Status      Importance  Assigned to  Milestone
OpenStack Compute (nova)   Expired     Undecided   Unassigned
oslo.messaging             Incomplete  Undecided   Unassigned

Bug Description

I'm running 2014.2.1 on CentOS 7, with one controller and five compute nodes deployed using Packstack.

Whenever I reboot the controller node, some of the nova-compute services still report state XXX even 60 minutes after the reboot has completed and the controller node is up and running again:

[root@juno1 ~(keystone_admin)]# nova-manage service list
Binary Host Zone Status State Updated_At
nova-consoleauth juno1 internal enabled :-) 2014-12-19 13:17:48
nova-scheduler juno1 internal enabled :-) 2014-12-19 13:17:47
nova-conductor juno1 internal enabled :-) 2014-12-19 13:17:47
nova-cert juno1 internal enabled :-) 2014-12-19 13:17:48
nova-compute juno4 nova enabled XXX 2014-12-19 12:26:56
nova-compute juno5 nova enabled :-) 2014-12-19 13:17:47
nova-compute juno6 nova enabled :-) 2014-12-19 13:17:46
nova-compute juno3 nova enabled :-) 2014-12-19 13:17:46
nova-compute juno2 nova enabled XXX 2014-12-19 12:21:52
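
For context, the :-) / XXX column is not reported directly by the service; nova-manage derives it from how recently the service's heartbeat reached the database (the Updated_At column) compared with the service_down_time option. A minimal sketch of that check, assuming the default service_down_time of 60 seconds (not the actual nova code):

from datetime import datetime, timedelta

SERVICE_DOWN_TIME = 60  # seconds; assumed default for nova's service_down_time option

def service_state(updated_at, now=None):
    """Return ':-)' if the last heartbeat is recent enough, 'XXX' otherwise."""
    now = now or datetime.utcnow()
    if updated_at is None:
        return 'XXX'
    return ':-)' if now - updated_at <= timedelta(seconds=SERVICE_DOWN_TIME) else 'XXX'

# juno4's last heartbeat was at 12:26:56, but the listing was taken around 13:17:48,
# so the gap is far beyond service_down_time and the service shows XXX.
print(service_state(datetime(2014, 12, 19, 12, 26, 56),
                    now=datetime(2014, 12, 19, 13, 17, 48)))  # -> 'XXX'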

Here is the chunk of nova-compute log from juno4:

2014-12-19 15:46:02.082 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-12-19 15:46:02.083 5193 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to consume message from queue: Socket closed
2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 655, in ensure
2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit return method()
2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 735, in _consume
2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit return self.connection.drain_events(timeout=timeout)
2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 281, in drain_events
2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit return self.transport.drain_events(self.connection, **kwargs)
2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 94, in drain_events
2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit return connection.drain_events(**kwargs)
2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 299, in drain_events
2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit chanmap, None, timeout=timeout,
2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 362, in _wait_multiple
2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit channel, method_sig, args, content = read_timeout(timeout)
2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 333, in read_timeout
2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit return self.method_reader.read_method()
2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 189, in read_method
2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit raise m
2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit IOError: Socket closed
2014-12-19 15:46:02.083 5193 TRACE oslo.messaging._drivers.impl_rabbit
2014-12-19 15:46:02.084 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-12-19 15:46:03.084 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 9.167.185.90:5672
2014-12-19 15:46:03.096 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 9.167.185.90:5672
2014-12-19 15:46:03.105 5193 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 9.167.185.90:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2014-12-19 15:46:03.105 5193 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 9.167.185.90:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2014-12-19 15:46:04.106 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-12-19 15:46:04.106 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-12-19 15:46:05.106 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 9.167.185.90:5672
2014-12-19 15:46:05.116 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 9.167.185.90:5672
2014-12-19 15:46:05.157 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 9.167.185.90:5672
2014-12-19 15:46:05.159 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 9.167.185.90:5672
2014-12-19 15:46:33.229 5193 AUDIT nova.compute.resource_tracker [-] Auditing locally available compute resources
2014-12-19 15:46:33.781 5193 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to publish message to topic 'conductor': [Errno 104] Connection reset by peer
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 655, in ensure
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit return method()
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 752, in _publish
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit publisher = cls(self.conf, self.channel, topic=topic, **kwargs)
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 378, in __init__
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit **options)
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 326, in __init__
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit self.reconnect(channel)
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 334, in reconnect
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit routing_key=self.routing_key)
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/kombu/messaging.py", line 83, in __init__
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit self.revive(self._channel)
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/kombu/messaging.py", line 214, in revive
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit self.declare()
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/kombu/messaging.py", line 100, in declare
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit self.exchange.declare()
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/kombu/entity.py", line 163, in declare
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit nowait=nowait, passive=passive,
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 613, in exchange_declare
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit self._send_method((40, 10), args)
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit self.channel_id, method_sig, args, content,
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit write_frame(1, channel, payload)
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 177, in write_frame
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit frame_type, channel, size, payload, 0xce,
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/eventlet/greenio.py", line 359, in sendall
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit tail = self.send(data, flags)
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/eventlet/greenio.py", line 342, in send
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit total_sent += fd.send(data[total_sent:], flags)
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit error: [Errno 104] Connection reset by peer
2014-12-19 15:46:33.781 5193 TRACE oslo.messaging._drivers.impl_rabbit
2014-12-19 15:46:33.782 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-12-19 15:46:34.782 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 9.167.185.90:5672
2014-12-19 15:46:34.796 5193 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 9.167.185.90:5672

I can see that nova-compute lost the AMQP connection to the controller node because of the reboot, but the connection was established again once the reboot completed.
I can also manage running instances on that compute node and see the correct instance states.

The problem can be solved by running "systemctl restart openstack-nova-compute" on the nodes in the XXX state. After that I see the expected :-)
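
The stale Updated_At suggests it is the periodic state report that stops flowing rather than the service as a whole, since instances on the node can still be managed. Roughly, the service refreshes its database row from a heartbeat loop along these lines (report_state here is a hypothetical stand-in for nova-compute's periodic RPC call to the conductor; this is a sketch, not the actual nova code):

import time

REPORT_INTERVAL = 10  # seconds; assumed default for nova's report_interval option

def heartbeat_loop(report_state):
    """Call the hypothetical report_state() periodically to refresh updated_at."""
    while True:
        try:
            report_state()  # normally updates the service row's updated_at
        except Exception as exc:
            # if this keeps failing (or the loop's greenthread is stuck),
            # updated_at is never refreshed and the service shows as XXX
            print('heartbeat failed: %r' % exc)
        time.sleep(REPORT_INTERVAL)

If whatever sits underneath that call never fully recovers after the broker restart, the timestamp stays stale even though everything else works, which would explain why restarting openstack-nova-compute clears the XXX state.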

Revision history for this message
Joe Gordon (jogo) wrote :

It sounds like there may be an issue in oslo.messaging with reconnecting.
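
For reference, the reconnect path in oslo.messaging's rabbit driver is built on kombu, which already retries the raw connection. A minimal sketch of that layer, with the broker URL taken from the log above purely as a placeholder (this is not the driver's actual code):

from kombu import Connection

def errback(exc, interval):
    # kombu calls this before each retry with the error and the backoff delay
    print('broker unreachable: %r, retrying in %ss' % (exc, interval))

conn = Connection('amqp://guest:guest@9.167.185.90:5672//')
# keep retrying with increasing delay until the broker accepts connections again
conn.ensure_connection(errback=errback, max_retries=None)

The log above shows that level of recovery succeeding ("Connected to AMQP server"), so if there is a bug it is more likely in the layer that re-declares queues and consumers or republishes messages after the new connection is up.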

Changed in nova:
status: New → Incomplete
Revision history for this message
Pavel Gluschak (scsnow) wrote :

If it's an oslo issue, why does it affect only the nova-compute service, while neutron's L2 agents running on the same compute nodes report the correct state?

Revision history for this message
Pavel Gluschak (scsnow) wrote :

http://docs.openstack.org/openstack-ops/content/maintenance.html

"Also, in practice, the nova-compute services on the compute nodes do not always reconnect cleanly to rabbitmq hosted on the controller when it comes back up after a long reboot; a restart on the nova services on the compute nodes is required."

Revision history for this message
Mehdi Abaakouk (sileht) wrote :

I'm almost sure this bug no longer exists since oslo.messaging 1.5 or 1.6.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote :

Which version of oslo.messaging exhibits this issue?

Changed in oslo.messaging:
status: New → Incomplete
Revision history for this message
Pavel Gluschak (scsnow) wrote :

[root@juno1 api_tests]# rpm -qa | grep oslo
python-oslo-config-1.4.0-1.el7.centos.noarch
python-oslo-vmware-0.6.0-1.el7.centos.noarch
python-oslo-i18n-1.0.0-1.el7.centos.noarch
python-oslo-rootwrap-1.3.0.0-1.el7.centos.noarch
python-oslo-utils-1.0.0-1.el7.centos.noarch
python-oslo-db-1.0.2-2.el7.centos.noarch
python-oslo-messaging-1.4.1-3.el7.centos.noarch
python-oslo-sphinx-1.1-2.el7.centos.noarch
python-oslo-serialization-0.3.0-1.el7.centos.noarch

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired
Revision history for this message
Ken Giusti (kgiusti) wrote :

This bug has had no activity for more than 180 days. We are removing the assignee and milestone and setting the status to Incomplete in order to allow it to expire in 60 days.

If the bug is still valid, please update it with the latest observations.
