rbd failure to initialize during n-cpu startup breaks n-cpu

Bug #1451891 reported by Eric Harney
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Confirmed
Wishlist
Unassigned

Bug Description

CONF.libvirt.images_type is 'rbd' but rbd is not currently functional on this machine. The result is that the rbd library hangs n-cpu for a five minutes at startup, causing AMQP connections to fail.

Cinder recently addressed a similar problem in bug 1401335 by moving rbd library calls into a separate thread via eventlet.tpool. This also seems to help here.

2015-05-05 11:03:02.161 DEBUG nova.openstack.common.service [req-a34db69c-4189-48bb-b409-f5c9e7057130 Non
e None] ******************************************************************************** from (pid=14406)
 log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2197
2015-05-05 11:03:02.162 INFO nova.service [-] Starting compute node (version 2015.1)
2015-05-05 11:03:02.164 DEBUG nova.virt.libvirt.host [-] Starting native event thread from (pid=14406) _i
nit_events /opt/stack/nova/nova/virt/libvirt/host.py:446
2015-05-05 11:03:02.165 DEBUG nova.virt.libvirt.host [-] Starting green dispatch thread from (pid=14406)
_init_events /opt/stack/nova/nova/virt/libvirt/host.py:452
2015-05-05 11:03:02.166 DEBUG nova.virt.libvirt.host [-] Connecting to libvirt: qemu:///system from (pid=
14406) _get_new_connection /opt/stack/nova/nova/virt/libvirt/host.py:457
2015-05-05 11:03:02.225 INFO nova.virt.libvirt.driver [-] Connection event '1' reason 'None'
2015-05-05 11:03:02.278 DEBUG nova.virt.libvirt.host [req-0e133ef3-90f7-4441-909f-9f3be8bda496 None None]
 Registering for lifecycle events <nova.virt.libvirt.host.Host object at 0x61d6e90> from (pid=14406) _get
_new_connection /opt/stack/nova/nova/virt/libvirt/host.py:475
2015-05-05 11:03:02.282 DEBUG nova.virt.libvirt.host [req-0e133ef3-90f7-4441-909f-9f3be8bda496 None None]
 Registering for connection events: <nova.virt.libvirt.host.Host object at 0x61d6e90> from (pid=14406) _g
et_new_connection /opt/stack/nova/nova/virt/libvirt/host.py:486
2015-05-05 11:03:02.292 INFO nova.virt.libvirt.host [req-0e133ef3-90f7-4441-909f-9f3be8bda496 None None]
Libvirt host capabilities <capabilities>

  <host>
    <uuid>a7d47f19-9cda-2206-fad2-bdf40da6186b</uuid>
    <cpu>
      <arch>x86_64</arch>
      <model>cpu64-rhel6</model>
      <vendor>Intel</vendor>
      <topology sockets='4' cores='1' threads='1'/>
      <feature name='hypervisor'/>
    </cpu>
    <power_management>
      <suspend_disk/>
      <suspend_hybrid/>
    </power_management>
    <migration_features>
      <live/>
      <uri_transports>
        <uri_transport>tcp</uri_transport>
      </uri_transports>
    </migration_features>
    <topology>
      <cells num='1'>
        <cell id='0'>
          <memory unit='KiB'>6974268</memory>
          <cpus num='4'>
            <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
            <cpu id='1' socket_id='1' core_id='0' siblings='1'/>
            <cpu id='2' socket_id='2' core_id='0' siblings='2'/>
            <cpu id='3' socket_id='3' core_id='0' siblings='3'/>
          </cpus>
        </cell>
      </cells>
    </topology>
    <secmodel>
      <model>selinux</model>
      <doi>0</doi>
    </secmodel>
    <secmodel>
      <model>dac</model>
      <doi>0</doi>
    </secmodel>
  </host>

  <guest>
    <os_type>hvm</os_type>
    <arch name='i686'>
      <wordsize>32</wordsize>
      <emulator>/usr/bin/qemu-system-i386</emulator>
      <machine canonical='pc-i440fx-1.6' maxCpus='255'>pc</machine>
      <machine maxCpus='255'>pc-q35-1.4</machine>
      <machine maxCpus='255'>pc-q35-1.5</machine>
      <machine canonical='pc-q35-1.6' maxCpus='255'>q35</machine>
      <machine maxCpus='1'>isapc</machine>
      <machine maxCpus='255'>pc-0.10</machine>
      <machine maxCpus='255'>pc-0.11</machine>
      <machine maxCpus='255'>pc-0.12</machine>
      <machine maxCpus='255'>pc-0.13</machine>
      <machine maxCpus='255'>pc-0.14</machine>
      <machine maxCpus='255'>pc-0.15</machine>
      <machine maxCpus='255'>pc-1.0</machine>
      <machine maxCpus='255'>pc-1.1</machine>
      <machine maxCpus='255'>pc-1.2</machine>
      <machine maxCpus='255'>pc-1.3</machine>
      <machine maxCpus='255'>pc-i440fx-1.4</machine>
      <machine maxCpus='255'>pc-i440fx-1.5</machine>
      <machine maxCpus='1'>none</machine>
      <domain type='qemu'>
      </domain>
    </arch>
    <features>
      <cpuselection/>
      <deviceboot/>
      <acpi default='on' toggle='yes'/>
      <apic default='on' toggle='no'/>
      <pae/>
      <nonpae/>
    </features>
  </guest>

  <guest>
    <os_type>hvm</os_type>
    <arch name='x86_64'>
      <wordsize>64</wordsize>
      <emulator>/usr/bin/qemu-system-x86_64</emulator>
      <machine canonical='pc-i440fx-1.6' maxCpus='255'>pc</machine>
      <machine maxCpus='255'>pc-q35-1.4</machine>
      <machine maxCpus='255'>pc-q35-1.5</machine>
      <machine canonical='pc-q35-1.6' maxCpus='255'>q35</machine>
      <machine maxCpus='1'>isapc</machine>
      <machine maxCpus='255'>pc-0.10</machine>
      <machine maxCpus='255'>pc-0.11</machine>
      <machine maxCpus='255'>pc-0.12</machine>
      <machine maxCpus='255'>pc-0.13</machine>
      <machine maxCpus='255'>pc-0.14</machine>
      <machine maxCpus='255'>pc-0.15</machine>
      <machine maxCpus='255'>pc-1.0</machine>
      <machine maxCpus='255'>pc-1.1</machine>
      <machine maxCpus='255'>pc-1.2</machine>
      <machine maxCpus='255'>pc-1.3</machine>
      <machine maxCpus='255'>pc-i440fx-1.4</machine>
      <machine maxCpus='255'>pc-i440fx-1.5</machine>
      <machine maxCpus='1'>none</machine>
      <domain type='qemu'>
      </domain>
    </arch>
    <features>
      <cpuselection/>
      <deviceboot/>
      <acpi default='on' toggle='yes'/>
      <apic default='on' toggle='no'/>
    </features>
  </guest>

</capabilities>

2015-05-05 11:03:02.296 DEBUG nova.virt.libvirt.config [req-0e133ef3-90f7-4441-909f-9f3be8bda496 None Non
e] Generated XML ('<cpu>\n <arch>x86_64</arch>\n <model>cpu64-rhel6</model>\n <vendor>Intel</vendor>\n
  <topology sockets="4" cores="1" threads="1"/>\n <feature name="hypervisor"/>\n</cpu>\n',) from (pid=1
4406) to_xml /opt/stack/nova/nova/virt/libvirt/config.py:82
2015-05-05 11:03:02.560 INFO nova.compute.resource_tracker [req-0e133ef3-90f7-4441-909f-9f3be8bda496 None
 None] Auditing locally available compute resources for node devstacktest
2015-05-05 11:08:02.606 INFO nova.virt.libvirt.driver [-] Connection event '0' reason 'Connection to libv
irt lost: 1'
2015-05-05 11:08:02.612 ERROR oslo_messaging._drivers.impl_qpid [req-fd8a2988-cb21-4e98-81d9-c089b93ce009
 None None] Failed to publish message to topic 'conductor': heartbeat timeout
2015-05-05 11:08:02.612 TRACE oslo_messaging._drivers.impl_qpid Traceback (most recent call last):
2015-05-05 11:08:02.612 TRACE oslo_messaging._drivers.impl_qpid File "/usr/lib/python2.7/site-packages/
oslo_messaging/_drivers/impl_qpid.py", line 614, in ensure
2015-05-05 11:08:02.612 TRACE oslo_messaging._drivers.impl_qpid return method()
2015-05-05 11:08:02.612 TRACE oslo_messaging._drivers.impl_qpid File "/usr/lib/python2.7/site-packages/
oslo_messaging/_drivers/impl_qpid.py", line 710, in _publisher_send
2015-05-05 11:08:02.612 TRACE oslo_messaging._drivers.impl_qpid **kwargs)
2015-05-05 11:08:02.612 TRACE oslo_messaging._drivers.impl_qpid File "/usr/lib/python2.7/site-packages/
oslo_messaging/_drivers/impl_qpid.py", line 420, in __init__
2015-05-05 11:08:02.612 TRACE oslo_messaging._drivers.impl_qpid super(TopicPublisher, self).__init__(
conf, session, node_name)
2015-05-05 11:08:02.612 TRACE oslo_messaging._drivers.impl_qpid File "/usr/lib/python2.7/site-packages/
oslo_messaging/_drivers/impl_qpid.py", line 352, in __init__
2015-05-05 11:08:02.612 TRACE oslo_messaging._drivers.impl_qpid self.reconnect(session)
2015-05-05 11:08:02.612 TRACE oslo_messaging._drivers.impl_qpid File "/usr/lib/python2.7/site-packages/
oslo_messaging/_drivers/impl_qpid.py", line 356, in reconnect
2015-05-05 11:08:02.612 TRACE oslo_messaging._drivers.impl_qpid self.sender = session.sender(self.add
ress)
2015-05-05 11:08:02.612 TRACE oslo_messaging._drivers.impl_qpid File "<string>", line 6, in sender
2015-05-05 11:08:02.612 TRACE oslo_messaging._drivers.impl_qpid File "/usr/lib/python2.7/site-packages/
qpid/messaging/endpoints.py", line 606, in sender
2015-05-05 11:08:02.612 TRACE oslo_messaging._drivers.impl_qpid sender._ewait(lambda: sender.linked)
2015-05-05 11:08:02.612 TRACE oslo_messaging._drivers.impl_qpid File "/usr/lib/python2.7/site-packages/
qpid/messaging/endpoints.py", line 814, in _ewait
2015-05-05 11:08:02.612 TRACE oslo_messaging._drivers.impl_qpid result = self.session._ewait(lambda:
self.error or predicate(), timeout)
2015-05-05 11:08:02.612 TRACE oslo_messaging._drivers.impl_qpid File "/usr/lib/python2.7/site-packages/
qpid/messaging/endpoints.py", line 580, in _ewait
2015-05-05 11:08:02.612 TRACE oslo_messaging._drivers.impl_qpid result = self.connection._ewait(lambd
a: self.error or predicate(), timeout)
2015-05-05 11:08:02.612 TRACE oslo_messaging._drivers.impl_qpid File "/usr/lib/python2.7/site-packages/
qpid/messaging/endpoints.py", line 219, in _ewait
2015-05-05 11:08:02.612 TRACE oslo_messaging._drivers.impl_qpid self.check_error()
2015-05-05 11:08:02.612 TRACE oslo_messaging._drivers.impl_qpid File "/usr/lib/python2.7/site-packages/
qpid/messaging/endpoints.py", line 212, in check_error
2015-05-05 11:08:02.612 TRACE oslo_messaging._drivers.impl_qpid raise e
2015-05-05 11:08:02.612 TRACE oslo_messaging._drivers.impl_qpid HeartbeatTimeout: heartbeat timeout
2015-05-05 11:08:02.612 TRACE oslo_messaging._drivers.impl_qpid
2015-05-05 11:08:02.653 ERROR oslo_messaging._drivers.impl_qpid [req-fd8a2988-cb21-4e98-81d9-c089b93ce009
 None None] Unable to connect to AMQP server on {'username': '', 'host': '172.16.0.10:5672', 'password':
''}: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/qpid/messaging/driver.py", line 504, in dispatch
    self.connect()
  File "/usr/lib/python2.7/site-packages/qpid/messaging/driver.py", line 531, in connect
    self._transport = trans(self.connection, host, port)
  File "/usr/lib/python2.7/site-packages/qpid/messaging/transports.py", line 28, in __init__
    self.socket = connect(host, port)
  File "/usr/lib/python2.7/site-packages/qpid/util.py", line 66, in connect
    sock = socket.socket(af, socktype, proto)
  File "/usr/lib/python2.7/site-packages/eventlet/greenio.py", line 130, in __init__
    notify_opened(fd.fileno())
  File "/usr/lib/python2.7/site-packages/eventlet/hubs/__init__.py", line 189, in notify_opened
    hub.mark_as_reopened(fd)
  File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 241, in mark_as_reopened
    self._obsolete(fileno)
  File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 206, in _obsolete
    listener.defang()
  File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 73, in defang
    self.mark_as_closed()
TypeError: <lambda>() takes exactly 1 argument (0 given)
. Sleeping 1 seconds
2015-05-05 11:08:02.654 ERROR nova.openstack.common.threadgroup [req-a34db69c-4189-48bb-b409-f5c9e7057130 None None] Expected int or long, got <class 'eventlet.hubs.IOClosed'>
2015-05-05 11:08:02.654 TRACE nova.openstack.common.threadgroup Traceback (most recent call last):
2015-05-05 11:08:02.654 TRACE nova.openstack.common.threadgroup File "/opt/stack/nova/nova/openstack/common/threadgroup.py", line 145, in wait
2015-05-05 11:08:02.654 TRACE nova.openstack.common.threadgroup x.wait()
2015-05-05 11:08:02.654 TRACE nova.openstack.common.threadgroup File "/opt/stack/nova/nova/openstack/common/threadgroup.py", line 47, in wait
2015-05-05 11:08:02.654 TRACE nova.openstack.common.threadgroup return self.thread.wait()
2015-05-05 11:08:02.654 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 175, in wait
2015-05-05 11:08:02.654 TRACE nova.openstack.common.threadgroup return self._exit_event.wait()
2015-05-05 11:08:02.654 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 121, in wait
2015-05-05 11:08:02.654 TRACE nova.openstack.common.threadgroup return hubs.get_hub().switch()
2015-05-05 11:08:02.654 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch
2015-05-05 11:08:02.654 TRACE nova.openstack.common.threadgroup return self.greenlet.switch()
2015-05-05 11:08:02.654 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 332, in run
2015-05-05 11:08:02.654 TRACE nova.openstack.common.threadgroup self.close_one()
2015-05-05 11:08:02.654 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 265, in close_one
2015-05-05 11:08:02.654 TRACE nova.openstack.common.threadgroup listener.tb(IOClosed(errno.ENOTCONN, "Operation on closed file"))
2015-05-05 11:08:02.654 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/site-packages/eventlet/green/select.py", line 59, in on_error
2015-05-05 11:08:02.654 TRACE nova.openstack.common.threadgroup original = ds[get_fileno(d)]['error']
2015-05-05 11:08:02.654 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/site-packages/eventlet/green/select.py", line 20, in get_fileno
2015-05-05 11:08:02.654 TRACE nova.openstack.common.threadgroup raise TypeError("Expected int or long, got %s" % type(obj))
2015-05-05 11:08:02.654 TRACE nova.openstack.common.threadgroup TypeError: Expected int or long, got <class 'eventlet.hubs.IOClosed'>
2015-05-05 11:08:02.654 TRACE nova.openstack.common.threadgroup

Tags: ceph
Eric Harney (eharney)
Changed in nova:
assignee: nobody → Eric Harney (eharney)
status: New → In Progress
tags: added: rbd
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/180207

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/180212

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/180207
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8c69469e54b6d39088865f2e84ddc5de238d4627
Submitter: Jenkins
Branch: master

commit 8c69469e54b6d39088865f2e84ddc5de238d4627
Author: Eric Harney <email address hidden>
Date: Tue May 5 11:29:48 2015 -0400

    Fix missing format arg in compute manager

    Error message is not printing correctly due
    to broken format string.

    Change-Id: Ic153e11e8847418576fde8578f17b842b5fedc46
    Related-Bug: #1451891

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/180212
Reason: This has been WIP'ed for a long time, so I am going to abandon it. Please de-abandon when this code is ready for review.

Changed in nova:
assignee: Eric Harney (eharney) → nobody
status: In Progress → Confirmed
melanie witt (melwitt)
tags: added: ceph
removed: rbd
Revision history for this message
John Garbutt (johngarbutt) wrote :

So I think this is basically a feature request, to ensure RBD calls don't block the main Nova thread.

The fact nova is broken while it can't talk to RBD sounds like a good thing, it's a simpler failure.

Changed in nova:
importance: Undecided → Wishlist
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.