gate-tempest-dsvm-postgres-full fails - unable to schedule instance

Bug #1257644 reported by Gary Kotton
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Triaged
Importance: Critical
Assigned to: Unassigned

Bug Description

The trace for the failure is:

2013-12-04 08:01:19.167 INFO nova.filters [req-9b187d9f-258f-42ef-8235-973de78a46d9 InstanceActionsV3TestXML-tempest-1651913648-user InstanceActionsV3TestXML-tempest-1651913648-tenant] Filter ComputeFilter returned 0 hosts
2013-12-04 08:01:19.168 WARNING nova.scheduler.driver [req-9b187d9f-258f-42ef-8235-973de78a46d9 InstanceActionsV3TestXML-tempest-1651913648-user InstanceActionsV3TestXML-tempest-1651913648-tenant] [instance: 78a1af24-f108-404c-a9e8-a021362c206b] Setting instance to ERROR state.
2013-12-04 08:01:27.352 INFO nova.scheduler.filter_scheduler [req-24c3c05c-c371-41b6-a314-3fa91340ed18 ImagesV3TestJSON-tempest-309232411-user ImagesV3TestJSON-tempest-309232411-tenant] Attempting to build 1 instance(s) uuids: [u'7e39573c-4ffd-4405-a109-cdca0685fe9c']

Revision history for this message
Joe Gordon (jogo) wrote :

The scheduler failed to schedule the instance:
http://logs.openstack.org/04/51404/7/gate/gate-tempest-dsvm-postgres-full/b1825bf/logs/screen-n-sch.txt.gz?level=INFO#_2013-12-04_08_01_19_168

It's unclear to me whether this is an issue with tempest trying to consume too many resources or something wrong in nova.

Revision history for this message
Joe Gordon (jogo) wrote :

As Matt Riedemann pointed out in a duplicate bug, it looks like a heartbeat issue with nova-compute:

http://logs.openstack.org/04/51404/7/gate/gate-tempest-dsvm-postgres-full/b1825bf/logs/screen-n-sch.txt.gz?level=DEBUG#_2013-12-04_08_01_19_167

2013-12-04 08:01:19.167 DEBUG nova.servicegroup.drivers.db [req-9b187d9f-258f-42ef-8235-973de78a46d9 InstanceActionsV3TestXML-tempest-1651913648-user InstanceActionsV3TestXML-tempest-1651913648-tenant] DB_Driver.is_up last_heartbeat = 2013-12-04 08:01:19.008808 elapsed = 0.158322 is_up /opt/stack/new/nova/nova/servicegroup/drivers/db.py:71
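For context, the DB_Driver.is_up check shown in that log line boils down to comparing the elapsed time since the service's last heartbeat against a configured timeout. A minimal sketch (the function shape and the 60-second value are assumptions for illustration, not nova's actual code):

```python
from datetime import datetime

# Illustrative stand-in for nova.servicegroup.drivers.db is_up: a
# service is considered "up" if its last heartbeat is recent enough.
SERVICE_DOWN_TIME = 60  # seconds; assumed default, not taken from the log

def is_up(last_heartbeat, now):
    elapsed = (now - last_heartbeat).total_seconds()
    return abs(elapsed) <= SERVICE_DOWN_TIME

# The log above reports elapsed = 0.158322, so is_up returned True --
# the heartbeat itself was healthy when the filter rejected the host.
print(is_up(datetime(2013, 12, 4, 8, 1, 19, 8808),
            datetime(2013, 12, 4, 8, 1, 19, 167130)))  # True
```

So a stale heartbeat alone doesn't explain the filter returning 0 hosts here.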

A bug like this was seen before in https://review.openstack.org/#/c/46336/ https://bugs.launchpad.net/nova/+bug/1221987

Changed in nova:
status: New → Triaged
Joe Gordon (jogo)
no longer affects: tempest
Revision history for this message
Joe Gordon (jogo) wrote :

Yet another gate issue, marking as critical.

Changed in nova:
milestone: none → icehouse-2
importance: Undecided → Critical
Revision history for this message
Matt Riedemann (mriedem) wrote :

I've got an e-r query patch proposal here: https://review.openstack.org/#/c/60508/

Revision history for this message
Matt Riedemann (mriedem) wrote :

Hmm, so I did this query:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiRmlsdGVyIENvbXB1dGVGaWx0ZXIgcmV0dXJuZWQgMCBob3N0c1wiIEFORCBmaWxlbmFtZTpcImxvZ3Mvc2NyZWVuLW4tc2NoLnR4dFwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzg2MzM3Mzc0Njk0fQ==

Which also turns this up:

2013-12-05 23:39:35.662 DEBUG nova.scheduler.filters.compute_filter [req-416eeb14-f13d-41a5-90f0-84fe1f589100 ServersAdminTestXML-tempest-546887800-user ServersAdminTestXML-tempest-546887800-tenant] (devstack-precise-check-hpcloud-region-b-801504, devstack-precise-check-hpcloud-region-b-801504) ram:7154 disk:113664 io_ops:0 instances:5 is disabled, reason: None host_passes /opt/stack/new/nova/nova/scheduler/filters/compute_filter.py:43

2013-12-05 23:39:35.662 INFO nova.filters [req-416eeb14-f13d-41a5-90f0-84fe1f589100 ServersAdminTestXML-tempest-546887800-user ServersAdminTestXML-tempest-546887800-tenant] Filter ComputeFilter returned 0 hosts

2013-12-05 23:39:35.731 WARNING nova.scheduler.driver [req-416eeb14-f13d-41a5-90f0-84fe1f589100 ServersAdminTestXML-tempest-546887800-user ServersAdminTestXML-tempest-546887800-tenant] [instance: b44877c0-cd63-4812-b679-9d1594fcf3e4] Setting instance to ERROR state.

The difference is that rather than seeing "is disabled or has not been heard from in a while" for the host, we see "is disabled, reason: None" - but it still results in the same failure. That's coming from this log:

http://logs.openstack.org/47/54647/7/check/check-tempest-dsvm-postgres-full/70389b4/logs/screen-n-sch.txt

Revision history for this message
Matt Riedemann (mriedem) wrote :

For the above log with the disabled service, we see the same tempest failure:

http://logs.openstack.org/47/54647/7/check/check-tempest-dsvm-postgres-full/70389b4/console.html

BuildErrorException: Server b44877c0-cd63-4812-b679-9d1594fcf3e4 failed to build and is in ERROR status

Revision history for this message
Matt Riedemann (mriedem) wrote :

Maybe a red herring, but I'm suspicious that something which went into nova recently is automatically disabling a host, or breaking something in the set_host_enabled flow. These are the most recent changes in that path:

Change Ib4cc80e5: Add notification for host operation
Change Ib8d67838: Disable nova-compute on libvirt connectivity exceptions
Change Ia7544a35: Conditionalise automatic enabling of disabled host
Change I394e92fe: libvirt: Fix log message when disable/enable a host

I think it's got to be Ib8d67838 or Ia7544a35. My thinking is that we have a libvirt connection hiccup which automatically disables the host; since tempest is running tests in parallel, a request hits that window, the ComputeFilter doesn't return the host because it's disabled, and then kaboom!
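That race can be illustrated with a toy model (not nova's code; the Service class and host_passes function here are simplified stand-ins for the ComputeFilter logic):

```python
# Toy model of the window described above: the ComputeFilter rejects
# any host whose service record is disabled, with no way to tell an
# operator disable from an automatic one.
class Service:
    def __init__(self, disabled=False, disabled_reason=None):
        self.disabled = disabled
        self.disabled_reason = disabled_reason

def host_passes(service, service_is_up=True):
    """Simplified stand-in for ComputeFilter.host_passes."""
    return not service.disabled and service_is_up

# A transient libvirt hiccup auto-disables the only host...
hosts = [Service(disabled=True, disabled_reason=None)]
passing = [s for s in hosts if host_passes(s)]
# ...and the scheduler logs "Filter ComputeFilter returned 0 hosts".
print(len(passing))  # 0
```

With a single compute node in the gate, one hiccup during a parallel run is enough to fail whatever build lands in that window.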

Revision history for this message
Matt Riedemann (mriedem) wrote :

I can't do a logstash query on whether or not the host was automatically disabled, because the ComputeFilter logs a disabled host at debug level and we don't index debug-level messages. And change Ia7544a35 only adds the "AUTO:" disabled_reason prefix in the libvirt driver, so checking for that in the ComputeFilter would be a hack; long-term we really need a column in the Service model to indicate whether the enable/disable action was automatic, as suggested in bug 1250049.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Furthermore, the AUTO disable flag change Ia7544a35 merged on 11/21, but Phil's patch to log the disabled_reason at debug level didn't merge until 12/5, so we'll have to keep an eye out for it in any new rechecks against this bug.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Looks like the libvirt close callback is only registered if we're running on libvirt >= 1.0.1 - do we know if that's the case?

Revision history for this message
Matt Riedemann (mriedem) wrote :

The logic here is killing me:

https://review.openstack.org/#/c/56224/4/nova/virt/libvirt/driver.py

I'm not really sure how that "AUTO:" prefix is going to be set on the disabled_reason when disabling the service, especially with the changes to _close_callback here:

https://review.openstack.org/#/c/52189/9/nova/virt/libvirt/driver.py

It looks like if the connection being closed is not the one we registered the callback against (not sure how that would happen?), then the _error variable is never assigned and you'd get an UnboundLocalError at runtime.
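That failure mode is easy to reproduce in isolation. A standalone sketch (hypothetical names, not the actual driver code):

```python
# A variable bound only inside a conditional branch raises
# UnboundLocalError (a subclass of NameError) when the branch is
# skipped and the variable is referenced anyway.
def close_callback(conn, registered_conn):
    if conn == registered_conn:
        _error = "Connection to libvirt lost"
    # If the connections differ, _error was never bound in this
    # scope, so referencing it here blows up at runtime.
    return _error

try:
    close_callback("conn-a", "conn-b")
except UnboundLocalError:
    print("raised UnboundLocalError")
```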

Matt Riedemann (mriedem)
tags: added: libvirt scheduler
Revision history for this message
Matt Riedemann (mriedem) wrote :

We can probably take the libvirt _close_callback out of the equation here since we're running with libvirt 0.9.8:

"Setting up libvirt0 (0.9.8-2ubuntu17.16) ..."

And the registerCloseCallback is in version >= 1.0.1, so that can't be the problem.

That leaves the finally block in _get_connection added here:

https://review.openstack.org/#/c/52189/9/nova/virt/libvirt/driver.py

So in that case wrapped_conn would have to be None, which makes bool(None) evaluate to False, and that would disable the service. The nasty thing is that the disabled_reason gets set to '', so we don't even know that the host was automatically disabled - which is exactly what change Ia7544a35 was trying to address.
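A rough sketch of that finally-block pattern (names and the host_state dict here are illustrative stand-ins; the real code lives in nova/virt/libvirt/driver.py):

```python
# Track what set_host_enabled was last called with.
host_state = {"enabled": True, "disabled_reason": None}

def set_host_enabled(enabled, disable_reason=''):
    """Stand-in that just records what it was called with."""
    host_state["enabled"] = enabled
    host_state["disabled_reason"] = disable_reason

def get_connection(connect):
    wrapped_conn = None
    try:
        wrapped_conn = connect()
    finally:
        # bool(None) is False after a failed connect, so the host is
        # disabled with an empty disabled_reason -- no "AUTO:" marker.
        set_host_enabled(bool(wrapped_conn), disable_reason='')
    return wrapped_conn

def failing_connect():
    raise RuntimeError("Connection to libvirt failed")

try:
    get_connection(failing_connect)
except RuntimeError:
    pass  # the original libvirt error still propagates to the caller

print(host_state)  # {'enabled': False, 'disabled_reason': ''}
```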

Revision history for this message
Matt Riedemann (mriedem) wrote :

Ignore the eventlet bug report; that was fixed in eventlet 0.13, which nova requires at a minimum.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/60584

Revision history for this message
Matt Riedemann (mriedem) wrote :

Opened related bug 1258625 to get rid of the hack introduced by "fixing" bug 1250049.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/60563
Committed: http://github.com/openstack/nova/commit/ea091907f1c9168494ec1c0ea875e55c71868996
Submitter: Jenkins
Branch: master

commit ea091907f1c9168494ec1c0ea875e55c71868996
Author: Matt Riedemann <email address hidden>
Date: Fri Dec 6 08:54:14 2013 -0800

    Fix UnboundLocalError in libvirt.driver._close_callback

    If for whatever reason _wrapped_conn is None or not equal to the conn
    parameter passed to the _close_callback method, the local _error
    variable won't be set when calling set_host_enabled and result in an
    UnboundLocalError.

    Partial-Bug: #1257644

    Change-Id: I40855f22d5449f3b3646c90fe2d9a71560eb86c4

Revision history for this message
Matt Riedemann (mriedem) wrote :

Wondering if maybe this fixed the libvirt connection fail? https://review.openstack.org/57000

Revision history for this message
Matt Riedemann (mriedem) wrote :

Actually, it looks like we're still hitting this on 12/9, so 57000 doesn't fix it.

message:"Connection to libvirt failed" AND filename:"logs/screen-n-cpu.txt"

Revision history for this message
Matt Riedemann (mriedem) wrote :

Bug 1251521 might be related.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/60584
Committed: http://github.com/openstack/nova/commit/4b6df96adec23ba90fd6277d88d7edd6c514fe73
Submitter: Jenkins
Branch: master

commit 4b6df96adec23ba90fd6277d88d7edd6c514fe73
Author: Matt Riedemann <email address hidden>
Date: Fri Dec 6 10:39:34 2013 -0800

    libvirt: add a test to guard against set_host_enabled raising an error

    Commit cc70e3a adds a finally block to _get_new_connection to
    automatically disable the host if making a connection to libvirt fails.

    The set_host_enabled method currently swallows all exceptions which is
    good so it doesn't mask the original libvirt failure in
    _get_new_connection, but we need to have a test that makes sure that
    isn't unintentionally regressed. This patch adds the test.

    Related-Bug: #1257644

    Change-Id: I4beec053794edc75ef2d0060a05285f7ed0b99ac
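The guarantee that commit guards can be sketched in isolation like so (illustrative names only, not the actual nova test):

```python
# set_host_enabled must swallow any exception so it never masks the
# original libvirt connection failure raised in _get_new_connection.
def set_host_enabled(enabled, update_db):
    try:
        update_db(enabled)
    except Exception:
        # Deliberately swallowed; the real code logs the failure.
        pass

def test_set_host_enabled_swallows_errors():
    def broken_update(enabled):
        raise RuntimeError("DB unavailable")
    # Must complete without raising:
    set_host_enabled(False, broken_update)

test_set_host_enabled_swallows_errors()
print("ok")  # reached only if no exception escaped
```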

Revision history for this message
Matt Riedemann (mriedem) wrote :

The root issue here is the libvirt connection getting reset and the libvirt driver code then automatically setting the host to disabled, so the scheduler skips over it and can't schedule the instance. Tracking this as a duplicate of bug 1255624.
