Rescheduling loses reasons

Bug #1161661 reported by Joshua Harlow
This bug affects 16 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Andrew Laski
Milestone: 2015.1.0

Bug Description

In nova.compute.manager, when an instance is rescheduled (for whatever reason) the exception that caused the rescheduling is only logged, and is not shown to the user in any fashion. In the extreme case the user has no idea what happened when rescheduling finally fails.

For example:

Say the following happens: instance 1 is scheduled to hypervisor A, which fails with error X; it is rescheduled to hypervisor B, which fails with error Y; then it cannot be rescheduled again because no more hypervisors are available (aka no more compute nodes). At that point you basically get an error saying there are no more hosts to schedule on, which is not connected to the original errors in any fashion.

Likely there needs to be a record of the rescheduling exceptions, or rescheduling needs to be rethought so that an orchestration unit can perform the rescheduling and be more aware of the rescheduling attempts (and their successes and failures).
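A minimal sketch of the kind of record being asked for, assuming reschedule state is carried along in a retry dict (the names and structure below are illustrative, not nova's actual code):

import traceback

def record_reschedule_reason(filter_properties, exc):
    # Append the current failure to the (hypothetical) retry metadata so the
    # final error can reference the whole chain of reschedule reasons.
    retry = filter_properties.setdefault('retry', {'num_attempts': 0, 'exc': []})
    retry.setdefault('exc', []).append(
        ''.join(traceback.format_exception_only(type(exc), exc)).strip())

def final_error_message(filter_properties):
    # Build a user-visible message that includes the earlier failures.
    reasons = filter_properties.get('retry', {}).get('exc', [])
    if not reasons:
        return 'No valid host was found.'
    return 'No valid host was found. Earlier attempts failed with: ' + '; '.join(reasons)

props = {}
record_reschedule_reason(props, RuntimeError('error X on hypervisor A'))
record_reschedule_reason(props, RuntimeError('error Y on hypervisor B'))
print(final_error_message(props))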

Revision history for this message
Andrew Laski (alaski) wrote :

The exceptions are stored as instance faults, but that information is not exposed. There is another place to keep this which is exposed: the instance actions and events tables. Currently scheduling events are recorded in the scheduler manager, which may not catch all exceptions that can occur. That should probably move up a level, or be extended, in order to capture all exceptions.
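Roughly the idea: record the failure against the instance actions/events tables so it can be queried later. This is only an illustration with an in-memory stand-in for the table, not nova's real data-access layer:

import datetime
import traceback

instance_action_events = []  # stand-in for the instance_actions_events table

def record_event_failure(instance_uuid, event_name, exc):
    # Persist the exception text alongside the event that was in progress.
    instance_action_events.append({
        'instance_uuid': instance_uuid,
        'event': event_name,  # e.g. 'schedule_instances'
        'result': 'Error',
        'traceback': ''.join(traceback.format_exception_only(type(exc), exc)).strip(),
        'finish_time': datetime.datetime.utcnow(),
    })

try:
    raise RuntimeError('No valid host was found')
except RuntimeError as exc:
    record_event_failure('example-uuid', 'schedule_instances', exc)
print(instance_action_events[-1]['traceback'])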

Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → Andrew Laski (alaski)
Revision history for this message
Joshua Harlow (harlowja) wrote :

I guess there is a general question of how much info to expose in the first place. With the way rescheduling is done right now it's more of an all-or-nothing process. With a higher-level 'entity' doing the rescheduling, the final message could be delivered more 'smartly', and likely in a better manner than telling users the whole path of 'exceptions' that caused the final error. But I guess exposing it is at least a start (if that's really info we want to expose in the first place...).

Revision history for this message
Andrew Laski (alaski) wrote :

Since we have a mechanism for exposing this sort of information to admins, I think a good start would be to get the information in there. I am very much in favor of reworking the whole process to be handled more smartly by a higher-level concept, but that probably gets out of the realm of a bug report and into design discussions and blueprints. So while that happens as a parallel effort, we can address this concern in a more immediate manner.

Revision history for this message
Andrew Laski (alaski) wrote :

Looking at this closer, it appears that NoValidHost exceptions are caught in the scheduler manager and not re-raised, thus not getting captured by the event tracking the scheduling.
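A simplified illustration of what this means in practice (not the actual scheduler manager code): when the handler swallows NoValidHost, any event tracking wrapped around the call records a success, whereas re-raising lets it record the failure.

class NoValidHost(Exception):
    pass

def schedule(swallow):
    try:
        raise NoValidHost('no hosts left to try')
    except NoValidHost:
        if swallow:
            # Current behaviour (roughly): handle the error locally and return,
            # so a tracking wrapper around schedule() never sees a failure.
            return None
        # Possible fix: re-raise so the surrounding event tracking captures it.
        raise

def tracked(fn, *args):
    # Stand-in for an event-tracking wrapper around the scheduling call.
    try:
        fn(*args)
        return 'event recorded as Success'
    except Exception as exc:
        return 'event recorded as Error: %s' % exc

print(tracked(schedule, True))   # event recorded as Success (failure is hidden)
print(tracked(schedule, False))  # event recorded as Error: no hosts left to try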

Revision history for this message
Guangya Liu (Jay Lau) (jay-lau-513) wrote :

Someone also reported this issue in https://bugs.launchpad.net/nova/+bug/1165034

What about the following solution:
When the retry filter fails to find a target hypervisor node, do not write the "NoValidHost" fault to the instance_faults table. That way instance_faults keeps the last error from nova-compute, so the customer can see what happened on the last hypervisor.

Comments?
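As a rough sketch of that proposal (the list below is just a stand-in for the instance_faults table, not nova's real code): the generic NoValidHost fault is simply not recorded, so the last compute-side error stays visible.

instance_faults = [
    {'instance_uuid': 'abc', 'message': 'InstanceTypeDiskTooSmall', 'created_at': 1},
]

def add_fault(fault):
    # Proposed behaviour: keep the last real error from nova-compute visible by
    # not recording the generic NoValidHost raised once retries run out.
    if fault['message'] == 'NoValidHost':
        return
    instance_faults.append(fault)

add_fault({'instance_uuid': 'abc', 'message': 'NoValidHost', 'created_at': 2})
print(instance_faults[-1]['message'])  # InstanceTypeDiskTooSmall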

Revision history for this message
Joe Gordon (jogo) wrote :

On a related note, when the retry filter is disabled, nova-compute still attempts a retry. This breaks the paradigm of making the filters optional.

Revision history for this message
Andrew Laski (alaski) wrote :

Joe, the retry behaviour is controlled by CONF.scheduler_max_attempts in scheduler/driver.py. The retry filter just keeps it from getting rescheduled to the same host.
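For context, the cap works roughly like this (a simplified sketch assuming retry state rides along in filter_properties; not the actual scheduler/driver.py code):

SCHEDULER_MAX_ATTEMPTS = 3  # stand-in for CONF.scheduler_max_attempts

def check_retry(filter_properties):
    # Count attempts and bail out once the configured cap is exceeded.
    retry = filter_properties.setdefault('retry', {'num_attempts': 0, 'hosts': []})
    retry['num_attempts'] += 1
    if retry['num_attempts'] > SCHEDULER_MAX_ATTEMPTS:
        raise RuntimeError('Exceeded max scheduling attempts')
    # The retry filter only consults retry['hosts'] to avoid re-picking a host
    # that already failed; disabling the filter does not disable retries.

props = {}
for _ in range(3):
    check_retry(props)   # attempts 1-3 pass
# a 4th check_retry(props) would raise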

Revision history for this message
Tiantian Gao (gtt116) wrote :

Is scheduling an action like 'start'/'reboot'/'resize', or is it just an internal action that we don't want to expose to the user?
Maybe we can make two types of `action`: one public, the other private. 'start'/'reboot' belong to the former and can be exposed to users; the latter, like 'schedule'/'reschedule', would be exposed only to admins.
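One way to picture that split (purely illustrative; not nova's actual instance-actions API):

# Expose user-facing actions to everyone, internal ones only to admins.
PUBLIC_ACTIONS = {'start', 'reboot', 'resize'}
ADMIN_ACTIONS = {'schedule', 'reschedule'}

def visible_actions(actions, is_admin):
    allowed = PUBLIC_ACTIONS | (ADMIN_ACTIONS if is_admin else set())
    return [a for a in actions if a['action'] in allowed]

history = [{'action': 'reboot'}, {'action': 'reschedule'}]
print([a['action'] for a in visible_actions(history, is_admin=False)])  # ['reboot']
print([a['action'] for a in visible_actions(history, is_admin=True)])   # ['reboot', 'reschedule']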

Revision history for this message
Andrew Laski (alaski) wrote :

@TianTian you make a good point. Scheduling is an internal part of an action such as boot or resize. So it does seem like we need different levels of exposure for things. I agree with your notion of exposing scheduling failures to admins, and boot/resize failures to users.

Revision history for this message
Brad Pokorny (bpokorny) wrote :

This blueprint implements part of the fix for this bug: https://blueprints.launchpad.net/nova/+spec/remove-cast-to-schedule-run-instance

Revision history for this message
Qiu Yu (unicell) wrote :

A proposed fix: use the nova instance action events table to record the reason that caused the rescheduling:
https://review.openstack.org/#/c/58506/

Revision history for this message
Shannon McFarland (shmcfarl) wrote :

I see that https://review.openstack.org/#/c/58506/ is lacking one more approver. I still see this issue every time I try to boot an Ubuntu or Fedora image on m1.tiny. It works on m1.small and above. The log shows (as it did in bug https://bugs.launchpad.net/nova/+bug/1245276):

2014-01-07 15:28:09.322 31221 TRACE nova.compute.manager [instance: d05aca27-855d-4ade-968e-b12628644c5f] File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/imagebackend.py", line 312, in create_image
2014-01-07 15:28:09.322 31221 TRACE nova.compute.manager [instance: d05aca27-855d-4ade-968e-b12628644c5f] raise exception.InstanceTypeDiskTooSmall()
2014-01-07 15:28:09.322 31221 TRACE nova.compute.manager [instance: d05aca27-855d-4ade-968e-b12628644c5f] InstanceTypeDiskTooSmall: Instance type's disk is too small for requested image.

We really need to get this resolved. It happens on every deployment we have. Thanks.
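For context on the trace above, the failure comes from the image's virtual size exceeding the flavor's root disk. Simplified (not the actual imagebackend.py code), the check amounts to:

GiB = 1024 ** 3

def check_image_fits(image_virtual_size_bytes, flavor_root_gb):
    # Simplified version of the check behind InstanceTypeDiskTooSmall.
    if flavor_root_gb and image_virtual_size_bytes > flavor_root_gb * GiB:
        raise ValueError("Instance type's disk is too small for requested image.")

try:
    check_image_fits(image_virtual_size_bytes=2 * GiB, flavor_root_gb=1)
except ValueError as exc:
    print(exc)  # Instance type's disk is too small for requested image.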

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

We have renamed InstanceTypeDiskTooSmall to FlavorDiskTooSmall, and the reason does get back to the end user.

Changed in nova:
status: Confirmed → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → kilo-3
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: kilo-3 → 2015.1.0