Rescheduling loses reasons

Bug #1161661 reported by Joshua Harlow
This bug affects 16 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Andrew Laski
Milestone: 2015.1.0

Bug Description

In nova.compute.manager, when an instance is rescheduled (for whatever reason) the exception that caused the rescheduling is only logged, and is not shown to the user in any fashion. In the extreme case the user has no idea what happened when rescheduling finally fails.

For example:

Say the following happens: instance 1 is scheduled to hypervisor A, which fails with error X; it is rescheduled to hypervisor B, which fails with error Y; then it cannot be rescheduled again because no more hypervisors are available (aka no more compute nodes). At that point you basically get an error saying there are no more hosts to schedule on, which is not connected to the original errors in any fashion.

Likely there needs to be a record of the rescheduling exceptions, or rescheduling needs to be rethought so that an orchestration unit can perform the rescheduling and be more aware of the rescheduling attempts (and their successes and failures).
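A minimal sketch of the kind of record being asked for, assuming reschedule state is carried along in a retry dict (the names and structure below are illustrative, not nova's actual code):

import traceback

def record_reschedule_reason(filter_properties, exc):
    # Append the current failure to the (hypothetical) retry metadata so the
    # final error can reference the whole chain of reschedule reasons.
    retry = filter_properties.setdefault('retry', {'num_attempts': 0, 'exc': []})
    retry.setdefault('exc', []).append(
        ''.join(traceback.format_exception_only(type(exc), exc)).strip())

def final_error_message(filter_properties):
    # Build a user-visible message that includes the earlier failures.
    reasons = filter_properties.get('retry', {}).get('exc', [])
    if not reasons:
        return 'No valid host was found.'
    return 'No valid host was found. Earlier attempts failed with: ' + '; '.join(reasons)

props = {}
record_reschedule_reason(props, RuntimeError('error X on hypervisor A'))
record_reschedule_reason(props, RuntimeError('error Y on hypervisor B'))
print(final_error_message(props))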

Revision history for this message
Andrew Laski (alaski) wrote :

The exceptions are stored as instance faults, but that information is not exposed. There is another place to keep this which is exposed: the instance actions and events tables. Currently scheduling events are recorded in the scheduler manager, which may not catch all exceptions that can occur. That should probably move up a level, or be extended, in order to capture all exceptions.
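Roughly the idea: record the failure against the instance actions/events tables so it can be queried later. This is only an illustration with an in-memory stand-in for the table, not nova's real data-access layer:

import datetime
import traceback

instance_action_events = []  # stand-in for the instance_actions_events table

def record_event_failure(instance_uuid, event_name, exc):
    # Persist the exception text alongside the event that was in progress.
    instance_action_events.append({
        'instance_uuid': instance_uuid,
        'event': event_name,  # e.g. 'schedule_instances'
        'result': 'Error',
        'traceback': ''.join(traceback.format_exception_only(type(exc), exc)).strip(),
        'finish_time': datetime.datetime.utcnow(),
    })

try:
    raise RuntimeError('No valid host was found')
except RuntimeError as exc:
    record_event_failure('example-uuid', 'schedule_instances', exc)
print(instance_action_events[-1]['traceback'])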

Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → Andrew Laski (alaski)
Revision history for this message
Joshua Harlow (harlowja) wrote :

I guess there is a general question of how much info to expose in the first place. With the way rescheduling is done right now it's more of an all-or-nothing process. With a higher-level 'entity' doing the rescheduling, the final message could be delivered more 'smartly', and likely in a better manner than telling users the whole path of 'exceptions' that caused the final error. But I guess exposing it is at least a start (if that's really info we want to expose in the first place...).

Revision history for this message
Andrew Laski (alaski) wrote :

Since we have a mechanism for exposing this sort of information to admins, I think a good start would be to get the information in there. I am very much in favor of reworking the whole process to be handled more smartly by a higher-level concept, but that probably gets out of the realm of a bug report and into design discussions and blueprints. So while that happens as a parallel effort, we can address this concern in a more immediate manner.

Revision history for this message
Andrew Laski (alaski) wrote :

Looking at this closer, it appears that NoValidHost exceptions are caught in the scheduler manager and not re-raised, thus not getting captured by the event tracking the scheduling.
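A simplified illustration of what this means in practice (not the actual scheduler manager code): when the handler swallows NoValidHost, any event tracking wrapped around the call records a success, whereas re-raising lets it record the failure.

class NoValidHost(Exception):
    pass

def schedule(swallow):
    try:
        raise NoValidHost('no hosts left to try')
    except NoValidHost:
        if swallow:
            # Current behaviour (roughly): handle the error locally and return,
            # so a tracking wrapper around schedule() never sees a failure.
            return None
        # Possible fix: re-raise so the surrounding event tracking captures it.
        raise

def tracked(fn, *args):
    # Stand-in for an event-tracking wrapper around the scheduling call.
    try:
        fn(*args)
        return 'event recorded as Success'
    except Exception as exc:
        return 'event recorded as Error: %s' % exc

print(tracked(schedule, True))   # event recorded as Success (failure is hidden)
print(tracked(schedule, False))  # event recorded as Error: no hosts left to try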

Revision history for this message
Guangya Liu (Jay Lau) (jay-lau-513) wrote :

Someone also reported this issue in https://bugs.launchpad.net/nova/+bug/1165034

What about the following solution:
When the retry filter fails to find a target hypervisor node, do not write the "NoValidHost" fault to the instance_faults table. That way instance_faults keeps the last error from nova-compute, so the customer can see what happened on the last hypervisor.

Comments?
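As a rough sketch of that proposal (the list below is just a stand-in for the instance_faults table, not nova's real code): the generic NoValidHost fault is simply not recorded, so the last compute-side error stays visible.

instance_faults = [
    {'instance_uuid': 'abc', 'message': 'InstanceTypeDiskTooSmall', 'created_at': 1},
]

def add_fault(fault):
    # Proposed behaviour: keep the last real error from nova-compute visible by
    # not recording the generic NoValidHost raised once retries run out.
    if fault['message'] == 'NoValidHost':
        return
    instance_faults.append(fault)

add_fault({'instance_uuid': 'abc', 'message': 'NoValidHost', 'created_at': 2})
print(instance_faults[-1]['message'])  # InstanceTypeDiskTooSmall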

Revision history for this message
Joe Gordon (jogo) wrote :

On a related note, when the retry filter is disabled, nova-compute still attempts a retry. This breaks the paradigm of making the filters optional.

Revision history for this message
Andrew Laski (alaski) wrote :

Joe, the retry behaviour is controlled by CONF.scheduler_max_attempts in scheduler/driver.py. The retry filter just keeps it from getting rescheduled to the same host.
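For context, the cap works roughly like this (a simplified sketch assuming retry state rides along in filter_properties; not the actual scheduler/driver.py code):

SCHEDULER_MAX_ATTEMPTS = 3  # stand-in for CONF.scheduler_max_attempts

def check_retry(filter_properties):
    # Count attempts and bail out once the configured cap is exceeded.
    retry = filter_properties.setdefault('retry', {'num_attempts': 0, 'hosts': []})
    retry['num_attempts'] += 1
    if retry['num_attempts'] > SCHEDULER_MAX_ATTEMPTS:
        raise RuntimeError('Exceeded max scheduling attempts')
    # The retry filter only consults retry['hosts'] to avoid re-picking a host
    # that already failed; disabling the filter does not disable retries.

props = {}
for _ in range(3):
    check_retry(props)   # attempts 1-3 pass
# a 4th check_retry(props) would raise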

Revision history for this message
Tiantian Gao (gtt116) wrote :

Is scheduling an action like 'start'/'reboot'/'resize', or is it just an internal action that we don't want to expose to the user?
Maybe we can make two types of `action`: one public, the other private. 'start'/'reboot' belong to the former and can be exposed to users; the latter, like 'schedule'/'reschedule', would be exposed only to admins.
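One way to picture that split (purely illustrative; not nova's actual instance-actions API):

# Expose user-facing actions to everyone, internal ones only to admins.
PUBLIC_ACTIONS = {'start', 'reboot', 'resize'}
ADMIN_ACTIONS = {'schedule', 'reschedule'}

def visible_actions(actions, is_admin):
    allowed = PUBLIC_ACTIONS | (ADMIN_ACTIONS if is_admin else set())
    return [a for a in actions if a['action'] in allowed]

history = [{'action': 'reboot'}, {'action': 'reschedule'}]
print([a['action'] for a in visible_actions(history, is_admin=False)])  # ['reboot']
print([a['action'] for a in visible_actions(history, is_admin=True)])   # ['reboot', 'reschedule']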

Revision history for this message
Andrew Laski (alaski) wrote :

@TianTian you make a good point. Scheduling is an internal part of an action such as boot or resize. So it does seem like we need different levels of exposure for things. I agree with your notion of exposing scheduling failures to admins, and boot/resize failures to users.

Revision history for this message
Brad Pokorny (bpokorny) wrote :

This blueprint implements part of the fix for this bug: https://blueprints.launchpad.net/nova/+spec/remove-cast-to-schedule-run-instance

Revision history for this message
Qiu Yu (unicell) wrote :

A proposed fix: use the nova instance action events table to record the reason that caused the rescheduling:
https://review.openstack.org/#/c/58506/

Revision history for this message
Shannon McFarland (shmcfarl) wrote :

I see that https://review.openstack.org/#/c/58506/ is lacking one more approver. I still see this issue every time I try to boot an Ubuntu or Fedora image on m1.tiny. It works on m1.small and above. The log shows (as it did in bug https://bugs.launchpad.net/nova/+bug/1245276):

2014-01-07 15:28:09.322 31221 TRACE nova.compute.manager [instance: d05aca27-855d-4ade-968e-b12628644c5f] File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/imagebackend.py", line 312, in create_image
2014-01-07 15:28:09.322 31221 TRACE nova.compute.manager [instance: d05aca27-855d-4ade-968e-b12628644c5f] raise exception.InstanceTypeDiskTooSmall()
2014-01-07 15:28:09.322 31221 TRACE nova.compute.manager [instance: d05aca27-855d-4ade-968e-b12628644c5f] InstanceTypeDiskTooSmall: Instance type's disk is too small for requested image.

We really need to get this resolved. It happens on every deployment we have. Thanks.
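For context on the trace above, the failure comes from the image's virtual size exceeding the flavor's root disk. Simplified (not the actual imagebackend.py code), the check amounts to:

GiB = 1024 ** 3

def check_image_fits(image_virtual_size_bytes, flavor_root_gb):
    # Simplified version of the check behind InstanceTypeDiskTooSmall.
    if flavor_root_gb and image_virtual_size_bytes > flavor_root_gb * GiB:
        raise ValueError("Instance type's disk is too small for requested image.")

try:
    check_image_fits(image_virtual_size_bytes=2 * GiB, flavor_root_gb=1)
except ValueError as exc:
    print(exc)  # Instance type's disk is too small for requested image.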

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

We have renamed InstanceTypeDiskTooSmall to FlavorDiskTooSmall, and the reason does get back to the end user.

Changed in nova:
status: Confirmed → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → kilo-3
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: kilo-3 → 2015.1.0