Comment 9 for bug 1742102

Matt Riedemann (mriedem) wrote:

I'm trying to think of a simple way to mark certain failures so they don't count against the consecutive build failure count. The easy case is a volume over-quota error when booting from volume and nova is trying to create the volume. We handle that OverQuota here and raise BuildAbortException:

https://github.com/openstack/nova/blob/7bdb7dbbddf9fcb4284d490bf315d6756f4015e7/nova/compute/manager.py#L2209
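
For context, the handling in that block looks roughly like this (a paraphrased sketch, not a verbatim copy of manager.py):

# Paraphrased sketch of the block linked above (not verbatim): the
# volume over-quota error surfaces while prepping block devices and is
# converted into a BuildAbortException.
try:
    # Boot-from-volume volume creation happens in here.
    block_device_info = self._prep_block_device(
        context, instance, block_device_mapping)
except exception.OverQuota as e:
    # The original cause (OverQuota) is swallowed here; only the
    # BuildAbortException propagates up the stack.
    raise exception.BuildAbortException(
        instance_uuid=instance.uuid, reason=e.format_message())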

That BuildAbortException is eventually handled here:

https://github.com/openstack/nova/blob/7bdb7dbbddf9fcb4284d490bf315d6756f4015e7/nova/compute/manager.py#L1903

The problem is that that method essentially returns an enum value (build_results.FAILED), and that's what gets checked when counting build failures to decide whether to disable the compute service:

https://github.com/openstack/nova/blob/7bdb7dbbddf9fcb4284d490bf315d6756f4015e7/nova/compute/manager.py#L1750

So the code in _build_failed() doesn't have any context about the actual failure, which makes whitelisting certain types of failures hard.
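
Roughly, that flow looks like this (paraphrased sketch, not a verbatim copy of manager.py; the service-disable details are omitted):

# Paraphrased sketch of the failure accounting in ComputeManager
# (not verbatim).
def _locked_do_build_and_run_instance(self, *args, **kwargs):
    result = self._do_build_and_run_instance(*args, **kwargs)
    # Only the enum result survives to this point; the exception that
    # caused the failure (e.g. OverQuota wrapped in a
    # BuildAbortException) is not available here.
    if result == build_results.FAILED:
        self._build_failed()
    else:
        self._failed_builds = 0
    return result

def _build_failed(self):
    self._failed_builds += 1
    limit = CONF.compute.consecutive_build_service_disable_threshold
    if limit and self._failed_builds >= limit:
        # There is no context here about *why* the builds failed, so a
        # user-caused quota error counts the same as a real host fault.
        ...  # disable this compute service (details omitted)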

I was thinking about this block where we raise BuildAbortException:

https://github.com/openstack/nova/blob/7bdb7dbbddf9fcb4284d490bf315d6756f4015e7/nova/compute/manager.py#L2209

If we were handling an OverQuota error there, we could set a flag on the BuildAbortException and check it later up the stack in the _build_failed() logic. But since _build_failed() doesn't get the actual exception, just the build_results.FAILED enum value (which, by the way, only exists for the "build_instance" hook), we can't do that.
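
To make the idea concrete, it would look something like this; the counts_against_build_failures attribute is made up for illustration and, as noted, nothing up the stack would see it today:

# Hypothetical sketch only: the 'counts_against_build_failures'
# attribute does not exist, and _build_failed() never sees the
# exception, only the build_results enum.
try:
    block_device_info = self._prep_block_device(
        context, instance, block_device_mapping)
except exception.OverQuota as e:
    abort = exception.BuildAbortException(
        instance_uuid=instance.uuid, reason=e.format_message())
    # Flag this as a user/quota problem rather than a host problem.
    abort.counts_against_build_failures = False  # hypothetical attribute
    raise abort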

One super hacky thing we could do is, if we get OverQuota during _prep_block_device, reset self._failed_builds to 0, like here:

https://github.com/openstack/nova/blob/7bdb7dbbddf9fcb4284d490bf315d6756f4015e7/nova/compute/manager.py#L2209
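
Purely to illustrate the hack (not a proposed patch):

# Hacky sketch only: reset the consecutive failure counter when the
# failure is a volume quota error.
try:
    block_device_info = self._prep_block_device(
        context, instance, block_device_mapping)
except exception.OverQuota as e:
    # Don't count a user's quota problem as a host failure.
    self._failed_builds = 0
    raise exception.BuildAbortException(
        instance_uuid=instance.uuid, reason=e.format_message())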

But if consecutive_build_service_disable_threshold is set to 1, then _build_failed() will still disable the compute service, which is what we don't want in this case.