I'm trying to think of a simple way to mark at least some failures as not counting against the consecutive build failure count. The easy case is volume over-quota, when booting from volume and nova is trying to create a volume. We handle that OverQuota here and raise BuildAbortException:
https://github.com/openstack/nova/blob/7bdb7dbbddf9fcb4284d490bf315d6756f4015e7/nova/compute/manager.py#L2209
That BuildAbortException is eventually handled here:
https://github.com/openstack/nova/blob/7bdb7dbbddf9fcb4284d490bf315d6756f4015e7/nova/compute/manager.py#L1903
The problem is that the handling method essentially returns an enum (build_results.FAILED), and that's what is checked when counting build failures to disable the compute:
https://github.com/openstack/nova/blob/7bdb7dbbddf9fcb4284d490bf315d6756f4015e7/nova/compute/manager.py#L1750
So the code in _build_failed() has no context about the actual failure, which makes whitelisting certain types of failures hard.
I was thinking about this block, where we raise BuildAbortException:
https://github.com/openstack/nova/blob/7bdb7dbbddf9fcb4284d490bf315d6756f4015e7/nova/compute/manager.py#L2209
If we were handling an OverQuota error there, we could set a flag on the BuildAbortException and check it later, up the stack, in the _build_failed() logic. But since _build_failed() doesn't get the actual exception, just the build_results.FAILED enum (which only exists for the "build_instance" hook, by the way), we can't do that.
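To make the lost-context problem concrete, here is a rough standalone sketch, not the actual nova code. All names in it (BuildResult, BuildAbort, the count_as_failure flag, Manager) are made up for illustration. It shows a flag set on the abort exception being dropped as soon as the handler turns the exception into a plain enum value:

```python
import enum


class BuildResult(enum.Enum):
    # Stand-in for the build_results values; only two are needed here.
    ACTIVE = "active"
    FAILED = "failed"


class OverQuota(Exception):
    """Stand-in for the over-quota error raised while creating a volume."""


class BuildAbort(Exception):
    """Stand-in for BuildAbortException with a hypothetical flag."""

    def __init__(self, reason, count_as_failure=True):
        super().__init__(reason)
        self.count_as_failure = count_as_failure


def prep_block_devices(over_quota):
    # Deep in the stack: OverQuota is translated into an abort, and the
    # flag records that this is not the compute host's fault.
    try:
        if over_quota:
            raise OverQuota("volume quota exceeded")
    except OverQuota as exc:
        raise BuildAbort(str(exc), count_as_failure=False) from exc


def do_build(over_quota):
    # Middle of the stack: the exception is swallowed and only an enum
    # value is returned, so the flag never travels any further.
    try:
        prep_block_devices(over_quota)
    except BuildAbort:
        return BuildResult.FAILED
    return BuildResult.ACTIVE


class Manager:
    def __init__(self):
        self.failed_builds = 0

    def build(self, over_quota):
        result = do_build(over_quota)
        # Top of the stack: only the enum is visible, so an over-quota
        # abort is counted like any other failure.
        if result is BuildResult.FAILED:
            self.failed_builds += 1
        else:
            self.failed_builds = 0
        return result
```

In this toy version, Manager().build(over_quota=True) still bumps failed_builds, even though the abort was flagged as not counting. Either the return value or the counting code would have to carry more than the enum for the flag to be usable.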
One super hacky thing we could do, if we get OverQuota during _prep_block_devices, is reset self._failed_builds to 0, like here:
https://github.com/openstack/nova/blob/7bdb7dbbddf9fcb4284d490bf315d6756f4015e7/nova/compute/manager.py#L2209
But if consecutive_build_service_disable_threshold is set to 1, then _build_failed() will still disable the compute service, which is exactly what we don't want in this case.
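A small toy model of why the reset hack breaks down at a threshold of 1. This is a simplified sketch under my reading of the counting logic (increment, then compare against the threshold), not the real manager code. FailureCounter and over_quota_build are hypothetical names:

```python
class FailureCounter:
    """Toy stand-in for the consecutive-build-failure bookkeeping."""

    def __init__(self, threshold):
        # Models consecutive_build_service_disable_threshold.
        self.threshold = threshold
        # Models self._failed_builds.
        self.failed_builds = 0
        self.disabled = False

    def reset(self):
        # The hack: wipe the counter when OverQuota is seen.
        self.failed_builds = 0

    def build_failed(self):
        # Still runs afterwards, because the build reports FAILED.
        self.failed_builds += 1
        if self.threshold and self.failed_builds >= self.threshold:
            self.disabled = True


def over_quota_build(counter):
    # Order of events for one over-quota build: reset deep in the
    # stack, then the failure is counted at the top of the stack.
    counter.reset()
    counter.build_failed()
    return counter.disabled
```

With FailureCounter(1), a single over-quota build goes from 0 to 1 and hits the threshold right away, so the service is disabled despite the reset. With a threshold of 3 the same build leaves the counter at 1 and the service enabled.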