I know that periodic task can hit a failure for one instance which then blows up the whole task for every instance on that same host, similar to bug 1662867:
https://review.openstack.org/#/c/553067/
In fact, that fix was backported to Ocata but not Newton, so are you sure you're not hitting that issue instead?
Otherwise just generically ignoring DiskNotFound could be risky if we hit that for some other reason than the one you describe (the disk on the host is busted).
I'm inclined to mark this bug as Opinion since "It would be better if this was logged, but the other stats CPU/Memory were able to be updated." is an opinion IMO - one could argue that if the disk is corrupted on the host, we shouldn't be reporting stats on the compute since the scheduler could incorrectly select it for a new build.
Maybe there are other options here? Like maybe adding a counter for how many times we trip over this for an instance that's in steady state (task_state is None) and still on the hypervisor - if we hit that say 10 times we consider it fatal and auto-disable the compute until the operator fixes the problem?
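Roughly what I mean by the counter, as a sketch only. None of these names exist in nova: DiskFailureTracker, MAX_DISK_NOT_FOUND and the compute_disabled flag are made up for illustration, and resetting the count after a clean run is my own assumption rather than anything decided here. The real thing would hook into wherever the periodic catches DiskNotFound and would disable the service through the normal service API rather than flip a flag.

```python
from collections import defaultdict

# Hypothetical threshold: "10" is just the number floated above,
# not an existing nova config option.
MAX_DISK_NOT_FOUND = 10


class DiskFailureTracker:
    """Counts repeated disk lookup failures per instance (illustrative only)."""

    def __init__(self, threshold=MAX_DISK_NOT_FOUND):
        self.threshold = threshold
        self.failures = defaultdict(int)
        # Stand-in for actually disabling the compute service.
        self.compute_disabled = False

    def record_success(self, instance_uuid):
        # Assumption: a clean pass for the instance resets its count.
        self.failures.pop(instance_uuid, None)

    def record_failure(self, instance_uuid, task_state, on_hypervisor):
        """Record one DiskNotFound hit; return True once the compute is disabled."""
        # Only count instances in steady state that are still on this host.
        # A missing disk mid-operation (e.g. during a migration) is expected.
        if task_state is None and on_hypervisor:
            self.failures[instance_uuid] += 1
            if self.failures[instance_uuid] >= self.threshold:
                self.compute_disabled = True
        return self.compute_disabled
```

The periodic would call record_failure() in the DiskNotFound handler and keep going with the remaining instances, so one bad disk no longer blows up the whole run, but a persistent failure still takes the host out of scheduling until the operator steps in.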