Nova Compute Manager (Resource update) fails if a disk is missing
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenStack Compute (nova) | In Progress | Low | Vladyslav Drok | | |
Bug Description
===Description===
We recently ran into an issue with the periodic resource update on a KVM hypervisor.
If a disk is missing or unreadable for any reason, the periodic resource updater fails with:
2019-04-16 09:38:29.708 151890 ERROR nova.compute.
This is of course expected, since the disk is missing, but the issue arises from the fact that the nova.compute.
It would be better if this was logged, but the other stats CPU/Memory were able to be updated.
Steps to reproduce
===================
1. Boot an instance on a hypervisor.
2. Shut the instance down.
3. Rename the folder the disk is located in.
4. Watch the logs for the above error.
5. Build additional instances on the same hypervisor.
6. Check whether the hypervisor stats are updated.
Expected results
=================
CPU and memory stats should still be updated, with disk stats perhaps left unchanged or marked as stale.
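
To illustrate the expected behaviour, here is a minimal sketch of isolating a per-instance disk failure so that CPU and memory totals are still aggregated. It is not nova's actual resource tracker code: `HostStats`, `collect_host_stats`, the `get_disk_gb` callback, and the local `DiskNotFound` class are all hypothetical stand-ins invented for this example.

```python
import logging
from dataclasses import dataclass, field

LOG = logging.getLogger(__name__)


class DiskNotFound(Exception):
    """Hypothetical stand-in for the error raised when a disk is missing."""


@dataclass
class HostStats:
    vcpus_used: int = 0
    memory_mb_used: int = 0
    disk_gb_used: int = 0
    # UUIDs of instances whose disk usage could not be read on this pass.
    stale_disk_instances: list = field(default_factory=list)


def collect_host_stats(instances, get_disk_gb):
    """Aggregate usage, tolerating an unreadable disk on any one instance."""
    stats = HostStats()
    for inst in instances:
        # CPU and memory come from the instance record itself, so they do
        # not depend on the disk being readable.
        stats.vcpus_used += inst["vcpus"]
        stats.memory_mb_used += inst["memory_mb"]
        try:
            stats.disk_gb_used += get_disk_gb(inst)
        except DiskNotFound:
            # Log and remember the instance instead of aborting the update.
            LOG.warning("Disk for instance %s is missing; disk stats are stale",
                        inst["uuid"])
            stats.stale_disk_instances.append(inst["uuid"])
    return stats
```

The design choice here is to record which instances had unreadable disks rather than silently dropping them, so an operator (or a later check) can still see that the disk totals are incomplete.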
Actual Results
=================
No stats are updated for the hypervisor.
Environment
===================
Newton OpenStack, although looking at the latest code, I think this is still an issue in the latest release.
tags: | added: compute resource-tracker |
Changed in nova: | |
status: | New → Triaged |
importance: | Undecided → Low |
Changed in nova: | |
assignee: | nobody → Vladyslav Drok (vdrok) |
status: | Triaged → In Progress |
I know that periodic task can hit a failure for one instance which blows up the whole task for all instances on the same host, similar to bug 1662867:
https://review.openstack.org/#/c/553067/
In fact, ^ was backported to Ocata but not Newton so are you sure you're not hitting that issue instead?
Otherwise just generically ignoring DiskNotFound could be risky if we hit that for some other reason than the one you describe (the disk on the host is busted).
I'm inclined to mark this bug as Opinion since "It would be better if this was logged, but the other stats CPU/Memory were able to be updated." is an opinion IMO - one could argue that if the disk is corrupted on the host, we shouldn't be reporting stats on the compute since the scheduler could incorrectly select it for a new build.
Maybe there are other options here? Like maybe adding a counter for how many times we trip over this for an instance that's in steady state (task_state is None) and still on the hypervisor - if we hit that say 10 times we consider it fatal and auto-disable the compute until the operator fixes the problem?
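
A rough sketch of that counter idea, to make the proposal concrete. The `DiskFailureTracker` class and its method names are hypothetical and not part of nova. The default threshold of 10 is just the number floated above, and actually disabling the compute service is left to the caller.

```python
from collections import defaultdict


class DiskFailureTracker:
    """Count consecutive missing-disk failures per steady-state instance."""

    def __init__(self, threshold=10):
        self.threshold = threshold
        self._failures = defaultdict(int)

    def record_failure(self, instance_uuid, task_state):
        """Return True once the instance has failed `threshold` times."""
        if task_state is not None:
            # The instance is mid-operation (e.g. migrating), so a missing
            # disk may be transient; do not count it.
            return False
        self._failures[instance_uuid] += 1
        return self._failures[instance_uuid] >= self.threshold

    def record_success(self, instance_uuid):
        """Reset the counter when the disk is readable again."""
        self._failures.pop(instance_uuid, None)


tracker = DiskFailureTracker(threshold=10)
if tracker.record_failure("hypothetical-uuid", task_state=None):
    # In this sketch the caller decides what "fatal" means, for example
    # disabling the compute service until the operator fixes the disk.
    pass
```

Resetting on success means only consecutive failures count toward the threshold, which avoids disabling a host over occasional transient errors.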