snapshot delete fails on shutdown VM
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | Undecided | Balazs Gibizer |
Queens | New | Undecided | Unassigned |
Rocky | In Progress | Undecided | Unassigned |
Stein | In Progress | Undecided | Unassigned |
Train | Fix Committed | Undecided | Unassigned |
Ussuri | Fix Released | Undecided | Lee Yarwood |
Victoria | Fix Released | Undecided | Lee Yarwood |
Bug Description
Description:
When we try to delete the last snapshot of a VM in shutdown state, the snapshot delete fails and the snapshot gets stuck in the error_deleting state. After resetting the snapshot state to available and deleting it again, the volume is corrupted and the VM will never start again. Volumes are stored on NFS.
(For the root cause and fix, see the bottom of this post.)
To reproduce:
- storage on NFS
- create a VM and some snapshots
- shut down the VM (i.e. the volume is still considered "attached" but the VM is no longer "active")
- delete the last snapshot (a scripted version of these steps is sketched below)
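A minimal scripted version of these steps, as a sketch using openstacksdk: the cloud name "mycloud" and the SERVER_ID/VOLUME_ID placeholders are assumptions for an existing boot-from-volume server whose volume lives on the NFS backend.

# Hedged sketch, not part of the original report.
import openstack

SERVER_ID = "<server uuid>"   # existing VM backed by an NFS volume
VOLUME_ID = "<volume uuid>"   # the attached (root) volume

conn = openstack.connect(cloud="mycloud")
server = conn.compute.get_server(SERVER_ID)

# Create a couple of snapshots of the attached volume (force=True because
# the volume is in-use).
snap1 = conn.block_storage.create_snapshot(volume_id=VOLUME_ID, force=True, name="snap1")
conn.block_storage.wait_for_status(snap1, status="available")
snap2 = conn.block_storage.create_snapshot(volume_id=VOLUME_ID, force=True, name="snap2")
conn.block_storage.wait_for_status(snap2, status="available")

# Shut the VM down: the volume stays "attached" but the domain is no longer active.
conn.compute.stop_server(server)
conn.compute.wait_for_server(server, status="SHUTOFF")

# Deleting the last snapshot is the step that gets stuck in error_deleting.
conn.block_storage.delete_snapshot(snap2)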
Expected result:
The snapshot is deleted and the VM still works.
Actual result:
The snapshot is stuck in the error_deleting state. After setting the snapshot state to available and deleting the snapshot again, the volume is corrupted and the VM will never start again (the qcow2 image on disk points to a non-existent backing file).
Environment:
- OpenStack version: Stein, deployed via kolla-ansible. I suspect it is built from git, but I don't know the exact version.
- hypervisor: libvirt + KVM
- storage: NFS
- networking: Neutron with Open vSwitch
Nova debug logs (the traceback was truncated in the original report; only the recoverable fragments are reproduced here):
2020-02-06 12:20:10.713 6 ERROR nova.virt... [...efault default] [instance: 711651a3-...] ... volume-...
2020-02-06 12:20:10.713 6 ERROR nova.virt...   "...ver.py", line 2726, in volume_...
2020-02-06 12:20:10.713 6 ERROR nova.virt...   "...ver.py", line 2686, in _volume_...
2020-02-06 12:20:10.713 6 ERROR nova.virt...   "...ver.py", line 2519, in _rebase_...
2020-02-06 12:20:10.713 6 ERROR nova.virt...   "...", line 58, in qemu_img_info
2020-02-06 12:20:10.713 6 ERROR nova.virt...   ... 9a2bf0...
2020-02-06 12:20:10.780 6 ERROR oslo_messaging... [...default default] Exception during message handling: DiskNotFound: No disk at volume-...
2020-02-06 12:20:10.780 6 ERROR oslo_messaging... (remaining traceback lines truncated)
Root cause:
When you look at the first line in the debug log ("No disk at volume-..."): during the offline rebase, nova runs qemu_img_info on the backing file reference it received, and that reference does not resolve to an existing file on the compute host, so the operation fails with DiskNotFound.
Code path:
- cinder-... handles the snapshot delete request
- in remotefs.py (used for NFS): if the volume is attached -> call nova snapshot_delete
- nova -> snapshot_delete: if snapshot_to_merge == active -> do a rebase
- if the VM is not active: call _rebase_... (line 2519 in the traceback above)
- the rebase calls qemu_img_info on the backing file it was given, and the call fails because that file cannot be found on the compute host (see the sketch below this list)
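As a rough illustration of that last step (a sketch with assumed names, not the actual Nova code; only nova.virt.images.qemu_img_info and the DiskNotFound error come from the traceback above):

# Sketch only: what the offline rebase effectively does with the backing
# file reference it received from cinder.
from nova.virt import images

def backing_format_for_offline_rebase(rebase_base):
    # rebase_base does not resolve to an existing file on the compute host
    # (e.g. a bare "volume-<uuid>.<snapshot-uuid>" name with no directory
    # component), so qemu_img_info raises
    # "DiskNotFound: No disk at volume-..." as seen in the log above.
    return images.qemu_img_info(rebase_base).file_format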
Potential cinder-based fix:
In cinder, qemu_img_info is typically wrapped so that the backing_file is always converted to a relative path. In that code there are typically two variables: backing_file (relative) and backing_... (absolute).
Patch:
--- /usr/lib/...
+++ /usr/lib/...
@@ -2515,8 +2515,10 @@
         # If the rebased image is going to have a backing file then
         # explicitly set the backing file format to avoid any security
         # concerns related to file format auto detection.
-        backing_file = rebase_base
-        b_file_fmt = images.
+        backing_file = os.path.
+        volume_path = os.path.
+        backing_path = os.path.
+        b_file_fmt = images.
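Since the diff above is truncated, here is a hedged sketch of the idea it appears to implement; the function and variable names below are illustrative, not the actual patch:

# Sketch only: resolve the backing file name against the directory of the
# disk being rebased (the NFS mount as seen from this compute host) before
# asking qemu-img for its format.
import os

from nova.virt import images

def backing_format_for_rebase(rebase_base, active_disk_path):
    # Keep only the file name, mirroring the relative backing_file
    # convention used in cinder's remotefs code.
    backing_file = os.path.basename(rebase_base)
    # Build a path that exists on this host: the backing file sits next to
    # the disk we are rebasing.
    volume_path = os.path.dirname(active_disk_path)
    backing_path = os.path.join(volume_path, backing_file)
    return images.qemu_img_info(backing_path).file_format

With this, b_file_fmt would be derived from a path that actually exists on the compute host, which is exactly what the DiskNotFound traceback shows was missing.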
Note:
We've only been able to test on Stein. However, based on an analysis of the current code in git, this bug is probably still present.
Changed in nova:
  assignee: Balazs Gibizer (balazs-gibizer) → Lee Yarwood (lyarwood)
Changed in nova:
  assignee: Lee Yarwood (lyarwood) → Stephen Finucane (stephenfinucane)
Changed in nova:
  assignee: Stephen Finucane (stephenfinucane) → Balazs Gibizer (balazs-gibizer)
no longer affects: nova/trunk
It seems the indentation of the patch was corrupted in the original bug report. Here's the patch as an attachment.