Unshelving a VM breaks instance metadata when using qcow2 backed images
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | Medium | Alexandre arents |
Ocata | Confirmed | Medium | Unassigned |
Pike | Confirmed | Medium | Unassigned |
Train | Fix Released | Medium | Lee Yarwood |
Ussuri | Fix Released | Medium | Lee Yarwood |

Bug Description
If you unshelve instances on compute nodes that use qcow2 backed images, the instance image_ref will point to the original image the VM was launched from. The base file for /var/lib/
Steps to reproduce/what happens:
Have at least 2 compute nodes configured with the standard qcow2 backed images.
1) Launch an instance.
2) Shelve the instance. In the background, this should create a flattened snapshot of the VM.
3) Unshelve the instance. The instance will boot on one of the compute nodes. The /var/lib/
4) Resize/migrate the instance. /var/lib/
5a) If the instance was running: When nova tries to start the VM, it will copy the original base image to the new compute node, not the snapshot base image. The instance can't boot, since it doesn't find its actual base file, and it goes to an ERROR state.
5b) If the instance was shut down: You can confirm the resize, but the VM won't start. The snapshot base file may be removed from the source machine, causing data loss.
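The failure in steps 1 to 5 can be sketched with a small toy model. This is not nova code: the `ComputeNode` and `Instance` classes, the function names, and the image ids are all hypothetical, and the model only assumes what the report describes, namely that after unshelve the disk is backed by the snapshot while image_ref still names the original image, and that a migration copies only the base file for image_ref.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeNode:
    name: str
    # Image ids for which this node holds a cached base file (toy model).
    base_files: set = field(default_factory=set)


@dataclass
class Instance:
    image_ref: str       # image id recorded in the instance metadata
    backing_image: str   # image id the qcow2 disk is actually backed by
    host: ComputeNode


def launch(image_id, node):
    node.base_files.add(image_id)
    return Instance(image_ref=image_id, backing_image=image_id, host=node)


def shelve_and_unshelve(instance, snapshot_id, node):
    # Per the report: the disk is now backed by the flattened snapshot,
    # but image_ref still points at the image the VM was launched from.
    node.base_files.add(snapshot_id)
    instance.backing_image = snapshot_id
    instance.host = node


def migrate(instance, dest):
    # Only the base file matching image_ref is copied to the destination.
    dest.base_files.add(instance.image_ref)
    instance.host = dest


def can_boot(instance):
    # The guest can start only if its real backing file exists on its host.
    return instance.backing_image in instance.host.base_files


node_a, node_b = ComputeNode("a"), ComputeNode("b")
vm = launch("original-image", node_a)
shelve_and_unshelve(vm, "shelve-snapshot", node_a)
booted_before = can_boot(vm)
migrate(vm, node_b)
booted_after = can_boot(vm)
```

In this model the instance boots after the unshelve but not after the migration, because the destination node received the base file for image_ref rather than the one the disk depends on.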
What should have happened:
Either the instance image_ref should be updated to the snapshot image, or the snapshot image should be rebased to the original image, or it should force a raw only image after unshelve, or something else you smart people come up with.
Environment:
RDO Neutron with KVM
rpm -qa |grep nova
openstack-
python2-
python-
openstack-
Also a big thank you to Toni Peltonen and Anton Aksola from nebula.fi for discovering and debugging this issue.
Nice analysis.
You're correct that when we shelve an instance, we create a snapshot image, starting in the API:
https://github.com/openstack/nova/blob/b6a245f0425a07be3871a976952646d2bdd44533/nova/compute/api.py#L3244
That snapshot image_id is passed down to the compute service to do the actual snapshot and upload from the virt driver:
https://github.com/openstack/nova/blob/b6a245f0425a07be3871a976952646d2bdd44533/nova/compute/manager.py#L4598
We then store that snapshot image_id in the instance system_metadata for later when it's unshelved:
https://github.com/openstack/nova/blob/b6a245f0425a07be3871a976952646d2bdd44533/nova/compute/manager.py#L4601
When we unshelve, we get that snapshot image from glance:
https://github.com/openstack/nova/blob/b6a245f0425a07be3871a976952646d2bdd44533/nova/conductor/manager.py#L641
We then use that to update the instance.image_ref field to point at the snapshot image:
https://github.com/openstack/nova/blob/b6a245f0425a07be3871a976952646d2bdd44533/nova/compute/manager.py#L4764
It looks like the problem is that we then reset the instance.image_ref to the old image id it had before the unshelve:
https://github.com/openstack/nova/blob/b6a245f0425a07be3871a976952646d2bdd44533/nova/compute/manager.py#L4796
I have no idea why we do that, and that's probably the bug.
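The sequence described in this comment can be illustrated with a minimal sketch. It is a simplified stand-in, not the code behind the links: the instance is a plain dict, the `shelve` and `unshelve` functions and the `reset_image_ref` flag are hypothetical, and the `shelved_image_id` key name is an assumption made for the example.

```python
# Assumed key name for this sketch; the comment only says the snapshot
# image id is stored in the instance system_metadata.
SHELVED_KEY = "shelved_image_id"


def shelve(instance, snapshot_image_id):
    # Remember which snapshot image the shelved instance was uploaded as.
    instance["system_metadata"][SHELVED_KEY] = snapshot_image_id


def unshelve(instance, reset_image_ref=True):
    original = instance["image_ref"]
    snapshot = instance["system_metadata"].pop(SHELVED_KEY)

    # image_ref is pointed at the snapshot while the guest is spawned...
    instance["image_ref"] = snapshot
    spawned_from = instance["image_ref"]

    # ...and then reset to the pre-shelve image id, which is the step the
    # comment suspects of being the bug.
    if reset_image_ref:
        instance["image_ref"] = original
    return spawned_from


vm = {"image_ref": "original-image", "system_metadata": {}}
shelve(vm, "shelve-snapshot")
spawned_from = unshelve(vm)
mismatch = spawned_from != vm["image_ref"]
```

With the reset in place, the guest is spawned from the snapshot while image_ref ends up naming the original image again, which matches the mismatch reported in the bug description. Calling `unshelve(vm, reset_image_ref=False)` in this toy model keeps the two in agreement; whether that is the right fix for nova is a separate question.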