Retry after hitting libvirt error code VIR_ERR_OPERATION_INVALID in live migration.

Bug #1799152 reported by Fan Zhang
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: In Progress
Assigned to: Fan Zhang

Bug Description

When migration of a persistent guest completes, the guest merely shuts
off, but libvirt unhelpfully raises a VIR_ERR_OPERATION_INVALID error
code. In nova, we pretend this case means success. But if the qemu-kvm
process is accidentally killed in the middle of a live migration, for
example by host OOM (rare in our environment, but it has happened a few
times), the domain state becomes SHUTOFF and we get
VIR_ERR_OPERATION_INVALID when calling `self._domain.jobStats()`.

In this situation the migration should be considered failed. Otherwise
the post_live_migration() function starts cleaning up instance files and the customer's data is lost permanently.

IMHO, we may need to "pretend" the migration job is still running after
hitting VIR_ERR_OPERATION_INVALID and retry getting the job stats a few
times, with a configurable retry count. If the migration eventually
succeeds, we stop getting VIR_ERR_OPERATION_INVALID after some retries,
but the error code keeps occurring if the qemu-kvm process was killed.
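
A minimal sketch of the retry idea above, not the proposed patch. It assumes a hypothetical `poll_job_type` callable standing in for the code that calls `self._domain.jobStats()`, and a hypothetical `max_retries` setting. The error class and constants are stand-ins so the example runs without the libvirt Python bindings installed.

```python
import time

# Stand-ins for the libvirt constants (assumptions for illustration only).
OPERATION_INVALID = "VIR_ERR_OPERATION_INVALID"
JOB_FAILED = "VIR_DOMAIN_JOB_FAILED"


class FakeLibvirtError(Exception):
    """Hypothetical stand-in for libvirt.libvirtError."""

    def __init__(self, code):
        super().__init__(code)
        self._code = code

    def get_error_code(self):
        return self._code


def job_type_with_retries(poll_job_type, max_retries=3, delay=0.0):
    """Poll the job type, retrying while OPERATION_INVALID is raised.

    Returns JOB_FAILED if the error persists on every attempt. Any other
    error propagates to the caller unchanged.
    """
    for attempt in range(max_retries + 1):
        try:
            return poll_job_type()
        except FakeLibvirtError as exc:
            if exc.get_error_code() != OPERATION_INVALID:
                raise
            if attempt < max_retries:
                time.sleep(delay)  # wait before asking again
    return JOB_FAILED
```

With `max_retries=2` the callable is polled at most three times before the migration is reported as failed.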

Steps to reproduce
* Run `nova live-migration <uuid>` on the controller node.
* Once the live migration monitor on the source compute node starts to get JobInfo, kill the qemu-kvm process on the source host.
* Check whether post_live_migration on the source host starts to execute.
* Check whether post_live_migration on the destination host starts to execute.
* Check the image files on both the source host and the destination host.

Expected result

Migration should be considered failed.

Actual result

Post live migration on the source host starts to execute and cleans up the instance files. The instance disappears on both the source and destination hosts.

1. My environment is packstack with one controller node and two compute nodes; the OpenStack nova release is Queens.

2. Libvirt + KVM

Logs & Configs

Some logs from after the qemu-kvm process was killed:

2018-09-21 14:08:34.180 11099 DEBUG nova.virt.libvirt.migration [req-d8e0cfab-ea85-4716-a2fe-1307a7004f12 bf015418722f437e9f031efabc7a98e6 ca68d7d736374dbfb38d4ef2f80b2a5c - default default] [instance: ba8feaea-eedc-4b7c-8ffa-01152fc9bde8] Downtime does not need to change update_downtime /usr/lib/python2.7/site-packages/nova/virt/libvirt/
2018-09-21 14:08:34.305 11099 DEBUG nova.virt.libvirt.driver [req-d8e0cfab-ea85-4716-a2fe-1307a7004f12 bf015418722f437e9f031efabc7a98e6 ca68d7d736374dbfb38d4ef2f80b2a5c - default default] [instance: ba8feaea-eedc-4b7c-8ffa-01152fc9bde8] Migration running for 10 secs, memory 100% remaining; (bytes processed=0, remaining=0, total=0) _live_migration_monitor /usr/lib/python2.7/site-packages/nova/virt/libvirt/
2018-09-21 14:08:34.886 11099 DEBUG nova.virt.libvirt.guest [req-d8e0cfab-ea85-4716-a2fe-1307a7004f12 bf015418722f437e9f031efabc7a98e6 ca68d7d736374dbfb38d4ef2f80b2a5c - default default] Domain has shutdown/gone away: Requested operation is not valid: domain is not running get_job_info /usr/lib/python2.7/site-packages/nova/virt/libvirt/
2018-09-21 14:08:34.887 11099 INFO nova.virt.libvirt.driver [req-d8e0cfab-ea85-4716-a2fe-1307a7004f12 bf015418722f437e9f031efabc7a98e6 ca68d7d736374dbfb38d4ef2f80b2a5c - default default] [instance: ba8feaea-eedc-4b7c-8ffa-01152fc9bde8] Migration operation has completed
2018-09-21 14:08:34.887 11099 INFO nova.compute.manager [req-d8e0cfab-ea85-4716-a2fe-1307a7004f12 bf015418722f437e9f031efabc7a98e6 ca68d7d736374dbfb38d4ef2f80b2a5c - default default] [instance: ba8feaea-eedc-4b7c-8ffa-01152fc9bde8] _post_live_migration() is started..

Fan Zhang (fanzhang)
Changed in nova:
assignee: nobody → Fan Zhang (fanzhang)
Fan Zhang (fanzhang)
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master

Changed in nova:
status: New → In Progress
Fan Zhang (fanzhang)
description: updated
tags: added: live-migration
tags: added: libvirt
melanie witt (melwitt) wrote :

This bug looks valid, marking as High as it involves data loss.

Changed in nova:
importance: Undecided → High
sean mooney (sean-k-mooney) wrote :

I don't think I agree that we should retry: if the VM has been killed, it's unlikely that we would be able to recover.

As noted on the review, if you are doing a post-copy migration and either VM is killed in the post-copy phase, you will lose the contents of the guest RAM.

If we are doing a pre-copy migration and the destination VM is killed before we pause the source VM, we could revert the migration, mark it as failed, and allow the VM to continue executing on the source node.
The alternative would be to catch the invalid operation and restart the live migration, unless libvirt internally detects that the destination VM exited and recreates it. I doubt it does that, so if we get an invalid operation when we get the job stats, I don't think a retry will ever succeed.

Fan Zhang (fanzhang) wrote :

In our case, post-copy live migration was not permitted. A pre-copy migration was running, and during the migration the VM (that is, the qemu process) on the *source node* was killed due to host OOM. The domain state was SHUTOFF, so in get_job_info(), self._domain.jobStats() raised the libvirt error 'VIR_ERR_OPERATION_INVALID'. In the existing code, nova assumes the domain has shut down or gone away, so it happily returns JobInfo(type=libvirt.VIR_DOMAIN_JOB_COMPLETED), which eventually triggers post_live_migration() and deletes the source VM's files. That's why I reported this bug.

IMHO, if the qemu-kvm process is killed by source host OOM, libvirt reports VIR_ERR_OPERATION_INVALID because the domain state is SHUTOFF when we call `self._domain.jobStats()`. In this case, the migration job should be considered failed. If the migration succeeds, libvirt also kills the qemu-kvm process and the domain state is SHUTOFF, so we can get VIR_ERR_OPERATION_INVALID as well. In that case we should ignore the error and just return JobInfo with type=VIR_DOMAIN_JOB_COMPLETED. The difference between the two cases is that in the latter we eventually get VIR_ERR_NO_DOMAIN if we try to get the job info a couple more times.
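
The distinction described above can be sketched as a small classifier over the error codes seen on successive polls. This is only an illustration of the reasoning in this comment: `classify_outcome` is a hypothetical helper, and the constants are string stand-ins rather than the values from the libvirt Python bindings.

```python
# Stand-in names for the libvirt constants mentioned above (assumptions;
# the real values live in the libvirt Python bindings).
OPERATION_INVALID = "VIR_ERR_OPERATION_INVALID"
NO_DOMAIN = "VIR_ERR_NO_DOMAIN"
JOB_COMPLETED = "VIR_DOMAIN_JOB_COMPLETED"
JOB_FAILED = "VIR_DOMAIN_JOB_FAILED"


def classify_outcome(observed_error_codes):
    """Classify a migration from error codes seen on successive polls.

    NO_DOMAIN showing up is read as a finished migration. Seeing
    OPERATION_INVALID on every poll is read as the killed-qemu case.
    """
    if not observed_error_codes:
        raise ValueError("need at least one observed error code")
    if NO_DOMAIN in observed_error_codes:
        return JOB_COMPLETED
    if all(code == OPERATION_INVALID for code in observed_error_codes):
        return JOB_FAILED
    raise ValueError("unexpected error code in %r" % (observed_error_codes,))
```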

sean mooney (sean-k-mooney) wrote :

I think this is actually another example of
in that we had a general problem where we treated all events as success.

I agree that it's a bug that we treat it as success and then end up deleting the VM, but not with the retry. When we get VIR_ERR_OPERATION_INVALID, I think we should fail the migration immediately and roll back without retrying.

The fix for has not been backported to Queens yet
but if we look at the change
the primary thing we are doing is looking at the detail of the event to determine whether the
libvirt.VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED event was a success and signals the completion of the migration, or whether it was an error.

By the way, I assume you are referring to stack/nova/blob/stable/queens/nova/virt/libvirt/ when you say "nova thinks the domain is shutdown or gone away, so it happily return JobInfo(type=libvirt.VIR_DOMAIN_JOB_COMPLETED),"

On Queens we treat all the VIR_DOMAIN_EVENT_* events as success, and our heuristic for determining whether a migration succeeded won't handle the OOM case, so we proceed to post live migration when we should have failed the migration and rolled back. When we receive an invalid operation error from libvirt and call find_job_type, we really should end up taking the exception path and return libvirt.VIR_DOMAIN_JOB_FAILED.
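
The fail-fast behaviour suggested here could look roughly like the sketch below. It is not the nova code or the referenced change: `poll_job_type` is a hypothetical callable standing in for the job-stats lookup, and the error class and constants are stand-ins so the example runs without the libvirt Python bindings.

```python
# Stand-ins for the libvirt constants (assumptions for illustration only).
OPERATION_INVALID = "VIR_ERR_OPERATION_INVALID"
JOB_FAILED = "VIR_DOMAIN_JOB_FAILED"


class FakeLibvirtError(Exception):
    """Hypothetical stand-in for libvirt.libvirtError."""

    def __init__(self, code):
        super().__init__(code)
        self._code = code

    def get_error_code(self):
        return self._code


def job_type_fail_fast(poll_job_type):
    """Report JOB_FAILED on the first OPERATION_INVALID, with no retry."""
    try:
        return poll_job_type()
    except FakeLibvirtError as exc:
        if exc.get_error_code() == OPERATION_INVALID:
            return JOB_FAILED  # caller would roll back the migration
        raise
```

Compared with a retry loop, this trades the chance of riding out a transient error for a guarantee that cleanup on the source never starts after an invalid-operation error.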

I'm not sure if backporting to Queens would also solve this issue, but I think that is the direction we should go to address this.

Fan Zhang (fanzhang) wrote :

I was referring to [1] when I said nova returns JobInfo(type=libvirt.VIR_DOMAIN_JOB_COMPLETED). The
key point is that in our case we hit the VIR_ERR_OPERATION_INVALID error when trying to
get job stats via self._domain.jobStats(). In [1], nova already returns JobInfo(type=libvirt.VIR_DOMAIN_JOB_COMPLETED), so post live migration is executed afterwards.

As for the patch you mentioned, I'll check it later. Thanks.

