Live Migration - if libvirt times out, the instance goes to error state but the live migration continues
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenStack Compute (nova) | Triaged | High | Unassigned | | |
Bug Description
Recently we live migrated an entire cell to new hardware and we hit the following problem several times...
During a live migration, Nova monitors the state of the migration by querying libvirt every 0.5s.
If libvirt times out, the instance is left in a very bad state...
The instance goes to error state. As far as Nova is concerned, the instance is still on the source compute node. However, libvirt continues with the live migration, so the instance will eventually end up on the destination compute node.
I'm using the Stein release, but looking at the current release, the code path seems to be the same.
Here's the Stein trace:
```
Traceback (most recent call last):
File "/usr/lib/
block_
File "/usr/lib/
migrate_data)
File "/usr/lib/
finish_event, disk_paths)
File "/usr/lib/
info = guest.get_
File "/usr/lib/
stats = self._domain.
File "/usr/lib/
result = proxy_call(
File "/usr/lib/
rv = execute(f, *args, **kwargs)
File "/usr/lib/
six.reraise(c, e, tb)
File "/usr/lib/
rv = meth(*args, **kwargs)
File "/usr/lib64/
if ret is None: raise libvirtError ('virDomainGetJ
libvirtError: Timed out during operation: cannot acquire state change lock (held by monitor=
```
I think going into the error state is still correct unless we can somehow recover later.
Do you know if we ever get to post_live_migration? If so, then
https://review.opendev.org/c/openstack/nova/+/791135 should fix it,
and this is likely just another example of https://bugs.launchpad.net/nova/+bug/1628606
We can likely make this more robust, as a timeout on a single iteration of polling the job state should not be sufficient to break the migration.
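
To illustrate the idea of tolerating a timeout on a single polling iteration, here is a minimal sketch. It is not Nova's code: `JobInfoTimeout`, `poll_migration`, `get_job_info` and `max_consecutive_timeouts` are hypothetical names made up for this example, and the state strings are placeholders. The only value taken from the report is the 0.5s polling interval. The loop gives up only after several timeouts in a row, and any successful poll resets the counter.

```python
import time


class JobInfoTimeout(Exception):
    """Hypothetical stand-in for a libvirt 'Timed out during operation' error."""


def poll_migration(get_job_info, max_consecutive_timeouts=5, interval=0.5,
                   sleep=time.sleep):
    """Poll a migration job until it reports a terminal state.

    get_job_info is a hypothetical callable that returns 'running',
    'completed' or 'failed', or raises JobInfoTimeout when the
    hypervisor does not answer in time.
    """
    consecutive_timeouts = 0
    while True:
        try:
            state = get_job_info()
        except JobInfoTimeout:
            consecutive_timeouts += 1
            # Re-raise only when the timeouts persist across several polls.
            if consecutive_timeouts > max_consecutive_timeouts:
                raise
            sleep(interval)
            continue
        # A successful poll means the hypervisor is responsive again.
        consecutive_timeouts = 0
        if state in ("completed", "failed"):
            return state
        sleep(interval)
```

With this shape, a single lock-acquisition timeout delays the monitor by one interval instead of putting the instance into error while the migration is still running. What to do once the limit is exceeded (abort the job, or re-check where the domain actually lives) is a separate decision that this sketch does not address.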