libvirt live migration sometimes fails with "libvirt.libvirtError: internal error: migration was active, but no RAM info was set"
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenStack Compute (nova) | Fix Released | Undecided | melanie witt | | |
Train | In Progress | Undecided | Unassigned | | |
Ussuri | In Progress | Undecided | Unassigned | | |
Victoria | In Progress | Undecided | Unassigned | | |
Wallaby | In Progress | Undecided | Unassigned | | |
Xena | Fix Released | Undecided | Unassigned | | |
Yoga | Fix Released | Undecided | Unassigned | | |
Zed | Fix Released | Undecided | Unassigned | | |
Ubuntu Cloud Archive | New | Undecided | Unassigned | | |
Ussuri | New | Undecided | Unassigned | | |
Victoria | New | Undecided | Unassigned | | |
Wallaby | New | Undecided | Unassigned | | |
Xena | Fix Released | Undecided | Unassigned | | |
Yoga | Fix Released | Undecided | Unassigned | | |
Zed | Fix Released | Undecided | Unassigned | | |
Bug Description
We have seen this downstream where live migration randomly fails with the following error [1]:
libvirt.libvirtError: internal error: migration was active, but no RAM info was set
Discussion on [1] gravitated toward a possible race condition in qemu around the query-migrate command [2]. The libvirt driver uses query-migrate (indirectly) while monitoring live migrations [3][4][5].
While searching for information about this error, I found an older thread on libvir-list [6] in which someone else encountered the same error; in their case it happened when query-migrate was called *after* a live migration had completed.
Based on this, it seemed possible that our live migration monitoring thread sometimes races and calls jobStats() after the migration has completed, resulting in this error being raised and the migration being considered failed when it was actually complete.
A patch has since been proposed and committed [7] to address the possible issue.
Meanwhile, on our side in nova, we can mitigate this problematic behavior by catching the specific error from libvirt and ignoring it so that a live migration in this situation will be considered completed by the libvirt driver.
Doing this would improve the experience for users that are hitting this error and getting erroneous live migration failures.
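A minimal sketch of the mitigation described above, not the actual nova patch. `FakeLibvirtError` and `FakeDomain` are hypothetical stand-ins so the example runs without libvirt installed, and `get_job_stats` is an illustrative helper name. With the real Python bindings the exception to catch would be `libvirt.libvirtError`, which also provides `get_error_message()`. Matching on the message text is an assumption made for illustration.

```python
# Sketch only: FakeLibvirtError and FakeDomain are hypothetical stand-ins.
# With the real bindings, catch libvirt.libvirtError instead.

NO_RAM_INFO_MSG = "migration was active, but no RAM info was set"


class FakeLibvirtError(Exception):
    """Stand-in for libvirt.libvirtError."""

    def get_error_message(self):
        return str(self)


class FakeDomain:
    """Stand-in for a libvirt domain whose jobStats() call hits the error."""

    def __init__(self, message):
        self._message = message

    def jobStats(self):
        raise FakeLibvirtError(self._message)


def get_job_stats(dom, error_cls=FakeLibvirtError):
    """Return job stats, or None if the 'no RAM info' error was raised.

    Returning None is this sketch's way of signalling "treat the migration
    as completed". Any other libvirt error is re-raised unchanged.
    """
    try:
        return dom.jobStats()
    except error_cls as exc:
        if NO_RAM_INFO_MSG in (exc.get_error_message() or ""):
            return None
        raise


if __name__ == "__main__":
    racy = FakeDomain("internal error: " + NO_RAM_INFO_MSG)
    print(get_job_stats(racy))
```

The key design point is that only this one error is swallowed; any other libvirt failure still propagates, so genuine migration failures are not hidden.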
[1] https:/
[2] https:/
[3] https:/
[4] https:/
[5] https:/
[6] https:/
[7] https:/
Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/852002