compareHypervisorCPU() incompatibility during live migration
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenStack Compute (nova) | Triaged | Medium | Unassigned | | |
Bug Description
Description
===========
Live migration fails with
Refer to http://
2023-10-17 08:28:15.301 2 ERROR oslo_messaging.
2023-10-17 08:28:15.301 2 ERROR oslo_messaging.
2023-10-17 08:28:15.301 2 ERROR oslo_messaging.
2023-10-17 08:28:15.301 2 ERROR oslo_messaging.
2023-10-17 08:28:15.301 2 ERROR oslo_messaging.
2023-10-17 08:28:15.301 2 ERROR oslo_messaging.
[...]
2023-10-17 08:28:15.301 2 ERROR oslo_messaging.
2023-10-17 08:28:15.301 2 ERROR oslo_messaging.
2023-10-17 08:28:15.301 2 ERROR oslo_messaging.
If skip_cpu_
Steps to reproduce
==================
* boot a simple cirros VM
* openstack server migrate --live --block-migration <vm>
Environment
===========
OpenStack: 2023.1.
libvirt version: 9.5.0
QEMU: 8.1.0
Hypervisors: two CentOS Stream 9 VMs with nested KVM enabled
nova compute is configured with cpu_mode=host-model
Triage
======
During the pre_live_migration check running on the destination node, nova sees that the guest has no vcpu_model set in the DB and therefore falls back to a host CPU model based comparison [1]. The host cpu_info used there is collected with getCapabilities() from libvirt [2], and on this system that returns SandyBridge. On the other hand, the guest VM is running as Broadwell (note nova is configured with cpu_mode=
There are two reasons for the failure:
1) nova uses getCapabilities() to determine the host CPU model but uses the model from the domCapabilities for a guest VM using host-model. According to the libvirt maintainers, nova should no longer use getCapabilities for anything.
2) nova falls back to a host CPU based comparison if the guest vcpu_model is not filled in the nova DB. But for live migration the guest CPU model should be available, as the guest exists and is running on the source node.
[1] https:/
[2] https:/
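To illustrate reason 1), here is a minimal sketch of how the two libvirt XML documents can name different CPU models on the same host. The XML snippets are hand-written and heavily trimmed for illustration (they are not captured from the affected hypervisors), and the helper names are hypothetical. Only the element layout follows what libvirt reports from getCapabilities() and from the domain capabilities.

```python
import xml.etree.ElementTree as ET

# Illustrative, trimmed XML. NOT real output from the affected hosts.
# Shape of what getCapabilities() returns:
CAPABILITIES_XML = """
<capabilities>
  <host>
    <cpu>
      <arch>x86_64</arch>
      <model>SandyBridge</model>
      <vendor>Intel</vendor>
    </cpu>
  </host>
</capabilities>
"""

# Shape of the domain capabilities XML, which describes what the
# hypervisor (QEMU/KVM) can actually offer to a guest:
DOM_CAPABILITIES_XML = """
<domainCapabilities>
  <cpu>
    <mode name='host-passthrough' supported='yes'/>
    <mode name='host-model' supported='yes'>
      <model fallback='forbid'>Broadwell</model>
      <vendor>Intel</vendor>
    </mode>
  </cpu>
</domainCapabilities>
"""


def model_from_capabilities(xml_text):
    """Return the host CPU model named in capabilities XML."""
    return ET.fromstring(xml_text).findtext("./host/cpu/model")


def model_from_domcaps(xml_text):
    """Return the host-model CPU model named in domain capabilities XML."""
    root = ET.fromstring(xml_text)
    return root.findtext("./cpu/mode[@name='host-model']/model")


if __name__ == "__main__":
    print("getCapabilities model: ", model_from_capabilities(CAPABILITIES_XML))
    print("domCapabilities model: ", model_from_domcaps(DOM_CAPABILITIES_XML))
```

If the guest is created from the second document but the compatibility check uses the first, the two sides of the comparison are not describing the same thing, which matches the SandyBridge vs Broadwell mismatch seen here.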
tags: | added: libvirt live-migration |
setting this to medium instead of high as we have a workaround config option that will provide an escape hatch for those who encounter this on upgrade, however this is important to fix and backport.
the bug is caused because we are using a different method to select the guest's CPU when creating the VM vs when live migrating the VM
we should use the same method in both cases, and since we have moved to using domcaps as the new way for guest CPU selection we should consolidate on that for CPU compare too.
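a rough sketch of what "consolidate on domcaps" could look like, not the actual nova change: build the CPU definition from the host-model entry in the domain capabilities XML and hand that to compareHypervisorCPU() on the destination. The helper names are hypothetical, and `conn` is assumed to be a libvirt-python virConnect for the destination host. Passing None for emulator, arch, machine and virttype is assumed to let libvirt pick its defaults.

```python
import xml.etree.ElementTree as ET


def guest_cpu_xml_from_domcaps(domcaps_xml):
    """Build a <cpu> definition from the host-model mode in domcaps XML.

    Hypothetical helper: copies the model, vendor and feature elements
    that the hypervisor reports for host-model.
    """
    mode = ET.fromstring(domcaps_xml).find("./cpu/mode[@name='host-model']")
    if mode is None or mode.get("supported") != "yes":
        raise ValueError("host-model is not reported as supported")
    cpu = ET.Element("cpu", {"mode": "custom", "match": "exact"})
    for child in mode:
        if child.tag in ("model", "vendor", "feature"):
            cpu.append(child)
    return ET.tostring(cpu, encoding="unicode")


def is_cpu_compatible(conn, cpu_xml):
    """Ask the destination hypervisor whether it can provide cpu_xml.

    Assumes conn behaves like libvirt.virConnect. The libvirt result
    constants are VIR_CPU_COMPARE_INCOMPATIBLE (0),
    VIR_CPU_COMPARE_IDENTICAL (1) and VIR_CPU_COMPARE_SUPERSET (2),
    so any positive value means the CPU can be provided.
    """
    result = conn.compareHypervisorCPU(None, None, None, None, cpu_xml, 0)
    return result > 0
```

the point of the sketch is only that both guest creation and the migration check would start from the same source of truth (domcaps), so the model names cannot drift apart the way they do in this report.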