Description
===========
In Ussuri, when a compute node providing vGPUs (Nvidia GRID in my case) is rebooted, the mdevs for the vGPUs are not recreated, and a libvirt.libvirtError traceback is thrown.
https://paste.ubuntu.com/p/4t4NvTHGd8/
As far as I understand, this should have been fixed by https://review.opendev.org/#/c/715489/, but it seems to fail even before it tries to recreate the mdev.
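For illustration, here is a minimal sketch (not the actual Nova code path) of the kind of libvirt nodedev lookup involved: after a reboot the mediated device no longer exists, so the lookup itself raises libvirt.libvirtError before any recreation logic can run. The UUID below is a placeholder, and the "mdev_" device name is an assumption based on how libvirt 6.x exposes mdevs as node devices.

import libvirt

# Placeholder UUID; libvirt names mdev node devices "mdev_" plus the UUID
# with dashes replaced by underscores (assumed libvirt 6.x naming scheme).
MDEV_UUID = "c60cc62b-16a2-4155-bc13-d4b4c4a9b3df"
DEV_NAME = "mdev_" + MDEV_UUID.replace("-", "_")

conn = libvirt.open("qemu:///system")
try:
    dev = conn.nodeDeviceLookupByName(DEV_NAME)
    # On a healthy host this would tell us the parent PCI device (the GPU).
    print("parent device:", dev.parent())
except libvirt.libvirtError as exc:
    # After a reboot the mdev is gone, so the lookup fails here,
    # before any recreation logic gets a chance to run.
    print("lookup failed:", exc)
finally:
    conn.close()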
Expected result
===============
Upon host reboot, the mdevs should be recreated and the VMs should be restarted.
Actual result
=============
nova-compute throws the aforementioned error, the mdevs are not recreated, and the VMs are left in an unrecoverable state.
Environment
===========
# dnf list installed | grep nova
openstack-nova-common.noarch 1:21.1.0-2.el8 @centos-openstack-ussuri
openstack-nova-compute.noarch 1:21.1.0-2.el8 @centos-openstack-ussuri
python3-nova.noarch 1:21.1.0-2.el8 @centos-openstack-ussuri
python3-novaclient.noarch 1:17.0.0-1.el8 @centos-openstack-ussuri
# dnf list installed | grep libvirt
libvirt-bash-completion.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-client.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-config-nwfilter.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-interface.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-network.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-nodedev.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-nwfilter.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-qemu.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-secret.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-core.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-disk.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-gluster.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-iscsi.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-iscsi-direct.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-logical.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-mpath.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-rbd.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-driver-storage-scsi.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-daemon-kvm.x86_64 6.0.0-25.2.el8 @advanced-virtualization
libvirt-libs.x86_64 6.0.0-25.2.el8 @advanced-virtualization
python3-libvirt.x86_64 6.0.0-1.el8 @advanced-virtualization
Thanks for reporting!
Oh shit, you're right, we can't look up the existing mdev info [1] to know its parent PCI device, since the mdev disappeared after rebooting...
[1] https://github.com/openstack/nova/blob/450213f/nova/virt/libvirt/driver.py#L816
So, honestly, there is no way to know the parent PCI device since we don't persist mdevs, and to be honest, persisting them isn't something Nova should do since it's a KVM/kernel issue.
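For context, a rough sketch of what recreating an mdev by hand involves, using the kernel's mdev sysfs interface: it needs exactly the parent PCI address and vGPU type that are lost once the host reboots. The PCI address, type name and UUID below are placeholders, not values from this report.

import os

PARENT_PCI = "0000:3b:00.0"   # physical GPU hosting the vGPU (placeholder)
MDEV_TYPE = "nvidia-222"      # GRID vGPU type (placeholder)
MDEV_UUID = "c60cc62b-16a2-4155-bc13-d4b4c4a9b3df"  # must match the guest XML

create_path = os.path.join(
    "/sys/class/mdev_bus", PARENT_PCI,
    "mdev_supported_types", MDEV_TYPE, "create")

# Writing the UUID to the "create" attribute asks the kernel mdev framework
# to instantiate the mediated device again; this must run as root on the
# compute node before the guest is started.
with open(create_path, "w") as f:
    f.write(MDEV_UUID)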
Leaving the bug open until we figure out a good way to either document or fix this.