On a compute node with 3 GPUs and 2 vgpu groups, nova fails to load second group
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Confirmed | Low | Sylvain Bauza |
Bug Description
Description
===========
We have multiple compute nodes with multiple NVIDIA GPU cards (RTX8000/RTX6000).
Nodes with a mix of RTX8000 and RTX6000 cards have 2 gpu groups configured in nova.conf, but nova-compute only creates resource providers for the first gpu group.
Steps to reproduce
==================
For example, on a node with 2 RTX8000 and 1 RTX6000:
$ lspci | grep -i nvidia
21:00.0 3D controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
81:00.0 3D controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
e2:00.0 3D controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
$ nvidia-smi
Thu Apr 1 17:22:53 2021
+------
| NVIDIA-SMI 460.32.04 Driver Version: 460.32.04 CUDA Version: N/A |
|------
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|======
| 0 Quadro RTX 8000 On | 00000000:21:00.0 Off | 0 |
| N/A 30C P8 27W / 250W | 285MiB / 46079MiB | 0% Default |
| | | N/A |
+------
| 1 Quadro RTX 8000 On | 00000000:81:00.0 Off | 0 |
| N/A 30C P8 27W / 250W | 285MiB / 46079MiB | 0% Default |
| | | N/A |
+------
| 2 Quadro RTX 6000 On | 00000000:E2:00.0 Off | 0 |
| N/A 30C P8 24W / 250W | 150MiB / 23039MiB | 0% Default |
| | | N/A |
+------
Extract from nova.conf:
...
[devices]
enabled_vgpu_types = nvidia-428, nvidia-387
[vgpu_nvidia-428]
device_addresses = 0000:21:
[vgpu_nvidia-387]
device_addresses = 0000:e2:00.0
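To rule out a plain configuration mistake before suspecting nova itself, the extract above can be checked mechanically. The following is a minimal sketch, not nova's own parsing code: it uses Python's standard configparser rather than oslo.config, and check_vgpu_groups is a hypothetical helper name. The sample text is also an assumption: the RTX 8000 addresses are taken from the lspci output, because the device_addresses line for nvidia-428 is cut off in the extract.

```python
import configparser

# Sample config. ASSUMPTION: the nvidia-428 addresses come from the lspci
# output in this report; the original nova.conf line is truncated.
SAMPLE_CONF = """
[devices]
enabled_vgpu_types = nvidia-428, nvidia-387

[vgpu_nvidia-428]
device_addresses = 0000:21:00.0,0000:81:00.0

[vgpu_nvidia-387]
device_addresses = 0000:e2:00.0
"""


def check_vgpu_groups(conf_text):
    """Map each enabled vGPU type to the PCI addresses of its group.

    Raises ValueError when an enabled type has no [vgpu_<type>] section
    or when that section lists no device_addresses.
    """
    # interpolation=None so that '%' characters elsewhere in nova.conf
    # are not treated as configparser interpolation markers.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)

    raw_types = parser.get("devices", "enabled_vgpu_types", fallback="")
    enabled = [t.strip() for t in raw_types.split(",") if t.strip()]

    groups = {}
    for vgpu_type in enabled:
        section = "vgpu_" + vgpu_type
        if not parser.has_section(section):
            raise ValueError("missing section [%s]" % section)
        raw_addrs = parser.get(section, "device_addresses", fallback="")
        addresses = [a.strip() for a in raw_addrs.split(",") if a.strip()]
        if not addresses:
            raise ValueError("no device_addresses in [%s]" % section)
        groups[vgpu_type] = addresses
    return groups


if __name__ == "__main__":
    for vgpu_type, addresses in check_vgpu_groups(SAMPLE_CONF).items():
        print(vgpu_type, addresses)
```

If this check passes for the real nova.conf, both groups are declared consistently and the missing resource providers are unlikely to be caused by the configuration file.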
When nova-compute starts, the log shows:
2021-04-01 17:15:25.454 7 WARNING nova.virt.
And a listing of resource providers on this node shows that only nvidia-428 GPUs were used:
$ openstack resource provider list --os-placement-
+------
| uuid | name | generation | root_provider_uuid | parent_
+------
| f5d35bdc-
| 21a4a16e-
| 76e1ee94-
+------
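A listing like the one above can be compared against the configured PCI addresses to see exactly which GPUs have no resource provider. The sketch below assumes that nova names each physical GPU's child provider by appending the libvirt node device name (for example pci_0000_21_00_0) to the compute node name. The provider names used here are hypothetical, since the listing in this report is truncated, and rp_suffix and missing_providers are illustrative helper names.

```python
def rp_suffix(pci_address):
    """Convert a PCI address such as '0000:E2:00.0' to 'pci_0000_e2_00_0'.

    ASSUMPTION: this mirrors the libvirt node device naming that nova is
    expected to reuse in child resource provider names.
    """
    return "pci_" + pci_address.lower().replace(":", "_").replace(".", "_")


def missing_providers(provider_names, configured_addresses):
    """Return the configured PCI addresses with no matching provider name."""
    missing = []
    for address in configured_addresses:
        suffix = "_" + rp_suffix(address)
        if not any(name.endswith(suffix) for name in provider_names):
            missing.append(address)
    return missing


if __name__ == "__main__":
    # Hypothetical provider names: the root provider plus two GPU children.
    names = [
        "compute1",
        "compute1_pci_0000_21_00_0",
        "compute1_pci_0000_81_00_0",
    ]
    configured = ["0000:21:00.0", "0000:81:00.0", "0000:e2:00.0"]
    print(missing_providers(names, configured))
```

With the hypothetical names above, the only address reported as missing is 0000:e2:00.0, which matches the symptom described here: the GPU belonging to the second group gets no provider.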
In nova.conf, if I swap nvidia-428 & nvidia-387 in enabled_vgpu_types, only nvidia-387 is loaded.
Expected result
===============
All gpu groups should be loaded (as stated in the docs).
Actual result
=============
Only the first gpu group is loaded.
Environment
===========
OpenStack Victoria was deployed with kolla-ansible.
NVIDIA GRID KVM drivers: 12.1 (latest)
System: Ubuntu 20.04.2
nova-compute version: 22.2.1
Hypervisor: libvirt+KVM (libvirt 6.0.0, QEMU/KVM 4.2.1)
Storage: Dell EMC Storage Center (7.3.20.19)
Network: neutron with OVN/OVS
tags: added: vgpu
Looks very similar to https://bugs.launchpad.net/nova/+bug/1900006 which was fixed in Wallaby but not backported into stable/victoria.
Accepting it as valid.