Reports extra GPU devices to Placement when only one type is set in enabled_vgpu_types

Bug #1943934 reported by Wenping Song
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Triaged
Importance: Low
Assigned to: Wenping Song

Bug Description

If two GPU devices are virtualized on the host, and only one type is configured in enabled_vgpu_types (with its device_addresses), Nova reports both GPU devices to Placement. It should report only the GPUs listed in device_addresses.

Tags: vgpu
Wenping Song (wenping1)
Changed in nova:
assignee: nobody → Wenping Song (wenping1)
Sylvain Bauza (sylvain-bauza) wrote:

That's because we don't check the per-type sections when only one type is enabled, so we create inventories for all the GPUs.

As a workaround, you can enable a fake vGPU type and create a section for it, like:

[devices]
enabled_mdev_types = nvidia-235,fake
[mdev_nvidia-235]
device_addresses = 0000:84:00.0,0000:85:00.0

[mdev_fake]
device_addresses = 0000:00:00.1
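To see why the workaround helps, here is a minimal sketch that reads a config like the one above and builds the PCI address to type mapping from the per-type sections. It assumes the options sit under `[devices]` and that the fake type lists a single non-existent address. It uses Python's standard `configparser` as a stand-in for oslo.config, the helper names are hypothetical, and `0000:86:00.0` is a made-up third GPU used only to show that unlisted devices are left out. This is not nova's code.

```python
import configparser
from typing import Dict, List

# The workaround config, with the options placed under [devices] and a
# single (non-existent) address for the fake type.
WORKAROUND_CONF = """
[devices]
enabled_mdev_types = nvidia-235,fake

[mdev_nvidia-235]
device_addresses = 0000:84:00.0,0000:85:00.0

[mdev_fake]
device_addresses = 0000:00:00.1
"""


def _split(value: str) -> List[str]:
    """Split a comma-separated option value, dropping blanks."""
    return [item.strip() for item in value.split(",") if item.strip()]


def address_to_type(conf_text: str) -> Dict[str, str]:
    """Map each PCI address listed in an [mdev_<type>] section to its type."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    mapping: Dict[str, str] = {}
    for mdev_type in _split(parser.get("devices", "enabled_mdev_types")):
        section = f"mdev_{mdev_type}"
        if not parser.has_section(section):
            continue  # no dynamic section for this type
        for addr in _split(parser.get(section, "device_addresses", fallback="")):
            if addr in mapping:
                # In this sketch an address listed twice (even for the same
                # type) is rejected.
                raise ValueError(f"duplicate types for PCI ID {addr}")
            mapping[addr] = mdev_type
    return mapping


def reported_gpus(mapping: Dict[str, str], host_gpus: List[str]) -> Dict[str, str]:
    """Keep only host GPUs that appear in the mapping."""
    return {addr: mapping[addr] for addr in host_gpus if addr in mapping}


if __name__ == "__main__":
    host = ["0000:84:00.0", "0000:85:00.0", "0000:86:00.0"]  # 86:00.0 is hypothetical
    print(reported_gpus(address_to_type(WORKAROUND_CONF), host))
    # {'0000:84:00.0': 'nvidia-235', '0000:85:00.0': 'nvidia-235'}
```

Because two types are enabled, the sections are consulted, and the fake type's address matches no real device, so only the GPUs listed for nvidia-235 would be reported.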

That said, we could fix this issue by still checking the section even when only one type is enabled, while not raising an exception if the section is missing.
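The difference between the current behavior and the fix suggested here can be sketched as a simplified Python model. This is not nova's code: the function names, the plain-dict stand-in for the `[mdev_<type>]` sections and the PCI addresses are hypothetical, and the multi-type branch ignores nova's other fallbacks (for example, what happens when one of several enabled types has no section).

```python
from typing import Dict, List

Sections = Dict[str, List[str]]  # mdev type -> device_addresses from [mdev_<type>]


def _map_from_sections(enabled_types: List[str], sections: Sections) -> Dict[str, str]:
    """Build a PCI address -> mdev type mapping from the per-type sections."""
    mapping: Dict[str, str] = {}
    for mdev_type in enabled_types:
        for addr in sections.get(mdev_type, []):
            if addr in mapping:
                raise ValueError(f"duplicate types for PCI ID {addr}")
            mapping[addr] = mdev_type
    return mapping


def gpus_to_report_current(enabled_types: List[str], sections: Sections,
                           host_gpus: List[str]) -> Dict[str, str]:
    """Reported behavior: with one enabled type the section is never read,
    so every physical GPU on the host gets an inventory."""
    if not enabled_types:
        return {}
    if len(enabled_types) == 1:
        return {addr: enabled_types[0] for addr in host_gpus}
    mapping = _map_from_sections(enabled_types, sections)
    return {addr: mapping[addr] for addr in host_gpus if addr in mapping}


def gpus_to_report_proposed(enabled_types: List[str], sections: Sections,
                            host_gpus: List[str]) -> Dict[str, str]:
    """Proposed behavior: always read the sections, but when the single
    enabled type has no section, fall back to all GPUs instead of raising."""
    if not enabled_types:
        return {}
    if len(enabled_types) == 1 and not sections.get(enabled_types[0]):
        return {addr: enabled_types[0] for addr in host_gpus}
    mapping = _map_from_sections(enabled_types, sections)
    return {addr: mapping[addr] for addr in host_gpus if addr in mapping}


if __name__ == "__main__":
    host = ["0000:84:00.0", "0000:85:00.0"]      # hypothetical host with two GPUs
    sections = {"nvidia-235": ["0000:84:00.0"]}  # only one GPU configured
    print(gpus_to_report_current(["nvidia-235"], sections, host))
    # {'0000:84:00.0': 'nvidia-235', '0000:85:00.0': 'nvidia-235'}
    print(gpus_to_report_proposed(["nvidia-235"], sections, host))
    # {'0000:84:00.0': 'nvidia-235'}
```

Keeping the "no section" case as a fallback rather than an error preserves the existing single-type setups that never defined a section.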

Changed in nova:
importance: Undecided → Low
status: New → Triaged
tags: added: vgpu
OpenStack Infra (hudson-openstack) wrote: Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/814743

Pavlo Shchelokovskyy (pshchelo) wrote:

This also affects MIG-enabled NVIDIA GPUs in particular:

In the MIG case, the card is exposed as 16 VFs through SR-IOV, but it cannot actually support more than 7 instances of the smallest possible mdev type.
If I want to expose only one vGPU type from a compute node, Nova creates and reports all 16 VFs as resource providers to Placement, so capacity is over-reported.
Also, when only a single mdev type is enabled, one cannot set a custom resource class name for it; only the default 'VGPU' resource class is available.

That then requires an extra workaround with resource provider traits, which could have been avoided if we simply parsed those dynamic config sections, when they exist, even when only a single mdev type is enabled.
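A rough sketch of the capacity arithmetic for the MIG case, assuming each VF becomes its own resource provider with an inventory of 1 VGPU (an assumption, as is the hypothetical PCI numbering of the VFs). Ignoring the section reports 16 units, while honoring a section that lists 7 VF addresses would report 7.

```python
from typing import List, Optional


def reported_vgpu_capacity(vf_addresses: List[str],
                           device_addresses: Optional[List[str]],
                           per_vf_inventory: int = 1) -> int:
    """Total VGPU inventory summed over per-VF resource providers.

    device_addresses=None models the current single-type behavior, where the
    [mdev_<type>] section is ignored; a list models a section that is honored.
    per_vf_inventory=1 is an assumption for MIG-backed VFs.
    """
    if device_addresses is None:
        selected = vf_addresses
    else:
        wanted = set(device_addresses)
        selected = [addr for addr in vf_addresses if addr in wanted]
    return len(selected) * per_vf_inventory


if __name__ == "__main__":
    # 16 hypothetical VF addresses: 0000:41:00.0 ... 0000:41:01.7
    vfs = [f"0000:41:{i // 8:02x}.{i % 8}" for i in range(16)]
    print(reported_vgpu_capacity(vfs, None))     # 16, all VFs reported
    print(reported_vgpu_capacity(vfs, vfs[:7]))  # 7, only the listed VFs
```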
