Bug #1628168 “Can't assign system with multiple GPUs to differen...” : Bugs : OpenStack Compute (nova)

Kevin (kvasko) wrote on 2016-09-27:

#1

I did some more investigating and found this in nova-all.log. It looks like the device (0f:00.0) is busy; however, it shouldn't be, as the only one in use *should* be 10:00.0.

All devices seem to be claimed by pci-stub, which from my understanding means they can't be claimed by the currently running OS.
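To double-check which driver has actually claimed each card, one option is to read the driver symlink that Linux exposes for every PCI device under sysfs. The sketch below is only an illustration: the helper name bound_driver is made up, and it assumes the cards live in PCI domain 0000, so 0f:00.0 becomes 0000:0f:00.0.

```python
import os

# Bus addresses of the eight GPUs as reported by lspci in this report.
GPU_ADDRS = ["04:00.0", "05:00.0", "06:00.0", "07:00.0",
             "0d:00.0", "0e:00.0", "0f:00.0", "10:00.0"]


def bound_driver(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the kernel driver bound to pci_addr, or None.

    pci_addr must be the full domain:bus:slot.function form used by
    sysfs, e.g. "0000:0f:00.0". A device with no driver bound (or a
    device that does not exist) has no "driver" symlink.
    """
    link = os.path.join(sysfs_root, pci_addr, "driver")
    if not os.path.islink(link):
        return None
    # The link points at .../bus/pci/drivers/<name>; keep only <name>.
    return os.path.basename(os.readlink(link))


if __name__ == "__main__":
    for addr in GPU_ADDRS:
        # Assumption: all cards are in PCI domain 0000.
        print(addr, bound_driver("0000:" + addr))
```

Run on the compute host, this prints one line per card with the driver name (for example pci-stub or vfio-pci) or None if nothing is bound.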

<179>Sep 27 18:53:48 node-13 nova-conductor: 2016-09-27 18:53:48.631 24595 ERROR nova.scheduler.utils [req-dfd5dfe7-ea36-4ce0-8fe7-2412df59db20 11a8bdff50d34c64b2a9fc2b477af74b 81d1532551c2436793417cd7ef0abf35 - - -] [instance: e5fadc3b-6fab-4524-9a35-c8ac954014bd] Error from last host: cirrascale1 (node cirrascale1): [u'Traceback (most recent call last):\n', u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 1926, in _do_build_and_run_instance\n filter_properties)\n', u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2116, in _build_and_run_instance\n instance_uuid=instance.uuid, reason=six.text_type(e))\n', u"RescheduledException: Build of instance e5fadc3b-6fab-4524-9a35-c8ac954014bd was re-scheduled: internal error: process exited while connecting to monitor: 2016-09-27T18:53:46.506916Z qemu-system-x86_64: -device vfio-pci,host=0f:00.0,id=hostdev0,bus=pci.0,addr=0x5: vfio: Error: Failed to setup INTx fd: Device or resource busy\n2016-09-27T18:53:46.507929Z qemu-system-x86_64: -device vfio-pci,host=0f:00.0,id=hostdev0,bus=pci.0,addr=0x5: Device initialization failed\n2016-09-27T18:53:46.507952Z qemu-system-x86_64: -device vfio-pci,host=0f:00.0,id=hostdev0,bus=pci.0,addr=0x5: Device 'vfio-pci' could not be initialized\n\n"]
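Since the interesting part of that log line is buried in the traceback, a small script can pull out which host device qemu failed on and why. This is a rough sketch that assumes the message keeps the shape seen above ("-device vfio-pci,host=ADDR,...: vfio: Error: REASON"); the function name and the regular expression are illustrative only.

```python
import re

# Matches the qemu error shape seen in the log above. The reason stops at
# a real newline or at a backslash, because nova logs the traceback as a
# Python list repr in which newlines appear as a literal "\n".
VFIO_ERROR = re.compile(
    r"-device vfio-pci,host=(?P<host>[0-9a-f:.]+),[^:]*: "
    r"vfio: Error: (?P<reason>[^\\\n]+)"
)


def failed_vfio_devices(log_text):
    """Return (pci_address, reason) pairs for vfio errors in log_text."""
    return [(m.group("host"), m.group("reason"))
            for m in VFIO_ERROR.finditer(log_text)]


# Excerpt of the log line from this comment (raw string keeps the "\n").
SAMPLE = (
    r"qemu-system-x86_64: -device vfio-pci,host=0f:00.0,id=hostdev0,"
    r"bus=pci.0,addr=0x5: vfio: Error: Failed to setup INTx fd: "
    r"Device or resource busy\n2016-09-27T18:53:46.507929Z "
    r"qemu-system-x86_64: -device vfio-pci,host=0f:00.0,id=hostdev0,"
    r"bus=pci.0,addr=0x5: Device initialization failed\n"
)

if __name__ == "__main__":
    for host, reason in failed_vfio_devices(SAMPLE):
        print(host, "->", reason)
```

Feeding the whole nova-all.log through failed_vfio_devices would show whether the same address keeps failing or whether different cards are affected over time.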

Kevin (kvasko) wrote on 2016-09-27:

#2

So a little more information. I was able to get more than 1 VM to start with a GPU attached (e.g. I had 2 VMs, each had 1 GPU attached). I restarted the host VM with the GPUs.

It appears that some of the GPUs are getting into an "in-use" state and won't return.

On the host system that has the GPUs, when I reboot the machine and run lspci -vnnn | grep VGA, all 8 GPUs show up as follows:

04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
07:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
0d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
0e:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
0f:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
10:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])

This is with 0 VM instances running that have a GPU associated with them.

After a fresh reboot, I started 3 VMs, each with 1 GPU attached, stopped them, and started them back up. No issues. I did that a few more times, and then at some point one of the cards showed up like this when running lspci -vnnn | grep VGA:

0d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev ff) (prog-if ff)
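To spot this state without eyeballing the output each time, the lspci lines can be scanned for a revision of ff. The sketch below is a hypothetical helper, not anything nova provides. It assumes lspci prints addresses without a domain prefix, as in the output above. Reading ff here usually suggests the card has stopped answering config-space reads, but that interpretation is an assumption.

```python
import re

# Matches lines in the shape shown above, e.g.
# "0d:00.0 VGA compatible controller [0300]: ... (rev ff) (prog-if ff)".
# Assumption: addresses have no domain prefix (bus:slot.function only).
LSPCI_LINE = re.compile(
    r"^(?P<addr>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s.*\(rev (?P<rev>[0-9a-f]{2})\)"
)


def find_rev_ff(lspci_output):
    """Return the PCI addresses whose revision is reported as ff."""
    bad = []
    for line in lspci_output.splitlines():
        m = LSPCI_LINE.match(line.strip())
        if m and m.group("rev") == "ff":
            bad.append(m.group("addr"))
    return bad


# Two lines copied from the lspci output in this comment.
SAMPLE = """\
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
0d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev ff) (prog-if ff)
"""

if __name__ == "__main__":
    print(find_rev_ff(SAMPLE))
```

On the host, the same function could be fed the real output of lspci -vnnn | grep VGA after each create/delete cycle to catch the moment a card flips.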

I've got 2 machines running with a GPU attached, and at this point any time I try to start another VM with a GPU I get the "no hosts found" error. Here is what I *think* is happening:

After rebooting the host machine, none of the GPUs are in that weird "(rev ff) (prog-if ff)" state. At that point VMs start up fine with a GPU, until one of the GPUs goes into the "(rev ff) (prog-if ff)" state. From then on, any time OpenStack tries to schedule a new VM, it tries to use the GPU that is in the "(rev ff) (prog-if ff)" state, since that GPU is still marked as available in the MySQL database. At that point no other VMs can be created with a GPU.

I'm not sure what is causing the GPUs to go into the (rev ff) (prog-if ff) state. All I am doing is creating the VM, checking that it launches successfully, logging into it, making sure it has a GPU attached, and then deleting it from OpenStack.

I'm testing with the CentOS 7 image from here: http://docs.openstack.org/image-guide/obtain-images.html

I'm going to try to debug this issue some more to see if I can narrow down the cause of the cards going into that odd state.