Nova scheduler randomly fails to schedule CPU-pinned instance flavors with hugepages; failures increase as the running instance count grows
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Confirmed | Low | Unassigned |
Bug Description
Description
===========
The issue is isolated to a single hypervisor: the Nova scheduler randomly fails to schedule CPU-pinned instance flavors with hugepages, and failures increase as the running instance count grows.
Steps to reproduce
==================
1) Hypervisor with two NUMA nodes, 2x Intel Gold 6126, 256GB RAM (128GB in each NUMA node), 61440 x 2M hugepages in each node. The hypervisor runs nothing other than OpenStack.
2) Flavor specified with (see the creation sketch after this list):
- 4 vCPUs
- 20480 MB RAM
- hw:cpu_policy dedicated
- hw:cpu_
- hw:mem_page_size 2MB
3) Try to schedule 12 instances of this flavor.
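For reference, a minimal sketch of how such a flavor could be created (the flavor name "pinned.20g" and the 20GB disk are placeholder assumptions, and only the two fully recoverable extra specs are set, since one hw:cpu_* key is truncated above):

    # Hypothetical flavor matching the report; name and disk size are assumptions
    openstack flavor create pinned.20g \
        --vcpus 4 \
        --ram 20480 \
        --disk 20 \
        --property hw:cpu_policy=dedicated \
        --property hw:mem_page_size=2MB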
Expected result
===============
12 instances running on the hypervisor, neatly packed and using up all hugepages: each instance needs 20480 MB / 2 MB = 10240 hugepages, so 12 instances consume 122880 pages, exactly the 2 x 61440 pages available.
Actual result
=============
NUMA node 0 fills up, while NUMA node 1 ends up with only 2-3 instances; the exact count varies from attempt to attempt.
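For anyone reproducing this, one way to observe the imbalance is to compare free vs. total 2M hugepages per NUMA node on the hypervisor (standard Linux sysfs paths, not taken from this report):

    # Show free vs. total 2M hugepages for each NUMA node
    for n in /sys/devices/system/node/node*; do
        echo "$n: $(cat $n/hugepages/hugepages-2048kB/free_hugepages) free of $(cat $n/hugepages/hugepages-2048kB/nr_hugepages)"
    done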
Workaround
==========
Leave all running instances as they are and keep scheduling more instances until the desired number have been created successfully; a sketch of such a retry loop follows below. (It took 32 create attempts to fill all 12 slots for me.)
The problem does not occur if hugepages are disabled in the flavor and on the hypervisor.
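A rough sketch of that retry workaround (flavor, image, and network names are placeholders; it keeps booting until 12 instances are ACTIVE and deletes any attempt that fails):

    # Hypothetical retry loop; pinned.20g, centos7 and net0 are placeholder names
    i=0
    while [ "$(openstack server list --status ACTIVE -f value -c ID | wc -l)" -lt 12 ]; do
        i=$((i + 1))
        openstack server create --flavor pinned.20g --image centos7 \
            --network net0 --wait "test-$i" \
            || openstack server delete "test-$i"
    done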
Environment
===========
Running OpenStack Ocata, RDO packages on CentOS 7.4.
Linux 3.10.0-
nova 15.0.7
Compute:
openstack-
Ctrl:
openstack-
python2-
python-
openstack-
openstack-
openstack-
openstack-
openstack-
openstack-
Using Libvirt+KVM
libvirt 3.2.0-14.el7_4 (ev)
qemu 2.9.0-16.el7_4 (ev)
Storage is pure qcow2 on /var/lib/nova
Neutron with linuxbridge-agent for networking.
tags: added: sched
tags: added: scheduler; removed: sched
tags: added: libvirt numa; removed: scheduler
Changed in nova:
status: New → Confirmed
importance: Undecided → Low
Quick question here: does it solve the problem if you set a number that is lower than or equal to the number of available computes that have available resources? I.e., you have 6 computes with available resources and you use "--max-count 6" or "--count 6"?
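For what it's worth, a hedged example of that suggestion with the legacy nova CLI (flavor and image names are placeholders; --min-count/--max-count are the novaclient spellings):

    # Ask nova to boot up to 6 instances in a single request
    nova boot --flavor pinned.20g --image centos7 \
        --min-count 1 --max-count 6 test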