Nova-compute fits instances to NUMA nodes non-optimally, resulting in instance creation failures

Bug #1940668 reported by Ilya Popov
This bug affects 4 people
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Undecided
Assigned to: Ilya Popov

Bug Description

Description
===========

Reproduced in Ussuri; master has the same code.

When nova-compute starts to fit an instance's NUMA topology onto the host's NUMA topology, it uses the host cells list. This list contains cell objects from cell 0 up to cell N, always sorted by cell id (N depends on the number of host NUMA nodes). The only case in which the sort order changes is an instance without a PCI device requirement: if the instance does not need a PCI device bound to a specific NUMA node, the host cells list is reordered to place cells with PCI capabilities at the end. If all NUMA cells have PCI capabilities, the order is left unchanged.
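A simplified sketch of the ordering described above (an illustration only, not Nova's actual code; the HostCell class and its fields are hypothetical):

    from dataclasses import dataclass

    @dataclass
    class HostCell:
        id: int
        has_pci_devices: bool

    def order_host_cells(host_cells, instance_needs_pci):
        # Cells arrive sorted by id: 0, 1, ..., N.
        if instance_needs_pci:
            return host_cells
        # A stable sort pushes PCI-capable cells to the end; if every
        # cell has PCI devices, the order is unchanged.
        return sorted(host_cells, key=lambda c: c.has_pci_devices)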

This behaviour means Nova always tries to place the instance's first NUMA cell onto host NUMA node id 0 first.

If we use huge pages and place several instances whose NUMA node count is smaller than the host's, we completely exhaust NUMA node id 0. Instances with a larger number of NUMA nodes (for example, an instance whose NUMA node count equals the host's) will then fail to fit on this host.

To mitigate this issue, it would be better to take NUMA node memory usage into account.

Possibly also related to:

https://bugs.launchpad.net/nova/+bug/1738501
https://bugs.launchpad.net/nova/+bug/1887377
https://bugs.launchpad.net/nova/+bug/1893121

Steps to reproduce
==================

1. Configure OpenStack to use 2 MB huge pages and allocate the huge pages on a compute host (say, compute 1) during boot.
For Ussuri this is described here: https://docs.openstack.org/nova/ussuri/admin/huge-pages.html

2. Prepare two flavors to test the issue: the first with hw:mem_page_size='2MB', hw:numa_nodes='1';
the second with hw:mem_page_size='2MB', hw:numa_nodes='N', where N is the number of NUMA nodes on the compute host used for testing (the host must have more than one NUMA node).
The flavors' RAM should be large enough to exhaust the RAM of compute NUMA node 0 with a small number of instances; let's say 6 instances of flavor 1 exhaust compute NUMA node 0. Flavor 2's RAM should equal flavor 1's RAM multiplied by N (the number of NUMA nodes on compute 1). Example commands are shown after the steps.

3. Start 6 instances with the first flavor (1 NUMA node defined) on compute 1 (with an availability zone hint pointing to compute 1). The RAM of NUMA node 0 on compute 1 will be exhausted.
4. Try to start an instance with the second flavor. The instance will fail to start with the error "...was re-scheduled: Insufficient compute resources: Requested instance NUMA topology cannot fit the given host NUMA topology".
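The whole reproduction as a hedged, end-to-end sketch. The flavor names, RAM sizes, image and network names, the host name compute1, the huge page count, and N=2 NUMA nodes are all assumptions for illustration:

    # Step 1: reserve 2 MiB huge pages at boot on compute1, e.g. by
    # appending kernel parameters to the host's boot command line:
    #     hugepagesz=2M hugepages=8192

    # Step 2: two test flavors; flavor 2 RAM = flavor 1 RAM * N (N=2).
    openstack flavor create --ram 2048 --vcpus 2 --disk 10 numa-test-1
    openstack flavor set numa-test-1 \
        --property hw:mem_page_size=2MB --property hw:numa_nodes=1
    openstack flavor create --ram 4096 --vcpus 4 --disk 10 numa-test-2
    openstack flavor set numa-test-2 \
        --property hw:mem_page_size=2MB --property hw:numa_nodes=2

    # Step 3: exhaust NUMA node 0 on compute1 with six one-cell
    # instances; 'nova:compute1' pins them to that host.
    for i in 1 2 3 4 5 6; do
        openstack server create --flavor numa-test-1 --image cirros \
            --network private --availability-zone nova:compute1 small-$i
    done

    # Step 4: this boot fails with "Requested instance NUMA topology
    # cannot fit the given host NUMA topology", although node 1 is free.
    openstack server create --flavor numa-test-2 --image cirros \
        --network private --availability-zone nova:compute1 large-1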

How it should work:
===================

We should take the memory usage of the NUMA nodes into account to reduce the number of errors of this kind: the NUMA nodes with the most free RAM should be used first.
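A minimal sketch of that idea, assuming a hypothetical HostCell with total and used memory (not the merged patch itself):

    from dataclasses import dataclass

    @dataclass
    class HostCell:
        id: int
        total_mb: int
        used_mb: int

    def spread_order(cells):
        # Most free memory first: a new one-cell instance then lands on
        # the least-used host NUMA node instead of always on node 0.
        return sorted(cells, key=lambda c: c.total_mb - c.used_mb,
                      reverse=True)

    cells = [HostCell(0, 16384, 12288), HostCell(1, 16384, 2048)]
    print([c.id for c in spread_order(cells)])  # -> [1, 0]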

Ilya Popov (ilya-p)
Changed in nova:
assignee: nobody → Ilya Popov (ilya-p)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/805649

Changed in nova:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/805649
Committed: https://opendev.org/openstack/nova/commit/d13412648d011994a146dac1e7214ead3b82b31b
Submitter: "Zuul (22348)"
Branch: master

commit d13412648d011994a146dac1e7214ead3b82b31b
Author: Ilya Popov <email address hidden>
Date: Mon Aug 23 16:44:25 2021 +0300

    Fix to implement 'pack' or 'spread' VM's NUMA cells

    'Cells' below means NUMA cells.

    By default, the instance's first cell is placed on the host's cell
    with id 0, so that cell is exhausted first; then the host's cell
    with id 1 is used and exhausted. This leads to an error when
    placing an instance whose NUMA topology has as many cells as the
    host, if instances with one-cell topologies were placed on cell
    id 0 before. The fix performs several sorts to put the less-used
    cells at the beginning of the host_cells list, based on PCI
    devices, memory and CPU usage, when
    packing_host_numa_cells_allocation_strategy is set to False (the
    'spread' strategy), or tries to place all of the VM's cells on the
    same host cell until it is completely exhausted and only then
    starts to use the next available host cell (the 'pack' strategy),
    when packing_host_numa_cells_allocation_strategy is set to True.

    Partial-Bug: #1940668
    Change-Id: I03c4db3c36a780aac19841b750ff59acd3572ec6
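With the merged change the behaviour is selected by the new packing_host_numa_cells_allocation_strategy option. A sketch of the relevant nova.conf section, assuming the option lives in the [compute] group:

    [compute]
    # False -> 'spread': try the least-used host NUMA cells first.
    # True  -> 'pack': fill one host NUMA cell before using the next.
    packing_host_numa_cells_allocation_strategy = False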

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/nova/+/829804

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/861832
