Ironic node rebalance race can lead to missing compute nodes in DB

Bug #1853009 reported by Mark Goddard
This bug affects 12 people
Affects                   Status        Importance  Assigned to   Milestone
OpenStack Compute (nova)  Fix Released  High        Mark Goddard
Ocata                     New           Undecided   Unassigned
Pike                      New           Undecided   Unassigned
Queens                    New           Undecided   Unassigned
Rocky                     New           Undecided   Unassigned
Stein                     New           Undecided   Unassigned
Train                     In Progress   Undecided   Unassigned
Ussuri                    In Progress   High        Mark Goddard

Bug Description

There is a race condition in nova-compute with the ironic virt driver when nodes are rebalanced across compute services. It can lead to compute node records being removed from the DB and never repopulated. Ultimately this prevents instances from being scheduled to these nodes.

Steps to reproduce
==================

* Deploy nova with multiple nova-compute services managing ironic.
* Create some bare metal nodes in ironic, and make them 'available' (the issue does not reproduce if they are 'active')
* Stop all nova-compute services
* Wait for all nova-compute services to be DOWN in 'openstack compute service list'
* Simultaneously start all nova-compute services

Expected results
================

All ironic nodes appear as hypervisors in 'openstack hypervisor list'

Actual results
==============

One or more nodes may be missing from 'openstack hypervisor list'. This is most easily checked via 'openstack hypervisor list | wc -l'; a scripted comparison against the ironic node list is sketched below.
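
For convenience, here is a minimal sketch of such a check. It shells out to the OpenStack CLI (so it assumes the openstack client and the baremetal plugin are installed and OS_* credentials are set); the column names match what recent python-openstackclient releases emit and may need adjusting. The helper names are ours, not part of nova.

    #!/usr/bin/env python3
    """Report ironic nodes missing from 'openstack hypervisor list'."""
    import json
    import subprocess


    def openstack_json(*args):
        """Run an openstack CLI command and return its parsed JSON output."""
        out = subprocess.check_output(("openstack",) + args + ("-f", "json"))
        return json.loads(out)


    def main():
        # Ironic node UUIDs registered in the bare metal service.
        nodes = {n["UUID"] for n in openstack_json("baremetal", "node", "list")}
        # Hypervisor hostnames reported by nova; with ironic these are node UUIDs.
        hypervisors = {h["Hypervisor Hostname"]
                       for h in openstack_json("hypervisor", "list")}
        missing = nodes - hypervisors
        print("ironic nodes: %d, hypervisors: %d, missing: %d"
              % (len(nodes), len(hypervisors), len(missing)))
        for uuid in sorted(missing):
            print("missing from hypervisor list: %s" % uuid)


    if __name__ == "__main__":
        main()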

Environment
===========

OS: CentOS 7.6
Hypervisor: ironic
Nova: 18.2.0, plus a handful of backported patches

Logs
====

I grabbed some relevant logs from one incident of this issue. They are split between two compute services (c1 and c3); I have tried to make that clear and included a summary of what happened at each point.

http://paste.openstack.org/show/786272/

tl;dr

c3: 19:14:55 Finds no compute node record in the RT. Tries to create one (_init_compute_node). Shows a traceback with an SQL rollback but appears to succeed
c1: 19:14:56 Finds no compute node record in the RT, 'moves' the existing node from c3
c1: 19:15:54 Begins its periodic update, queries the compute nodes for this host, finds the node
c3: 19:15:54 Finds no compute node record in the RT, 'moves' the existing node from c1
c1: 19:15:55 Deletes the orphan compute node (which now belongs to c3)
c3: 19:16:56 Creates the resource provider
c3: 19:17:56 Uses the existing resource provider

There are two major problems here:

* c1 deletes the orphan node after c3 has taken ownership of it

* c3 assumes that another compute service will not delete its nodes. Once a node is in rt.compute_nodes, it is not removed again unless the node is orphaned. A simplified sketch of these two code paths follows.
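
To make the two problems concrete, this is a heavily simplified, self-contained illustration of the periodic logic involved. All names here are stand-ins for the real compute manager / resource tracker code, not nova's actual implementation; a dict plays the role of the compute_nodes table and a per-service cache plays the role of rt.compute_nodes.

    # Illustrative sketch of the race (not nova code).
    db = {}  # nodename -> host that currently owns the record


    class FakeComputeService:
        def __init__(self, host):
            self.host = host
            self.cache = {}  # analogue of rt.compute_nodes

        def update_available_resource(self, nodename):
            """Periodic task: ensure a compute node record exists."""
            if nodename in self.cache:
                # Problem 2: once cached, the record is assumed to still
                # exist in the DB, so it is never recreated if another
                # host deleted it.
                return
            db[nodename] = self.host        # create or "move" the record
            self.cache[nodename] = object()

        def delete_orphans(self, nodes_assigned_by_hash_ring, interleave=None):
            """Periodic task: delete records this host no longer manages."""
            mine = [n for n, owner in db.items() if owner == self.host]
            if interleave:
                interleave()                # another service acts in the gap
            for nodename in mine:
                if nodename not in nodes_assigned_by_hash_ring:
                    # Problem 1: deletes unconditionally, even though
                    # ownership may have changed since 'mine' was computed.
                    db.pop(nodename, None)


    c1, c3 = FakeComputeService("c1"), FakeComputeService("c3")
    c3.update_available_resource("node-a")        # c3 creates the record
    db["node-a"] = "c1"                           # rebalance: c1 takes ownership
    # c1's hash ring no longer lists node-a, so it looks like an orphan to c1;
    # meanwhile c3 takes the node back before c1 actually deletes it.
    c1.delete_orphans(nodes_assigned_by_hash_ring=set(),
                      interleave=lambda: db.update({"node-a": "c3"}))
    c3.update_available_resource("node-a")        # cache hit: nothing recreated
    print(db)  # {} -- the compute node record is gone, as in the bug report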

Mark Goddard (mgoddard)
Changed in nova:
assignee: nobody → Mark Goddard (mgoddard)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/694802

Matt Riedemann (mriedem)
tags: added: ironic resource-tracker
Revision history for this message
Mark Goddard (mgoddard) wrote :

I removed the duplicate association to bug 1841481. While the symptoms are similar, I think the underlying cause is different.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/695012

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/695187

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/695188

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/695189

Revision history for this message
Vladyslav Drok (vdrok) wrote :

Here is what I encountered on Queens during a network partition (edited for readability):

nova-compute bmt01

17:17:40,099.099 Final resource view: name=597d7aac-10c3-498b-8311-a3a802feb8ac
17:17:40,337.337 Final resource view: name=a689ac47-8cdb-4162-ab74-5b94f2b22144
17:17:40,526.526 Final resource view: name=feb38a0f-5299-423c-8f46-78ee120f14ee
17:18:58,783.783 compute can not report its status to conductor in nova.servicegroup.drivers.db (on object.Service.save, pymysql.err.InternalError)
17:19:07,437.437 compute fails to perform periodic update_available_resource (on objects.ComputeNode._db_compute_node_get_all_by_host, pymysql.err.InternalError)
17:19:37,444.444 compute fails to perform periodic _sync_scheduler_instance_info (MessagingTimeout in conductor RPCAPI object_class_action_versions)

<instances start moving>

17:19:45,638.638 No compute node record for bmt01:3baefd99-dbd6-40e3-88a4-dadff5ca4bb8
17:19:45,865.865 ComputeNode 3baefd99-dbd6-40e3-88a4-dadff5ca4bb8 moving from bmt03 to bmt01
17:19:51,450.450 No compute node record for bmt01:1ddc0947-541c-47e5-a77a-3dab82205c21
17:19:51,488.488 ComputeNode 1ddc0947-541c-47e5-a77a-3dab82205c21 moving from bmt03 to bmt01
17:19:57,374.374 No compute node record for bmt01:25934ddf-808f-4bb9-b0f9-55a3e3184cb3
17:19:57,491.491 ComputeNode 25934ddf-808f-4bb9-b0f9-55a3e3184cb3 moving from bmt03 to bmt01
17:19:59,425.425 nova.servicegroup.drivers.db Recovered from being unable to report status.
17:20:01,313.313 No compute node record for bmt01:cf9dd25d-0db0-410b-a91d-58b226126f01
17:20:01,568.568 ComputeNode cf9dd25d-0db0-410b-a91d-58b226126f01 moving from bmt03 to bmt01
17:20:03,513.513 No compute node record for bmt01:812fb0ba-2415-4303-a32f-1dcd6ae591d5
17:20:03,599.599 ComputeNode 812fb0ba-2415-4303-a32f-1dcd6ae591d5 moving from bmt02 to bmt01
17:20:04,717.717 No compute node record for bmt01:db58f55e-1a60-4d20-9eea-5354e2c87bc4
17:20:04,756.756 ComputeNode db58f55e-1a60-4d20-9eea-5354e2c87bc4 moving from bmt03 to bmt01
17:20:06,005.005 No compute node record for bmt01:75ae6252-74e1-4d94-b379-8b1fd3665c57
17:20:06,046.046 ComputeNode 75ae6252-74e1-4d94-b379-8b1fd3665c57 moving from bmt02 to bmt01
17:20:07,153.153 No compute node record for bmt01:787f2ff1-6146-4f6f-aba8-5b37bdb23b25
17:20:07,188.188 ComputeNode 787f2ff1-6146-4f6f-aba8-5b37bdb23b25 moving from bmt03 to bmt01
17:20:08,171.171 No compute node record for bmt01:79c76025-da4f-43ac-a544-0eb5bac76bd8
17:20:08,209.209 ComputeNode 79c76025-da4f-43ac-a544-0eb5bac76bd8 moving from bmt02 to bmt01
17:20:09,178.178 No compute node record for bmt01:ffb3dd3b-f8f9-448f-9cb0-e1e22b996f5e
17:20:09,226.226 ComputeNode ffb3dd3b-f8f9-448f-9cb0-e1e22b996f5e moving from bmt02 to bmt01
17:20:10,411.411 No compute node record for bmt01:50aec742-41fc-46eb-9cf6-6e908ee5040b
17:20:10,428.428 ComputeNode 50aec742-41fc-46eb-9cf6-6e908ee5040b moving from bmt02 to bmt01
17:20:12,168.168 No compute node record for bmt01:83de1b40-5db6-4ecf-9c18-1a83356890ae
17:20:12,195.195 ComputeNode 83de1b40-5db6-4ecf-9c18-1a83356890ae moving from bmt03 to bmt01
...
17:20:48,502.502 Final resource view: name=597d7aac-10c3-498b-8311-a3a802feb8ac
17:20:48,677.677 Final resour...

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Adding the 'api' tag as there is an API impact when operators want to delete the service: they end up getting an exception because the ComputeNode record is gone.

Marking https://bugs.launchpad.net/nova/+bug/1860312 as a duplicate of this one, as I think the root cause will be resolved by fixing the virt driver rather than working around it in the API code.

Changed in nova:
importance: Undecided → High
tags: added: api
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/695012
Committed: https://opendev.org/openstack/nova/commit/59d9871e8a0672538f8ffc43ae99b3d1c4b08909
Submitter: "Zuul (22348)"
Branch: master

commit 59d9871e8a0672538f8ffc43ae99b3d1c4b08909
Author: Mark Goddard <email address hidden>
Date: Tue Nov 19 14:45:02 2019 +0000

    Add functional regression test for bug 1853009

    Bug 1853009 describes a race condition involving multiple nova-compute
    services with ironic. As the compute services start up, the hash ring
    rebalances, and the compute services have an inconsistent view of which
    is responsible for a compute node.

    The sequence of actions here is adapted from a real world log [1], where
    multiple nova-compute services were started simultaneously. In some
    cases mocks are used to simulate race conditions.

    There are three main issues with the behaviour:

    * host2 deletes the orphan node compute node after host1 has taken
      ownership of it.

    * host1 assumes that another compute service will not delete its nodes.
      Once a node is in rt.compute_nodes, it is not removed again unless the
      node is orphaned. This prevents host1 from recreating the compute
      node.

    * host1 assumes that another compute service will not delete its
      resource providers. Once an RP is in the provider tree, it is not
      removed.

    This functional test documents the current behaviour, with the idea that
    it can be updated as this behaviour is fixed.

    [1] http://paste.openstack.org/show/786272/

    Co-Authored-By: Matt Riedemann <email address hidden>

    Change-Id: Ice4071722de54e8d20bb8c3795be22f1995940cd
    Related-Bug: #1853009
    Related-Bug: #1853159
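
For background on the rebalancing mentioned in that commit message: the ironic driver distributes nodes across compute services using a consistent hash ring, so when services appear or disappear, node-to-service assignments shift. The following is a generic, self-contained illustration of that idea; it is not nova's actual hash ring implementation, and the class and host names are invented for the example.

    import bisect
    import hashlib


    def _hash(key):
        """Map a string onto a point on the ring."""
        return int(hashlib.md5(key.encode()).hexdigest(), 16)


    class SimpleHashRing:
        """Toy consistent hash ring: assigns each node to one compute service."""

        def __init__(self, hosts, replicas=64):
            self._ring = sorted(
                (_hash("%s-%d" % (host, i)), host)
                for host in hosts for i in range(replicas))
            self._points = [p for p, _ in self._ring]

        def host_for(self, node_uuid):
            idx = bisect.bisect(self._points, _hash(node_uuid)) % len(self._ring)
            return self._ring[idx][1]


    nodes = ["node-%d" % i for i in range(6)]
    before = SimpleHashRing(["c1", "c2", "c3"])
    after = SimpleHashRing(["c1", "c3"])  # c2 went down: the ring rebalances
    for n in nodes:
        print(n, before.host_for(n), "->", after.host_for(n))

Only the nodes that hashed onto c2 move, but every compute service recomputes its own view of the ring, which is why simultaneous startup can briefly leave services with inconsistent views of node ownership.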

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/695187
Committed: https://opendev.org/openstack/nova/commit/32676a9f45807ea8770dc7bdff1e859673af1b61
Submitter: "Zuul (22348)"
Branch: master

commit 32676a9f45807ea8770dc7bdff1e859673af1b61
Author: Stephen Finucane <email address hidden>
Date: Wed Apr 28 13:53:39 2021 +0100

    Clear rebalanced compute nodes from resource tracker

    There is a race condition in nova-compute with the ironic virt driver as
    nodes get rebalanced. It can lead to compute nodes being removed in the
    DB and not repopulated. Ultimately this prevents these nodes from being
    scheduled to.

    The issue being addressed here is that if a compute node is deleted by a host
    which thinks it is an orphan, then the compute host that actually owns the node
    might not recreate it if the node is already in its resource tracker cache.

    This change fixes the issue by clearing nodes from the resource tracker cache
    for which a compute node entry does not exist. Then, when the available
    resource for the node is updated, the compute node object is not found in the
    cache and gets recreated.

    Change-Id: I39241223b447fcc671161c370dbf16e1773b684a
    Partial-Bug: #1853009
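
A minimal sketch of the approach that commit describes, with illustrative names (the real resource tracker method and its call site in nova may differ): before updating available resources, any cached compute node whose DB record has disappeared is dropped from the cache, so the normal create path runs again on the next pass.

    # Illustrative sketch only -- names are ours, not nova's actual API.

    class ResourceTrackerSketch:
        def __init__(self):
            self.compute_nodes = {}  # nodename -> cached ComputeNode object

        def clean_compute_node_cache(self, compute_nodes_in_db):
            """Drop cached nodes whose DB record no longer exists.

            compute_nodes_in_db: nodenames currently present in the DB for
            this host. Anything cached but absent from the DB was deleted
            (e.g. by another host during a rebalance) and must be forgotten
            so the next resource update recreates it.
            """
            in_db = set(compute_nodes_in_db)
            for nodename in list(self.compute_nodes):
                if nodename not in in_db:
                    del self.compute_nodes[nodename]

The caller (the periodic update of available resources) would pass the compute node list it just fetched from the DB, so a record deleted by another host falls out of the cache and gets recreated rather than silently staying missing.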

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/695188
Committed: https://opendev.org/openstack/nova/commit/2bb4527228c8e6fa4a1fa6cfbe80e8790e4e0789
Submitter: "Zuul (22348)"
Branch: master

commit 2bb4527228c8e6fa4a1fa6cfbe80e8790e4e0789
Author: Mark Goddard <email address hidden>
Date: Tue Nov 19 16:51:01 2019 +0000

    Invalidate provider tree when compute node disappears

    There is a race condition in nova-compute with the ironic virt driver
    as nodes get rebalanced. It can lead to compute nodes being removed in
    the DB and not repopulated. Ultimately this prevents these nodes from
    being scheduled to.

    The issue being addressed here is that if a compute node is deleted by a
    host which thinks it is an orphan, then the resource provider for that
    node might also be deleted. The compute host that owns the node might
    not recreate the resource provider if it exists in the provider tree
    cache.

    This change fixes the issue by clearing resource providers from the
    provider tree cache for which a compute node entry does not exist. Then,
    when the available resource for the node is updated, the resource
    providers are not found in the cache and get recreated in placement.

    Change-Id: Ia53ff43e6964963cdf295604ba0fb7171389606e
    Related-Bug: #1853009
    Related-Bug: #1841481
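
An analogous sketch for the provider tree cache, again with illustrative names rather than the actual SchedulerReportClient API: resource providers whose compute node record has vanished are purged from the local cache, so the next update recreates them in placement.

    # Illustrative sketch only -- not nova's actual report client code.

    class ProviderTreeCacheSketch:
        def __init__(self):
            self.providers = {}  # provider uuid -> cached provider data

        def invalidate_missing_providers(self, compute_node_uuids):
            """Remove cached providers that no longer have a compute node.

            compute_node_uuids: UUIDs of compute nodes that still exist in
            the DB for this host. The compute node UUID doubles as the root
            resource provider UUID, so any cached provider outside this set
            was deleted (e.g. by another host) and must be dropped from the
            cache so it is recreated in placement on the next update.
            """
            keep = set(compute_node_uuids)
            for uuid in list(self.providers):
                if uuid not in keep:
                    del self.providers[uuid]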

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/694802
Committed: https://opendev.org/openstack/nova/commit/a8492e88783b40f6dc61888fada232f0d00d6acf
Submitter: "Zuul (22348)"
Branch: master

commit a8492e88783b40f6dc61888fada232f0d00d6acf
Author: Mark Goddard <email address hidden>
Date: Mon Nov 18 12:06:47 2019 +0000

    Prevent deletion of a compute node belonging to another host

    There is a race condition in nova-compute with the ironic virt driver as
    nodes get rebalanced. It can lead to compute nodes being removed in the
    DB and not repopulated. Ultimately this prevents these nodes from being
    scheduled to.

    The main race condition involved is in update_available_resources in
    the compute manager. When the list of compute nodes is queried, there is
    a compute node belonging to the host that it does not expect to be
    managing, i.e. it is an orphan. Between that time and deleting the
    orphan, the real owner of the compute node takes ownership of it (in
    the resource tracker). However, the node is still deleted as the first
    host is unaware of the ownership change.

    This change prevents this from occurring by filtering on the host when
    deleting a compute node. If another compute host has taken ownership of
    a node, it will have updated the host field and this will prevent
    deletion from occurring. The first host sees this has happened via the
    ComputeHostNotFound exception, and avoids deleting its resource
    provider.

    Co-Authored-By: melanie witt <email address hidden>

    Closes-Bug: #1853009
    Related-Bug: #1841481

    Change-Id: I260c1fded79a85d4899e94df4d9036a1ee437f02
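
A hedged sketch of the kind of host-filtered delete the commit describes. This is illustrative SQLAlchemy, not a copy of nova's DB API: the table, session handling, and function signature are stand-ins. The idea is that the delete only matches rows still owned by the calling host, and a zero-row match is surfaced as ComputeHostNotFound so the caller knows another host took ownership.

    # Illustrative SQLAlchemy sketch -- names are stand-ins for nova's DB API.
    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.orm import Session, declarative_base

    Base = declarative_base()


    class ComputeNode(Base):
        __tablename__ = "compute_nodes"
        id = Column(Integer, primary_key=True)
        host = Column(String)
        hypervisor_hostname = Column(String)


    class ComputeHostNotFound(Exception):
        """Raised when no row matched: another host has taken ownership."""


    def compute_node_delete(session, compute_id, host):
        # Filtering on host makes the delete a no-op if another compute
        # service has already updated the host column.
        matched = (session.query(ComputeNode)
                   .filter_by(id=compute_id, host=host)
                   .delete())
        if matched == 0:
            raise ComputeHostNotFound(host)
        session.commit()


    if __name__ == "__main__":
        engine = create_engine("sqlite://")
        Base.metadata.create_all(engine)
        with Session(engine) as s:
            s.add(ComputeNode(id=1, host="c3", hypervisor_hostname="node-a"))
            s.commit()
            try:
                # c1 still thinks it owns node 1, but the row now says c3 does.
                compute_node_delete(s, compute_id=1, host="c1")
            except ComputeHostNotFound:
                print("delete skipped: node 1 belongs to another host")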

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/695189
Committed: https://opendev.org/openstack/nova/commit/2383cbb4a518821d245fce316b3778c8ba8e5246
Submitter: "Zuul (22348)"
Branch: master

commit 2383cbb4a518821d245fce316b3778c8ba8e5246
Author: Mark Goddard <email address hidden>
Date: Wed Nov 20 12:01:33 2019 +0000

    Fix inactive session error in compute node creation

    In the fix for bug 1839560 [1][2], soft-deleted compute nodes may be
    restored, to ensure we can reuse ironic node UUIDs as compute node
    UUIDs. While this seems to largely work, it results in some nasty errors
    being generated [3]:

        InvalidRequestError This session is in 'inactive' state, due to the
        SQL transaction being rolled back; no further SQL can be emitted
        within this transaction.

    This happens because compute_node_create is decorated with
    pick_context_manager_writer, which begins a transaction. While
    _compute_node_get_and_update_deleted claims that calling a second
    pick_context_manager_writer decorated function will begin a new
    subtransaction, this does not appear to be the case.

    This change removes pick_context_manager_writer from the
    compute_node_create function, and adds a new _compute_node_create
    function which ensures the transaction is finished if
    _compute_node_get_and_update_deleted is called.

    The new unit test added here fails without this change.

    This change marks the removal of the final FIXME from the functional
    test added in [4].

    [1] https://bugs.launchpad.net/nova/+bug/1839560
    [2] https://git.openstack.org/cgit/openstack/nova/commit/?id=89dd74ac7f1028daadf86cb18948e27fe9d1d411
    [3] http://paste.openstack.org/show/786350/
    [4] https://review.opendev.org/#/c/695012/

    Change-Id: Iae119ea8776bc7f2e5dbe2e502a743217beded73
    Closes-Bug: #1853159
    Related-Bug: #1853009
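
A rough sketch of the transactional pattern that commit describes, using a plain SQLAlchemy session factory instead of nova's pick_context_manager_writer machinery (so the decorator, table, and function names are ours): the initial insert runs and finishes in its own transaction, and only then does the duplicate-handling path open a second transaction, so no SQL is ever emitted on a session whose transaction has already been rolled back.

    # Rough sketch of the pattern; nova's real code uses oslo.db decorators.
    from contextlib import contextmanager

    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.exc import IntegrityError
    from sqlalchemy.orm import Session, declarative_base

    Base = declarative_base()
    engine = create_engine("sqlite://")


    class ComputeNode(Base):
        __tablename__ = "compute_nodes"
        id = Column(Integer, primary_key=True)
        uuid = Column(String, unique=True)
        deleted = Column(Integer, default=0)


    @contextmanager
    def writer_session():
        """One independent transaction per call (stand-in for the decorator)."""
        with Session(engine) as session, session.begin():
            yield session


    def _compute_node_create(uuid):
        # First transaction: plain insert. If the UUID already exists, this
        # raises and the transaction is rolled back and closed before any
        # further SQL is attempted.
        with writer_session() as session:
            session.add(ComputeNode(uuid=uuid))


    def _compute_node_get_and_update_deleted(uuid):
        # Second, separate transaction: restore the soft-deleted row.
        with writer_session() as session:
            node = session.query(ComputeNode).filter_by(uuid=uuid).one()
            node.deleted = 0


    def compute_node_create(uuid):
        # No enclosing transaction here, so the two helpers never nest and
        # the restore path does not run inside a rolled-back session.
        try:
            _compute_node_create(uuid)
        except IntegrityError:
            _compute_node_get_and_update_deleted(uuid)


    Base.metadata.create_all(engine)
    compute_node_create("abc")   # creates the row
    compute_node_create("abc")   # duplicate: handled in a fresh transaction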

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 24.0.0.0rc1

This issue was fixed in the openstack/nova 24.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/811805

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/811806

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/811807

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/811808

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/811809

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/811810

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/811811

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/811812

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/811813

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/811814

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/811815

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/811816

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/811817

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/811818

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/811819

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/811821

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/811822

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/811823

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/811824

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/811825

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/nova/+/811805
Committed: https://opendev.org/openstack/nova/commit/c260e75d012cc4fae596d5de185afad6fb24068c
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit c260e75d012cc4fae596d5de185afad6fb24068c
Author: Mark Goddard <email address hidden>
Date: Tue Nov 19 14:45:02 2019 +0000

    Add functional regression test for bug 1853009

    Bug 1853009 describes a race condition involving multiple nova-compute
    services with ironic. As the compute services start up, the hash ring
    rebalances, and the compute services have an inconsistent view of which
    is responsible for a compute node.

    The sequence of actions here is adapted from a real world log [1], where
    multiple nova-compute services were started simultaneously. In some
    cases mocks are used to simulate race conditions.

    There are three main issues with the behaviour:

    * host2 deletes the orphan node compute node after host1 has taken
      ownership of it.

    * host1 assumes that another compute service will not delete its nodes.
      Once a node is in rt.compute_nodes, it is not removed again unless the
      node is orphaned. This prevents host1 from recreating the compute
      node.

    * host1 assumes that another compute service will not delete its
      resource providers. Once an RP is in the provider tree, it is not
      removed.

    This functional test documents the current behaviour, with the idea that
    it can be updated as this behaviour is fixed.

    [1] http://paste.openstack.org/show/786272/

    Co-Authored-By: Matt Riedemann <email address hidden>

    Change-Id: Ice4071722de54e8d20bb8c3795be22f1995940cd
    Related-Bug: #1853009
    Related-Bug: #1853159
    (cherry picked from commit 59d9871e8a0672538f8ffc43ae99b3d1c4b08909)

tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/nova/+/811806
Committed: https://opendev.org/openstack/nova/commit/f950cedf17cc4c3ce9d094dbfde5e4cf013260f7
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit f950cedf17cc4c3ce9d094dbfde5e4cf013260f7
Author: Stephen Finucane <email address hidden>
Date: Wed Apr 28 13:53:39 2021 +0100

    Clear rebalanced compute nodes from resource tracker

    There is a race condition in nova-compute with the ironic virt driver as
    nodes get rebalanced. It can lead to compute nodes being removed in the
    DB and not repopulated. Ultimately this prevents these nodes from being
    scheduled to.

    The issue being addressed here is that if a compute node is deleted by a host
    which thinks it is an orphan, then the compute host that actually owns the node
    might not recreate it if the node is already in its resource tracker cache.

    This change fixes the issue by clearing nodes from the resource tracker cache
    for which a compute node entry does not exist. Then, when the available
    resource for the node is updated, the compute node object is not found in the
    cache and gets recreated.

    Change-Id: I39241223b447fcc671161c370dbf16e1773b684a
    Partial-Bug: #1853009
    (cherry picked from commit 32676a9f45807ea8770dc7bdff1e859673af1b61)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/nova/+/811807
Committed: https://opendev.org/openstack/nova/commit/0fc104eeea065579f7fa9b52794d5151baefc84c
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 0fc104eeea065579f7fa9b52794d5151baefc84c
Author: Mark Goddard <email address hidden>
Date: Tue Nov 19 16:51:01 2019 +0000

    Invalidate provider tree when compute node disappears

    There is a race condition in nova-compute with the ironic virt driver
    as nodes get rebalanced. It can lead to compute nodes being removed in
    the DB and not repopulated. Ultimately this prevents these nodes from
    being scheduled to.

    The issue being addressed here is that if a compute node is deleted by a
    host which thinks it is an orphan, then the resource provider for that
    node might also be deleted. The compute host that owns the node might
    not recreate the resource provider if it exists in the provider tree
    cache.

    This change fixes the issue by clearing resource providers from the
    provider tree cache for which a compute node entry does not exist. Then,
    when the available resource for the node is updated, the resource
    providers are not found in the cache and get recreated in placement.

    Change-Id: Ia53ff43e6964963cdf295604ba0fb7171389606e
    Related-Bug: #1853009
    Related-Bug: #1841481
    (cherry picked from commit 2bb4527228c8e6fa4a1fa6cfbe80e8790e4e0789)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/nova/+/811808
Committed: https://opendev.org/openstack/nova/commit/cbbca58504275f194ec55eeb89dad4a496d98060
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit cbbca58504275f194ec55eeb89dad4a496d98060
Author: Mark Goddard <email address hidden>
Date: Mon Nov 18 12:06:47 2019 +0000

    Prevent deletion of a compute node belonging to another host

    There is a race condition in nova-compute with the ironic virt driver as
    nodes get rebalanced. It can lead to compute nodes being removed in the
    DB and not repopulated. Ultimately this prevents these nodes from being
    scheduled to.

    The main race condition involved is in update_available_resources in
    the compute manager. When the list of compute nodes is queried, there is
    a compute node belonging to the host that it does not expect to be
    managing, i.e. it is an orphan. Between that time and deleting the
    orphan, the real owner of the compute node takes ownership of it (in
    the resource tracker). However, the node is still deleted as the first
    host is unaware of the ownership change.

    This change prevents this from occurring by filtering on the host when
    deleting a compute node. If another compute host has taken ownership of
    a node, it will have updated the host field and this will prevent
    deletion from occurring. The first host sees this has happened via the
    ComputeHostNotFound exception, and avoids deleting its resource
    provider.

    Co-Authored-By: melanie witt <email address hidden>

    Conflicts:
        nova/db/sqlalchemy/api.py

    NOTE(melwitt): The conflict is because change
    I9f414cf831316b624132d9e06192f1ecbbd3dd78 (db: Copy docs from
    'nova.db.*' to 'nova.db.sqlalchemy.*') is not in Wallaby.

    NOTE(melwitt): Differences from the cherry picked change from calling
    nova.db.api => nova.db.sqlalchemy.api directly are due to the alembic
    migration in Xena which looks to have made the nova.db.api interface
    obsolete.

    Closes-Bug: #1853009
    Related-Bug: #1841481

    Change-Id: I260c1fded79a85d4899e94df4d9036a1ee437f02
    (cherry picked from commit a8492e88783b40f6dc61888fada232f0d00d6acf)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/nova/+/811809
Committed: https://opendev.org/openstack/nova/commit/665c053315439e1345aa131f4839945d662fb3f3
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 665c053315439e1345aa131f4839945d662fb3f3
Author: Mark Goddard <email address hidden>
Date: Wed Nov 20 12:01:33 2019 +0000

    Fix inactive session error in compute node creation

    In the fix for bug 1839560 [1][2], soft-deleted compute nodes may be
    restored, to ensure we can reuse ironic node UUIDs as compute node
    UUIDs. While this seems to largely work, it results in some nasty errors
    being generated [3]:

        InvalidRequestError This session is in 'inactive' state, due to the
        SQL transaction being rolled back; no further SQL can be emitted
        within this transaction.

    This happens because compute_node_create is decorated with
    pick_context_manager_writer, which begins a transaction. While
    _compute_node_get_and_update_deleted claims that calling a second
    pick_context_manager_writer decorated function will begin a new
    subtransaction, this does not appear to be the case.

    This change removes pick_context_manager_writer from the
    compute_node_create function, and adds a new _compute_node_create
    function which ensures the transaction is finished if
    _compute_node_get_and_update_deleted is called.

    The new unit test added here fails without this change.

    This change marks the removal of the final FIXME from the functional
    test added in [4].

    [1] https://bugs.launchpad.net/nova/+bug/1839560
    [2] https://git.openstack.org/cgit/openstack/nova/commit/?id=89dd74ac7f1028daadf86cb18948e27fe9d1d411
    [3] http://paste.openstack.org/show/786350/
    [4] https://review.opendev.org/#/c/695012/

    Conflicts:
        nova/db/sqlalchemy/api.py

    NOTE(melwitt): The conflict is because change
    I9f414cf831316b624132d9e06192f1ecbbd3dd78 (db: Copy docs from
    'nova.db.*' to 'nova.db.sqlalchemy.*') is not in Wallaby.

    NOTE(melwitt): Difference from the cherry picked change from calling
    nova.db.api => nova.db.sqlalchemy.api directly are due to the alembic
    migration in Xena which looks to have made the nova.db.api interface
    obsolete.

    Change-Id: Iae119ea8776bc7f2e5dbe2e502a743217beded73
    Closes-Bug: #1853159
    Related-Bug: #1853009
    (cherry picked from commit 2383cbb4a518821d245fce316b3778c8ba8e5246)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 23.2.1

This issue was fixed in the openstack/nova 23.2.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/train)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/811821
Reason: stable/train branch of nova projects' have been tagged as End of Life. All open patches have to be abandoned in order to be able to delete the branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/811825
Reason: stable/train branch of nova projects' have been tagged as End of Life. All open patches have to be abandoned in order to be able to delete the branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/811824
Reason: stable/train branch of nova projects' have been tagged as End of Life. All open patches have to be abandoned in order to be able to delete the branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/811822
Reason: stable/train branch of nova projects' have been tagged as End of Life. All open patches have to be abandoned in order to be able to delete the branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/811823
Reason: stable/train branch of nova projects' have been tagged as End of Life. All open patches have to be abandoned in order to be able to delete the branch.
