Bug #1841481 “Race during ironic re-balance corrupts local RT Pr...” : Bugs : OpenStack Compute (nova)

Revision history for this message

Eric Fried (efried) wrote on 2019-08-26:

#1

Yup, we ran into this race in update_from_provider_tree as well, for which we made _clear_provider_cache_for_tree().

https://github.com/openstack/nova/blob/71478c3eedd95e2eeb219f47460603221ee249b9/nova/scheduler/client/report.py#L1330-L1341

We should invoke same when _refresh_associations fails.

...Possibly *from* _refresh_associations itself.

...Keeping in mind that that guy is recursive :)

Eric Fried (efried) on 2019-08-27

Changed in nova:
assignee:	nobody → Eric Fried (efried)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-08-27: Related fix proposed to nova (master)

#2

Related fix proposed to branch: master
Review: https://review.opendev.org/678957

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-08-27:

#3

Related fix proposed to branch: master
Review: https://review.opendev.org/678958

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-08-27:

#4

Related fix proposed to branch: master
Review: https://review.opendev.org/678959

Changed in nova:
status:	Triaged → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-08-27: Fix proposed to nova (master)

#5

Fix proposed to branch: master
Review: https://review.opendev.org/678960

Revision history for this message

Matt Riedemann (mriedem) wrote on 2019-09-25:

#6

I'm wondering if there is a way to recreate this with a functional test similar to https://review.opendev.org/#/c/675705/ where we'd have two compute services and a single compute node. The node would start on host1 and then we'd update the node to go from host1 to host2 in the DB and run the update_available_resource periodic on host1 which should delete the resource provider, and then run that same periodic on host2 to see if it fails. One thing we'd probably have to stub is injecting the provider into the host2 ProviderTree cache after host1 deletes the provider but before host2 tries to refresh associations for the provider, which is a bit icky but kind of the only way to recreate a race in tests.

Revision history for this message

Matt Riedemann (mriedem) wrote on 2019-09-25:

#7

I think we can consider this a regression going back to https://review.opendev.org/#/c/526540/ which changed the _get_provider_traits method from returning None to raising an error. That doesn't mean that change was wrong, just that it's probably as early as we can say this is a recreatable (backportable) issue.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-09-25: Related fix proposed to nova (master)

#8

Related fix proposed to branch: master
Review: https://review.opendev.org/684840

Revision history for this message

Matt Riedemann (mriedem) wrote on 2019-09-25:

#9

Arguably the #1 fix about removing the entry from the ResourceTracker.compute_nodes dict could go back to Ocata because this is in Ocata:

https://review.opendev.org/#/q/I4253cffca3dbf558c875eed7e77711a31e9e3406

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-09-25: Fix proposed to nova (master)

#10

Fix proposed to branch: master
Review: https://review.opendev.org/684849

Changed in nova:
assignee:	Eric Fried (efried) → Matt Riedemann (mriedem)

Revision history for this message

Matt Riedemann (mriedem) wrote on 2019-10-15:

#11

Hits in ironic multinode jobs:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Skipping%20removal%20of%20allocations%20for%20deleted%20instances%3A%20Failed%20to%20retrieve%20allocations%20for%20resource%20provider%5C%22%20AND%20message%3A%5C%22No%20resource%20provider%20with%20uuid%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22%20AND%20project%3A%5C%22openstack%2Fironic%5C%22&from=7d

We don't have an elastic-recheck query for that since none of the jobs it hits on are voting.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-11-20: Related fix proposed to nova (master)

#12

Related fix proposed to branch: master
Review: https://review.opendev.org/695188

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-11-21: Change abandoned on nova (master)

#13

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/684840
Reason: Let's just go with Mark's series here:

https://review.opendev.org/#/q/topic:bug/1839560+status:open

I've lost context on the bug and fixes and likely not going to drive any of this forward so we'll just go with Mark's changes.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-11-21:

#14

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/684849
Reason: Let's just go with Mark's series here:

https://review.opendev.org/#/q/topic:bug/1839560+status:open

I've lost context on the bug and fixes and likely not going to drive any of this forward so we'll just go with Mark's changes.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-07-02: Related fix proposed to nova (master)

#15

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/799327

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-07-07: Related fix merged to nova (master)

#16

Reviewed: https://review.opendev.org/c/openstack/nova/+/799327
Committed: https://opendev.org/openstack/nova/commit/f84d5917c6fb045f03645d9f80eafbc6e5f94bdd
Submitter: "Zuul (22348)"
Branch: master

commit f84d5917c6fb045f03645d9f80eafbc6e5f94bdd
Author: Julia Kreger <email address hidden>
Date: Fri Jul 2 12:10:52 2021 -0700

[ironic] Minimize window for a resource provider to be lost

    This patch is based upon a downstream patch which came up in discussion
    amongst the ironic community when some operators began discussing a case
    where resource providers had disappeared from a running deployment with
    several thousand baremetal nodes.

    Discussion amongst operators and developers ensued and we were able
    to determine that this was still an issue in the current upstream code
    and that time difference between collecting data and then reconciling
    the records was a source of the issue. Per Arun, they have been running
    this change downstream and had not seen any reoccurances of the issue
    since the patch was applied.

This patch was originally authored by Arun S A G, and below is his
original commit mesage.

    An instance could be launched and scheduled to a compute node between
    get_uuids_by_host() call and _get_node_list() call. If that happens
    the ironic node.instance_uuid may not be None but the instance_uuid
    will be missing from the instance list returned by get_uuids_by_host()
    method. This is possible because _get_node_list() takes several minutes to return
    in large baremetal clusters and a lot can happen in that time.

    This causes the compute node to be orphaned and associated resource
    provider to be deleted from placement. Once the resource provider is
    deleted it is never created again until the service restarts. Since
    resource provider is deleted subsequent boots/rebuilds to the same
    host will fail.

    This behaviour is visibile in VMbooter nodes because it constantly
    launches and deletes instances there by increasing the likelihood
    of this race condition happening in large ironic clusters.

To reduce the chance of this race condition we call _get_node_list()
first followed by get_uuids_by_host() method.

    Change-Id: I55bde8dd33154e17bbdb3c4b0e7a83a20e8487e8
    Co-Authored-By: Arun S A G <email address hidden>
    Related-Bug: #1841481

Reviewed:  https://review.opendev.org/c/openstack/nova/+/799327
Committed: https://opendev.org/openstack/nova/commit/f84d5917c6fb045f03645d9f80eafbc6e5f94bdd
Submitter: "Zuul (22348)"
Branch:    master

commit f84d5917c6fb045f03645d9f80eafbc6e5f94bdd
Author: Julia Kreger <juliaashleykreger@gmail.com>
Date:   Fri Jul 2 12:10:52 2021 -0700

[ironic] Minimize window for a resource provider to be lost
    
    This patch is based upon a downstream patch which came up in discussion
    amongst the ironic community when some operators began discussing a case
    where resource providers had disappeared from a running deployment with
    several thousand baremetal nodes.
    
    Discussion amongst operators and developers ensued and we were able
    to determine that this was still an issue in the current upstream code
    and that time difference between collecting data and then reconciling
    the records was a source of the issue. Per Arun, they have been running
    this change downstream and had not seen any reoccurances of the issue
    since the patch was applied.
    
    This patch was originally authored by Arun S A G, and below is his
    original commit mesage.
    
    An instance could be launched and scheduled to a compute node between
    get_uuids_by_host() call and _get_node_list() call. If that happens
    the ironic node.instance_uuid may not be None but the instance_uuid
    will be missing from the instance list returned by get_uuids_by_host()
    method. This is possible because _get_node_list() takes several minutes to return
    in large baremetal clusters and a lot can happen in that time.
    
    This causes the compute node to be orphaned and associated resource
    provider to be deleted from placement. Once the resource provider is
    deleted it is never created again until the service restarts. Since
    resource provider is deleted subsequent boots/rebuilds to the same
    host will fail.
    
    This behaviour is visibile in VMbooter nodes because it constantly
    launches and deletes instances there by increasing the likelihood
    of this race condition happening in large ironic clusters.
    
    To reduce the chance of this race condition we call _get_node_list()
    first followed by get_uuids_by_host() method.
    
    Change-Id: I55bde8dd33154e17bbdb3c4b0e7a83a20e8487e8
    Co-Authored-By: Arun S A G <saga@yahoo-inc.com>
    Related-Bug: #1841481

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-07-07: Related fix proposed to nova (stable/wallaby)

#17

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/799772

Revision history for this message

melanie witt (melwitt) wrote on 2021-07-14:

#18

This bug seems to have the same root cause as:

https://bugs.launchpad.net/nova/+bug/1853009

which has patches under review, so I'm going to mark this bug as a duplicate of it in an attempt to reduce confusion.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-07-16: Related fix merged to nova (stable/wallaby)

#19

Reviewed: https://review.opendev.org/c/openstack/nova/+/799772
Committed: https://opendev.org/openstack/nova/commit/0c36bd28ebd05ec0b1dbae950a24a2ecf339be00
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 0c36bd28ebd05ec0b1dbae950a24a2ecf339be00
Author: Julia Kreger <email address hidden>
Date: Fri Jul 2 12:10:52 2021 -0700

[ironic] Minimize window for a resource provider to be lost

    This patch is based upon a downstream patch which came up in discussion
    amongst the ironic community when some operators began discussing a case
    where resource providers had disappeared from a running deployment with
    several thousand baremetal nodes.

    Discussion amongst operators and developers ensued and we were able
    to determine that this was still an issue in the current upstream code
    and that time difference between collecting data and then reconciling
    the records was a source of the issue. Per Arun, they have been running
    this change downstream and had not seen any reoccurances of the issue
    since the patch was applied.

This patch was originally authored by Arun S A G, and below is his
original commit mesage.

    An instance could be launched and scheduled to a compute node between
    get_uuids_by_host() call and _get_node_list() call. If that happens
    the ironic node.instance_uuid may not be None but the instance_uuid
    will be missing from the instance list returned by get_uuids_by_host()
    method. This is possible because _get_node_list() takes several minutes to return
    in large baremetal clusters and a lot can happen in that time.

    This causes the compute node to be orphaned and associated resource
    provider to be deleted from placement. Once the resource provider is
    deleted it is never created again until the service restarts. Since
    resource provider is deleted subsequent boots/rebuilds to the same
    host will fail.

    This behaviour is visibile in VMbooter nodes because it constantly
    launches and deletes instances there by increasing the likelihood
    of this race condition happening in large ironic clusters.

To reduce the chance of this race condition we call _get_node_list()
first followed by get_uuids_by_host() method.

    Change-Id: I55bde8dd33154e17bbdb3c4b0e7a83a20e8487e8
    Co-Authored-By: Arun S A G <email address hidden>
    Related-Bug: #1841481
    (cherry picked from commit f84d5917c6fb045f03645d9f80eafbc6e5f94bdd)

Reviewed:  https://review.opendev.org/c/openstack/nova/+/799772
Committed: https://opendev.org/openstack/nova/commit/0c36bd28ebd05ec0b1dbae950a24a2ecf339be00
Submitter: "Zuul (22348)"
Branch:    stable/wallaby

commit 0c36bd28ebd05ec0b1dbae950a24a2ecf339be00
Author: Julia Kreger <juliaashleykreger@gmail.com>
Date:   Fri Jul 2 12:10:52 2021 -0700

[ironic] Minimize window for a resource provider to be lost
    
    This patch is based upon a downstream patch which came up in discussion
    amongst the ironic community when some operators began discussing a case
    where resource providers had disappeared from a running deployment with
    several thousand baremetal nodes.
    
    Discussion amongst operators and developers ensued and we were able
    to determine that this was still an issue in the current upstream code
    and that time difference between collecting data and then reconciling
    the records was a source of the issue. Per Arun, they have been running
    this change downstream and had not seen any reoccurances of the issue
    since the patch was applied.
    
    This patch was originally authored by Arun S A G, and below is his
    original commit mesage.
    
    An instance could be launched and scheduled to a compute node between
    get_uuids_by_host() call and _get_node_list() call. If that happens
    the ironic node.instance_uuid may not be None but the instance_uuid
    will be missing from the instance list returned by get_uuids_by_host()
    method. This is possible because _get_node_list() takes several minutes to return
    in large baremetal clusters and a lot can happen in that time.
    
    This causes the compute node to be orphaned and associated resource
    provider to be deleted from placement. Once the resource provider is
    deleted it is never created again until the service restarts. Since
    resource provider is deleted subsequent boots/rebuilds to the same
    host will fail.
    
    This behaviour is visibile in VMbooter nodes because it constantly
    launches and deletes instances there by increasing the likelihood
    of this race condition happening in large ironic clusters.
    
    To reduce the chance of this race condition we call _get_node_list()
    first followed by get_uuids_by_host() method.
    
    Change-Id: I55bde8dd33154e17bbdb3c4b0e7a83a20e8487e8
    Co-Authored-By: Arun S A G <saga@yahoo-inc.com>
    Related-Bug: #1841481
    (cherry picked from commit f84d5917c6fb045f03645d9f80eafbc6e5f94bdd)

tags:

added: in-stable-wallaby

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-07-16: Related fix proposed to nova (stable/victoria)

#20

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/800873

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-08-20: Related fix merged to nova (master)

#21

Reviewed: https://review.opendev.org/c/openstack/nova/+/695188
Committed: https://opendev.org/openstack/nova/commit/2bb4527228c8e6fa4a1fa6cfbe80e8790e4e0789
Submitter: "Zuul (22348)"
Branch: master

commit 2bb4527228c8e6fa4a1fa6cfbe80e8790e4e0789
Author: Mark Goddard <email address hidden>
Date: Tue Nov 19 16:51:01 2019 +0000

Invalidate provider tree when compute node disappears

    There is a race condition in nova-compute with the ironic virt driver
    as nodes get rebalanced. It can lead to compute nodes being removed in
    the DB and not repopulated. Ultimately this prevents these nodes from
    being scheduled to.

    The issue being addressed here is that if a compute node is deleted by a
    host which thinks it is an orphan, then the resource provider for that
    node might also be deleted. The compute host that owns the node might
    not recreate the resource provider if it exists in the provider tree
    cache.

    This change fixes the issue by clearing resource providers from the
    provider tree cache for which a compute node entry does not exist. Then,
    when the available resource for the node is updated, the resource
    providers are not found in the cache and get recreated in placement.

    Change-Id: Ia53ff43e6964963cdf295604ba0fb7171389606e
    Related-Bug: #1853009
    Related-Bug: #1841481

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-08-30:

#22

Reviewed: https://review.opendev.org/c/openstack/nova/+/694802
Committed: https://opendev.org/openstack/nova/commit/a8492e88783b40f6dc61888fada232f0d00d6acf
Submitter: "Zuul (22348)"
Branch: master

commit a8492e88783b40f6dc61888fada232f0d00d6acf
Author: Mark Goddard <email address hidden>
Date: Mon Nov 18 12:06:47 2019 +0000

Prevent deletion of a compute node belonging to another host

    There is a race condition in nova-compute with the ironic virt driver as
    nodes get rebalanced. It can lead to compute nodes being removed in the
    DB and not repopulated. Ultimately this prevents these nodes from being
    scheduled to.

    The main race condition involved is in update_available_resources in
    the compute manager. When the list of compute nodes is queried, there is
    a compute node belonging to the host that it does not expect to be
    managing, i.e. it is an orphan. Between that time and deleting the
    orphan, the real owner of the compute node takes ownership of it ( in
    the resource tracker). However, the node is still deleted as the first
    host is unaware of the ownership change.

    This change prevents this from occurring by filtering on the host when
    deleting a compute node. If another compute host has taken ownership of
    a node, it will have updated the host field and this will prevent
    deletion from occurring. The first host sees this has happened via the
    ComputeHostNotFound exception, and avoids deleting its resource
    provider.

Co-Authored-By: melanie witt <email address hidden>

Closes-Bug: #1853009
Related-Bug: #1841481

Change-Id: I260c1fded79a85d4899e94df4d9036a1ee437f02

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-09-29: Related fix proposed to nova (stable/wallaby)

#23

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/811807

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-09-29:

#24

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/811808

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-09-30: Related fix proposed to nova (stable/victoria)

#25

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/811812

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-09-30:

#26

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/811813

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-09-30: Related fix proposed to nova (stable/ussuri)

#27

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/811817

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-09-30:

#28

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/811818

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-09-30: Related fix proposed to nova (stable/train)

#29

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/811823

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-09-30:

#30

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/811824

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2022-05-04: Related fix merged to nova (stable/wallaby)

#31

Reviewed: https://review.opendev.org/c/openstack/nova/+/811807
Committed: https://opendev.org/openstack/nova/commit/0fc104eeea065579f7fa9b52794d5151baefc84c
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 0fc104eeea065579f7fa9b52794d5151baefc84c
Author: Mark Goddard <email address hidden>
Date: Tue Nov 19 16:51:01 2019 +0000

Invalidate provider tree when compute node disappears

    There is a race condition in nova-compute with the ironic virt driver
    as nodes get rebalanced. It can lead to compute nodes being removed in
    the DB and not repopulated. Ultimately this prevents these nodes from
    being scheduled to.

    The issue being addressed here is that if a compute node is deleted by a
    host which thinks it is an orphan, then the resource provider for that
    node might also be deleted. The compute host that owns the node might
    not recreate the resource provider if it exists in the provider tree
    cache.

    This change fixes the issue by clearing resource providers from the
    provider tree cache for which a compute node entry does not exist. Then,
    when the available resource for the node is updated, the resource
    providers are not found in the cache and get recreated in placement.

    Change-Id: Ia53ff43e6964963cdf295604ba0fb7171389606e
    Related-Bug: #1853009
    Related-Bug: #1841481
    (cherry picked from commit 2bb4527228c8e6fa4a1fa6cfbe80e8790e4e0789)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2022-06-04:

#32

Reviewed: https://review.opendev.org/c/openstack/nova/+/811808
Committed: https://opendev.org/openstack/nova/commit/cbbca58504275f194ec55eeb89dad4a496d98060
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit cbbca58504275f194ec55eeb89dad4a496d98060
Author: Mark Goddard <email address hidden>
Date: Mon Nov 18 12:06:47 2019 +0000

Prevent deletion of a compute node belonging to another host

    There is a race condition in nova-compute with the ironic virt driver as
    nodes get rebalanced. It can lead to compute nodes being removed in the
    DB and not repopulated. Ultimately this prevents these nodes from being
    scheduled to.

    The main race condition involved is in update_available_resources in
    the compute manager. When the list of compute nodes is queried, there is
    a compute node belonging to the host that it does not expect to be
    managing, i.e. it is an orphan. Between that time and deleting the
    orphan, the real owner of the compute node takes ownership of it ( in
    the resource tracker). However, the node is still deleted as the first
    host is unaware of the ownership change.

    This change prevents this from occurring by filtering on the host when
    deleting a compute node. If another compute host has taken ownership of
    a node, it will have updated the host field and this will prevent
    deletion from occurring. The first host sees this has happened via the
    ComputeHostNotFound exception, and avoids deleting its resource
    provider.

Co-Authored-By: melanie witt <email address hidden>

Conflicts:
nova/db/sqlalchemy/api.py

    NOTE(melwitt): The conflict is because change
    I9f414cf831316b624132d9e06192f1ecbbd3dd78 (db: Copy docs from
    'nova.db.*' to 'nova.db.sqlalchemy.*') is not in Wallaby.

    NOTE(melwitt): Differences from the cherry picked change from calling
    nova.db.api => nova.db.sqlalchemy.api directly are due to the alembic
    migration in Xena which looks to have made the nova.db.api interface
    obsolete.

Closes-Bug: #1853009
Related-Bug: #1841481

Change-Id: I260c1fded79a85d4899e94df4d9036a1ee437f02
(cherry picked from commit a8492e88783b40f6dc61888fada232f0d00d6acf)

Reviewed:  https://review.opendev.org/c/openstack/nova/+/811808
Committed: https://opendev.org/openstack/nova/commit/cbbca58504275f194ec55eeb89dad4a496d98060
Submitter: "Zuul (22348)"
Branch:    stable/wallaby

commit cbbca58504275f194ec55eeb89dad4a496d98060
Author: Mark Goddard <mark@stackhpc.com>
Date:   Mon Nov 18 12:06:47 2019 +0000

Prevent deletion of a compute node belonging to another host
    
    There is a race condition in nova-compute with the ironic virt driver as
    nodes get rebalanced. It can lead to compute nodes being removed in the
    DB and not repopulated. Ultimately this prevents these nodes from being
    scheduled to.
    
    The main race condition involved is in update_available_resources in
    the compute manager. When the list of compute nodes is queried, there is
    a compute node belonging to the host that it does not expect to be
    managing, i.e. it is an orphan. Between that time and deleting the
    orphan, the real owner of the compute node takes ownership of it ( in
    the resource tracker). However, the node is still deleted as the first
    host is unaware of the ownership change.
    
    This change prevents this from occurring by filtering on the host when
    deleting a compute node. If another compute host has taken ownership of
    a node, it will have updated the host field and this will prevent
    deletion from occurring. The first host sees this has happened via the
    ComputeHostNotFound exception, and avoids deleting its resource
    provider.
    
    Co-Authored-By: melanie witt <melwittt@gmail.com>
    
    Conflicts:
        nova/db/sqlalchemy/api.py
    
    NOTE(melwitt): The conflict is because change
    I9f414cf831316b624132d9e06192f1ecbbd3dd78 (db: Copy docs from
    'nova.db.*' to 'nova.db.sqlalchemy.*') is not in Wallaby.
    
    NOTE(melwitt): Differences from the cherry picked change from calling
    nova.db.api => nova.db.sqlalchemy.api directly are due to the alembic
    migration in Xena which looks to have made the nova.db.api interface
    obsolete.
    
    Closes-Bug: #1853009
    Related-Bug: #1841481
    
    Change-Id: I260c1fded79a85d4899e94df4d9036a1ee437f02
    (cherry picked from commit a8492e88783b40f6dc61888fada232f0d00d6acf)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2022-08-17: Related fix merged to nova (stable/victoria)

#33

Reviewed: https://review.opendev.org/c/openstack/nova/+/800873
Committed: https://opendev.org/openstack/nova/commit/bc5fc2bc688056bc18cf3ae581d8e23592d110da
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit bc5fc2bc688056bc18cf3ae581d8e23592d110da
Author: Julia Kreger <email address hidden>
Date: Fri Jul 2 12:10:52 2021 -0700

[ironic] Minimize window for a resource provider to be lost

    This patch is based upon a downstream patch which came up in discussion
    amongst the ironic community when some operators began discussing a case
    where resource providers had disappeared from a running deployment with
    several thousand baremetal nodes.

    Discussion amongst operators and developers ensued and we were able
    to determine that this was still an issue in the current upstream code
    and that time difference between collecting data and then reconciling
    the records was a source of the issue. Per Arun, they have been running
    this change downstream and had not seen any reoccurances of the issue
    since the patch was applied.

This patch was originally authored by Arun S A G, and below is his
original commit mesage.

    An instance could be launched and scheduled to a compute node between
    get_uuids_by_host() call and _get_node_list() call. If that happens
    the ironic node.instance_uuid may not be None but the instance_uuid
    will be missing from the instance list returned by get_uuids_by_host()
    method. This is possible because _get_node_list() takes several minutes to return
    in large baremetal clusters and a lot can happen in that time.

    This causes the compute node to be orphaned and associated resource
    provider to be deleted from placement. Once the resource provider is
    deleted it is never created again until the service restarts. Since
    resource provider is deleted subsequent boots/rebuilds to the same
    host will fail.

    This behaviour is visibile in VMbooter nodes because it constantly
    launches and deletes instances there by increasing the likelihood
    of this race condition happening in large ironic clusters.

To reduce the chance of this race condition we call _get_node_list()
first followed by get_uuids_by_host() method.

    Change-Id: I55bde8dd33154e17bbdb3c4b0e7a83a20e8487e8
    Co-Authored-By: Arun S A G <email address hidden>
    Related-Bug: #1841481
    (cherry picked from commit f84d5917c6fb045f03645d9f80eafbc6e5f94bdd)
    (cherry picked from commit 0c36bd28ebd05ec0b1dbae950a24a2ecf339be00)

Reviewed:  https://review.opendev.org/c/openstack/nova/+/800873
Committed: https://opendev.org/openstack/nova/commit/bc5fc2bc688056bc18cf3ae581d8e23592d110da
Submitter: "Zuul (22348)"
Branch:    stable/victoria

commit bc5fc2bc688056bc18cf3ae581d8e23592d110da
Author: Julia Kreger <juliaashleykreger@gmail.com>
Date:   Fri Jul 2 12:10:52 2021 -0700

[ironic] Minimize window for a resource provider to be lost
    
    This patch is based upon a downstream patch which came up in discussion
    amongst the ironic community when some operators began discussing a case
    where resource providers had disappeared from a running deployment with
    several thousand baremetal nodes.
    
    Discussion amongst operators and developers ensued and we were able
    to determine that this was still an issue in the current upstream code
    and that time difference between collecting data and then reconciling
    the records was a source of the issue. Per Arun, they have been running
    this change downstream and had not seen any reoccurances of the issue
    since the patch was applied.
    
    This patch was originally authored by Arun S A G, and below is his
    original commit mesage.
    
    An instance could be launched and scheduled to a compute node between
    get_uuids_by_host() call and _get_node_list() call. If that happens
    the ironic node.instance_uuid may not be None but the instance_uuid
    will be missing from the instance list returned by get_uuids_by_host()
    method. This is possible because _get_node_list() takes several minutes to return
    in large baremetal clusters and a lot can happen in that time.
    
    This causes the compute node to be orphaned and associated resource
    provider to be deleted from placement. Once the resource provider is
    deleted it is never created again until the service restarts. Since
    resource provider is deleted subsequent boots/rebuilds to the same
    host will fail.
    
    This behaviour is visibile in VMbooter nodes because it constantly
    launches and deletes instances there by increasing the likelihood
    of this race condition happening in large ironic clusters.
    
    To reduce the chance of this race condition we call _get_node_list()
    first followed by get_uuids_by_host() method.
    
    Change-Id: I55bde8dd33154e17bbdb3c4b0e7a83a20e8487e8
    Co-Authored-By: Arun S A G <saga@yahoo-inc.com>
    Related-Bug: #1841481
    (cherry picked from commit f84d5917c6fb045f03645d9f80eafbc6e5f94bdd)
    (cherry picked from commit 0c36bd28ebd05ec0b1dbae950a24a2ecf339be00)

tags:

added: in-stable-victoria

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2022-08-17: Related fix proposed to nova (stable/ussuri)

#34

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/853540

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2022-08-17: Related fix proposed to nova (stable/train)

#35

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/853546

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2022-08-23: Related fix merged to nova (stable/ussuri)

#36

Reviewed: https://review.opendev.org/c/openstack/nova/+/853540
Committed: https://opendev.org/openstack/nova/commit/67be896e0f70ac3f4efc4c87fc03395b7029e345
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 67be896e0f70ac3f4efc4c87fc03395b7029e345
Author: Julia Kreger <email address hidden>
Date: Fri Jul 2 12:10:52 2021 -0700