OpenStack Compute (nova)

Bug #1841481
Comment #16

OpenStack Infra (hudson-openstack) wrote on 2021-07-07: Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/799327
Committed: https://opendev.org/openstack/nova/commit/f84d5917c6fb045f03645d9f80eafbc6e5f94bdd
Submitter: "Zuul (22348)"
Branch: master

commit f84d5917c6fb045f03645d9f80eafbc6e5f94bdd
Author: Julia Kreger <email address hidden>
Date: Fri Jul 2 12:10:52 2021 -0700

    [ironic] Minimize window for a resource provider to be lost

    This patch is based upon a downstream patch which came up in discussion
    amongst the ironic community when some operators began discussing a case
    where resource providers had disappeared from a running deployment with
    several thousand baremetal nodes.

    Discussion amongst operators and developers ensued and we were able
    to determine that this was still an issue in the current upstream code
    and that the time difference between collecting data and then
    reconciling the records was a source of the issue. Per Arun, they have
    been running this change downstream and have not seen any recurrences
    of the issue since the patch was applied.

    This patch was originally authored by Arun S A G, and below is his
    original commit message.

    An instance could be launched and scheduled to a compute node between
    the get_uuids_by_host() call and the _get_node_list() call. If that
    happens, the ironic node.instance_uuid may not be None, but the
    instance_uuid will be missing from the instance list returned by the
    get_uuids_by_host() method. This is possible because _get_node_list()
    takes several minutes to return in large baremetal clusters and a lot
    can happen in that time.

    This causes the compute node to be orphaned and the associated resource
    provider to be deleted from placement. Once the resource provider is
    deleted, it is never created again until the service restarts. Since
    the resource provider is deleted, subsequent boots/rebuilds to the same
    host will fail.

    This behaviour is visible in VMbooter nodes because it constantly
    launches and deletes instances, thereby increasing the likelihood
    of this race condition happening in large ironic clusters.

    To reduce the chance of this race condition, we call _get_node_list()
    first, followed by the get_uuids_by_host() method.

    Change-Id: I55bde8dd33154e17bbdb3c4b0e7a83a20e8487e8
    Co-Authored-By: Arun S A G <email address hidden>
    Related-Bug: #1841481
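
The ordering issue described in the commit message can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not the nova ironic driver code: `FakeCloud`, `boot`, `unmatched_nodes`, and `run` are hypothetical names invented for the illustration, and the two query methods only borrow the names of the calls mentioned above. The slow node listing is modelled by a callback that boots an instance while the listing is in progress.

```python
class FakeCloud:
    """Hypothetical stand-in for the ironic node data and the nova instance list."""

    def __init__(self):
        self.node_instance = {"node-1": None}  # node uuid -> instance_uuid
        self.host_instances = set()            # instance uuids known for the host

    def boot(self, node, instance):
        # An instance is scheduled onto a node.
        self.host_instances.add(instance)
        self.node_instance[node] = instance

    def get_uuids_by_host(self):
        # Fast call: snapshot of the instance list.
        return set(self.host_instances)

    def get_node_list(self, during_call=None):
        # Slow call in large clusters; during_call simulates activity
        # that happens while the listing is still in progress.
        if during_call is not None:
            during_call()
        return dict(self.node_instance)


def unmatched_nodes(cloud, nodes_first, during_node_list):
    """Return nodes whose instance_uuid is set but absent from the instance list."""
    if nodes_first:
        # Ordering after the patch: slow node listing first.
        nodes = cloud.get_node_list(during_node_list)
        instances = cloud.get_uuids_by_host()
    else:
        # Ordering before the patch: instance list first, so it can go stale.
        instances = cloud.get_uuids_by_host()
        nodes = cloud.get_node_list(during_node_list)
    return sorted(
        node for node, inst in nodes.items()
        if inst is not None and inst not in instances
    )


def run(nodes_first):
    cloud = FakeCloud()
    return unmatched_nodes(
        cloud, nodes_first, lambda: cloud.boot("node-1", "inst-1")
    )


old_order = run(nodes_first=False)
new_order = run(nodes_first=True)
print("instance list first:", old_order)
print("node list first:", new_order)
```

With the instance list collected first, the booted node shows up as unmatched; with the node list collected first, it does not. As the commit title says, the reordering only narrows the window rather than removing the race entirely.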
