Ironic node rebalance race can lead to missing compute nodes in DB
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| OpenStack Compute (nova) | Fix Released | High | Mark Goddard | |
| Ocata | New | Undecided | Unassigned | |
| Pike | New | Undecided | Unassigned | |
| Queens | New | Undecided | Unassigned | |
| Rocky | New | Undecided | Unassigned | |
| Stein | New | Undecided | Unassigned | |
| Train | In Progress | Undecided | Unassigned | |
| Ussuri | In Progress | High | Mark Goddard | |
Bug Description
===============
There is a race condition in nova-compute with the ironic virt driver as nodes get rebalanced between compute services. It can lead to compute node records being removed from the DB and not repopulated. Ultimately this prevents instances from being scheduled to those nodes.
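The rebalance behind this race comes from the ironic driver dividing nodes among nova-compute services with a hash ring: when services join or leave, some nodes map to a different service. A toy sketch of that effect follows; the real driver uses ironic's hash ring implementation, and `ring_owner` here is a made-up stand-in, not nova's API:

```python
# Illustrative only: a toy "closest hash wins" ring showing why node
# ownership moves between compute services when membership changes.
import hashlib

def _h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def ring_owner(node, hosts):
    # Assign the node to the live host "closest" to it on the ring.
    return min(hosts, key=lambda host: (_h(host) - _h(node)) % 2**128)

nodes = ["node-%d" % i for i in range(6)]
before = {n: ring_owner(n, ["c1", "c2", "c3"]) for n in nodes}
after = {n: ring_owner(n, ["c1", "c3"]) for n in nodes}  # c2 goes down
moved = sorted(n for n in nodes if before[n] != after[n])
# Only nodes that c2 owned change hands; the surviving services pick
# them up, which is the "rebalance" this bug report refers to.
```

With this scheme only the departed service's nodes move, but in nova the reassignment is what triggers each service to "move" the compute node record to itself, setting up the race below.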
Steps to reproduce
==================
* Deploy nova with multiple nova-compute services managing ironic.
* Create some bare metal nodes in ironic and make them 'available' (the issue does not reproduce if they are 'active')
* Stop all nova-compute services
* Wait for all nova-compute services to be DOWN in 'openstack compute service list'
* Simultaneously start all nova-compute services
Expected results
================
All ironic nodes appear as hypervisors in 'openstack hypervisor list'
Actual results
==============
One or more nodes may be missing from 'openstack hypervisor list'. This is most easily checked via 'openstack hypervisor list | wc -l'
Environment
===========
OS: CentOS 7.6
Hypervisor: ironic
Nova: 18.2.0, plus a handful of backported patches
Logs
====
I grabbed some relevant logs from one incident of this issue. They are split between two compute services (c1 and c3), and I have tried to make that clear, including a summary of what happened at each point.
http://
tl;dr
c3: 19:14:55 Finds no compute record in RT. Tries to create one (_init_
c1: 19:14:56 Finds no compute record in RT, 'moves' existing node from c3
c1: 19:15:54 Begins periodic update, queries compute nodes for this host, finds the node
c3: 19:15:54 Finds no compute record in RT, 'moves' existing node from c1
c1: 19:15:55 Deletes orphan compute node (which now belongs to c3)
c3: 19:16:56 Creates resource provider
c3: 19:17:56 Uses existing resource provider
There are two major problems here:
* c1 deletes the orphan compute node record after c3 has taken ownership of it
* c3 assumes that another compute service will not delete its compute node records. Once a node is in rt.compute_nodes, it is not removed from the cache unless the node is orphaned, so c3 never notices the record is gone and never recreates it
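The interaction of the two problems can be sketched with a toy model. Everything here (`FakeResourceTracker`, `claim`, the `db` dict) is an illustrative stand-in for nova's resource tracker and compute_nodes table, not the real API:

```python
db = {}  # node name -> owning service; stands in for the compute_nodes table

class FakeResourceTracker:
    """Toy model of a compute service's resource tracker cache."""
    def __init__(self, host):
        self.host = host
        self.compute_nodes = {}  # local cache, mirrors rt.compute_nodes

    def claim(self, node):
        # No record in the local cache: create or "move" the DB record to us.
        db[node] = self.host
        self.compute_nodes[node] = True

# c3 starts up, finds no record, creates the compute node record.
c3 = FakeResourceTracker("c3")
c3.claim("node-1")

# Rebalance: c1 now thinks it owns node-1 and "moves" the record.
c1 = FakeResourceTracker("c1")
c1.claim("node-1")

# c1's periodic task snapshots the records it currently owns...
c1_snapshot = [n for n, owner in db.items() if owner == "c1"]

# ...but before the cleanup runs, the ring moves node-1 back to c3.
c3.compute_nodes.pop("node-1", None)  # c3 dropped its stale cache entry
c3.claim("node-1")                    # and reclaimed the node

# c1 deletes its "orphans" from the stale snapshot without rechecking
# ownership, destroying the record that now belongs to c3 (problem 1).
for node in c1_snapshot:
    db.pop(node, None)

# node-1 is still in c3's cache, so c3 never recreates the record
# (problem 2): the node is missing from the DB until c3 restarts.
assert "node-1" not in db
assert "node-1" in c3.compute_nodes
```

The sequence compresses the timestamps from the log above; the key point is that c1's orphan deletion uses a stale view of ownership while c3's cache hides the loss.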
Changed in nova:
assignee: nobody → Mark Goddard (mgoddard)
status: New → In Progress
tags: added: ironic resource-tracker
Fix proposed to branch: master
Review: https://review.opendev.org/694802