Concurrent evacuation of vms with pinned cpus to the same host fail randomly

Bug #1988311 reported by Bence Romsics
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
High
Unassigned
Yoga
In Progress
High
Unassigned
oslo.concurrency
Fix Released
Undecided
Unassigned

Bug Description

Reproduction:

Boot two vms (each with one pinned cpu) on devstack0.
Then evacuate them to devtack0a.
devstack0a has two dedicated cpus, so both vms should fit.
However sometimes (for example 6 out of 10 times) the evacuation of one vm fails with this error message: 'CPU set to pin [0] must be a subset of free CPU set [1]'.

devstack0 - all-in-one host
devstack0a - compute-only host

# have two dedicated cpus for pinning on the evacuation target host
devstack0a:/etc/nova/nova-cpu.conf:
[compute]
cpu_dedicated_set = 0,1

# the dedicated cpus are properly tracked in placement
$ openstack resource provider list
+--------------------------------------+------------+------------+--------------------------------------+----------------------+
| uuid | name | generation | root_provider_uuid | parent_provider_uuid |
+--------------------------------------+------------+------------+--------------------------------------+----------------------+
| a0574d87-42ee-4e13-b05a-639dc62c1196 | devstack0a | 2 | a0574d87-42ee-4e13-b05a-639dc62c1196 | None |
| 2e6fac42-d6e3-4366-a864-d5eb2bdc2241 | devstack0 | 2 | 2e6fac42-d6e3-4366-a864-d5eb2bdc2241 | None |
+--------------------------------------+------------+------------+--------------------------------------+----------------------+
$ openstack resource provider inventory list a0574d87-42ee-4e13-b05a-639dc62c1196
+----------------+------------------+----------+----------+----------+-----------+-------+------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total | used |
+----------------+------------------+----------+----------+----------+-----------+-------+------+
| MEMORY_MB | 1.5 | 1 | 3923 | 512 | 1 | 3923 | 0 |
| DISK_GB | 1.0 | 1 | 28 | 0 | 1 | 28 | 0 |
| PCPU | 1.0 | 1 | 2 | 0 | 1 | 2 | 0 |
+----------------+------------------+----------+----------+----------+-----------+-------+------+

# use vms with one pinned cpu
openstack flavor create cirros256-pinned --public --ram 256 --disk 1 --vcpus 1 --property hw_rng:allowed=True --property hw:cpu_policy=dedicated

# boot two vms (each with one pinned cpu) on devstack0
n=2 ; for i in $( seq $n ) ; do openstack server create --flavor cirros256-pinned --image cirros-0.5.2-x86_64-disk --nic net-id=private --availability-zone :devstack0 --wait vm$i ; done

# kill n-cpu on devstack0
devstack0 $ sudo systemctl stop devstack@n-cpu
# and force it down, so we can start evacuating
openstack compute service set devstack0 nova-compute --down

# evacuate both vms to devstack0a concurrently
for vm in $( openstack server list --host devstack0 -f value -c ID ) ; do openstack --os-compute-api-version 2.29 server evacuate --host devstack0a $vm & done

# follow up on how the evacuation is going, check if the bug occured, see details a bit below
for i in $( seq $n ) ; do openstack server show vm$i -f value -c OS-EXT-SRV-ATTR:host -c status ; done

# clean up
devstack0 $ sudo systemctl start devstack@n-cpu
openstack compute service set devstack0 nova-compute --up
for i in $( seq $n ) ; do openstack server delete vm$i --wait ; done

This bug is not deterministic. For example out of 10 tries (like above) I have seen 4 successes - when both vms successfully evacuated to (went to ACTIVE on) devstack0a.

But in the other 6 cases only one vm evacuated successfully. The other vm went to ERROR state, with the error message: "CPU set to pin [0] must be a subset of free CPU set [1]". For example:

$ openstack server show vm2
...
| fault | {'code': 400, 'created': '2022-08-24T13:50:33Z', 'message': 'CPU set to pin [0] must be a subset of free CPU set [1]'} |
...

In n-cpu logs we see the following:

aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [None req-278f5b67-a765-4231-b2b9-db3f8c7fe092 admin admin] [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] Setting instance vm_state to ERROR: nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] Traceback (most recent call last):
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/manager.py", line 10375, in _error_out_instance_on_exception
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] yield
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/manager.py", line 3564, in rebuild_instance
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] self._do_rebuild_instance_with_claim(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/manager.py", line 3641, in _do_rebuild_instance_with_claim
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] claim_context = rebuild_claim(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py", line 395, in inner
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] return f(*args, **kwargs)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/resource_tracker.py", line 204, in rebuild_claim
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] return self._move_claim(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/resource_tracker.py", line 348, in _move_claim
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] self._update_usage_from_migration(context, instance, migration,
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1411, in _update_usage_from_migration
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] self._update_usage(usage, nodename)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1326, in _update_usage
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] cn.numa_topology = hardware.numa_usage_from_instance_numa(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/virt/hardware.py", line 2567, in numa_usage_from_instance_numa
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] new_cell.pin_cpus(pinned_cpus)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] File "/opt/stack/nova/nova/objects/numa.py", line 95, in pin_cpus
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] raise exception.CPUPinningInvalid(requested=list(cpus),
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR nova.compute.manager [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348]·
aug 24 13:50:33 devstack0a nova-compute[246038]: INFO nova.compute.manager [None req-278f5b67-a765-4231-b2b9-db3f8c7fe092 admin admin] [instance: dc3acde3-f1c6-41a9-9a12-0c278ad4b348] Successfully reverted task state from rebuilding on failure for instance.
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server [None req-278f5b67-a765-4231-b2b9-db3f8c7fe092 admin admin] Exception during message handling: nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server Traceback (most recent call last):
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_messaging/rpc/server.py", line 241, in inner
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return func(*args, **kwargs)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/exception_wrapper.py", line 65, in wrapped
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server with excutils.save_and_reraise_exception():
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server self.force_reraise()
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server raise self.value
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/exception_wrapper.py", line 63, in wrapped
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return f(self, context, *args, **kw)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 163, in decorated_function
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server with excutils.save_and_reraise_exception():
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server self.force_reraise()
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server raise self.value
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 154, in decorated_function
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/utils.py", line 1439, in decorated_function
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 210, in decorated_function
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server with excutils.save_and_reraise_exception():
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server self.force_reraise()
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server raise self.value
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 200, in decorated_function
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 3564, in rebuild_instance
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server self._do_rebuild_instance_with_claim(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/manager.py", line 3641, in _do_rebuild_instance_with_claim
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server claim_context = rebuild_claim(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py", line 395, in inner
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return f(*args, **kwargs)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/resource_tracker.py", line 204, in rebuild_claim
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server return self._move_claim(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/resource_tracker.py", line 348, in _move_claim
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server self._update_usage_from_migration(context, instance, migration,
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1411, in _update_usage_from_migration
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server self._update_usage(usage, nodename)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1326, in _update_usage
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server cn.numa_topology = hardware.numa_usage_from_instance_numa(
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/virt/hardware.py", line 2567, in numa_usage_from_instance_numa
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server new_cell.pin_cpus(pinned_cpus)
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/objects/numa.py", line 95, in pin_cpus
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server raise exception.CPUPinningInvalid(requested=list(cpus),
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server nova.exception.CPUPinningInvalid: CPU set to pin [0] must be a subset of free CPU set [1]
aug 24 13:50:33 devstack0a nova-compute[246038]: ERROR oslo_messaging.rpc.server

devstack 90e5479f
nova ddcc286ee1
hypervisor: libvirt/kvm
libvirt 8.0.0-1ubuntu7.1
linux 5.15.0-46-generic
networking: neutron ml2/ovs

This environment had the default 2 scheduler workers, but AFAIU this should not matter, since both vms should fit on the target host and this means this is likely a bug in the compute and not in the scheduler.

Let me know if I can collect more information.

summary: - Concurrent evacuation of vms with pinned cpus fail randomly
+ Concurrent evacuation of vms with pinned cpus to the same host fail
+ randomly
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

@Bence: thanks for the report. Could you attach the nova-compute debug log from devstack0a (the target host of the evacuation) that contains at least the time frame of the two evac request arriving and then the second one failing? This would help a lot seeing what steps are overlapping.

Revision history for this message
Bence Romsics (bence-romsics) wrote :

The following UTC timestamps may be relevant for the attached n-cpu.log:

2022-09-01 13:51:27 just before started booting two vms on devstack0
2022-09-01 13:51:54 just before started the evacuations concurrently
2022-09-01 13:52:12 just after one vm went to ACTIVE and the other to ERROR

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :
Download full text (3.4 KiB)

This looks like our locking is broken:

$ grep -i 'lock "compute_resources"' n-cpu.log | egrep -e 'req-e892fc24-0651-4c08-be7c-551c1ecc7bc7|req-e1b62bbc-91d8-46ca-bba4-451391c6b1a1'
2022-09-01T13:52:02.144032+0000 devstack0a nova-compute[79776]: DEBUG oslo_concurrency.lockutils [None req-e892fc24-0651-4c08-be7c-551c1ecc7bc7 admin admin] Acquiring lock "compute_resources" by "nova.compute.resource_tracker.ResourceTracker.rebuild_claim" {{(pid=79776) inner /usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py:385}}
2022-09-01T13:52:02.144786+0000 devstack0a nova-compute[79776]: DEBUG oslo_concurrency.lockutils [None req-e892fc24-0651-4c08-be7c-551c1ecc7bc7 admin admin] Lock "compute_resources" acquired by "nova.compute.resource_tracker.ResourceTracker.rebuild_claim" :: waited 0.001s {{(pid=79776) inner /usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py:390}}
2022-09-01T13:52:02.147508+0000 devstack0a nova-compute[79776]: DEBUG oslo_concurrency.lockutils [None req-e1b62bbc-91d8-46ca-bba4-451391c6b1a1 admin admin] Acquiring lock "compute_resources" by "nova.compute.resource_tracker.ResourceTracker.rebuild_claim" {{(pid=79776) inner /usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py:385}}
2022-09-01T13:52:02.148136+0000 devstack0a nova-compute[79776]: DEBUG oslo_concurrency.lockutils [None req-e1b62bbc-91d8-46ca-bba4-451391c6b1a1 admin admin] Lock "compute_resources" acquired by "nova.compute.resource_tracker.ResourceTracker.rebuild_claim" :: waited 0.001s {{(pid=79776) inner /usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py:390}}
2022-09-01T13:52:02.388504+0000 devstack0a nova-compute[79776]: DEBUG oslo_concurrency.lockutils [None req-e1b62bbc-91d8-46ca-bba4-451391c6b1a1 admin admin] Lock "compute_resources" "released" by "nova.compute.resource_tracker.ResourceTracker.rebuild_claim" :: held 0.240s {{(pid=79776) inner /usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py:404}}
2022-09-01T13:52:02.455376+0000 devstack0a nova-compute[79776]: DEBUG oslo_concurrency.lockutils [None req-e892fc24-0651-4c08-be7c-551c1ecc7bc7 admin admin] Lock "compute_resources" "released" by "nova.compute.resource_tracker.ResourceTracker.rebuild_claim" :: held 0.311s {{(pid=79776) inner /usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py:404}}
2022-09-01T13:52:07.640135+0000 devstack0a nova-compute[79776]: DEBUG oslo_concurrency.lockutils [None req-e892fc24-0651-4c08-be7c-551c1ecc7bc7 admin admin] Acquiring lock "compute_resources" by "nova.compute.resource_tracker.ResourceTracker.finish_evacuation" {{(pid=79776) inner /usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py:385}}
2022-09-01T13:52:07.640764+0000 devstack0a nova-compute[79776]: DEBUG oslo_concurrency.lockutils [None req-e892fc24-0651-4c08-be7c-551c1ecc7bc7 admin admin] Lock "compute_resources" acquired by "nova.compute.resource_tracker.ResourceTracker.finish_evacuation" :: waited 0.001s {{(pid=79776) inner /usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py:390}}
2022-09-01T13:52:08.053607+0000 devstack0a nova-compute[79776]: DEBUG oslo_concurrency.lockutils [None re...

Read more...

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I think fasteners >= 0.15 + eventlet.spawn_n breaks oslo.concurrency fair locking. Here is a minimal reproduction that showing the issue: https://gist.github.com/gibizer/9051369e67fd46a20d52963dac534852

It seems it all boils down to the eventlet issue: https://github.com/eventlet/eventlet/issues/731

Nova did a test only fix in the past https://review.opendev.org/c/openstack/nova/+/813114 but something more global is needed as oslo.concurrency uses fasteners' ReaderWriterLock under the hood to implement fair locking: https://github.com/openstack/oslo.concurrency/blob/5.0.0/oslo_concurrency/lockutils.py#L287

We can try to adapt the fix from https://review.opendev.org/c/openstack/nova/+/813114 and apply it for every fair lock created by oslo.

Changed in nova:
status: New → Confirmed
status: Confirmed → Triaged
importance: Undecided → Critical
tags: added: oslo zed-rc-potential
Revision history for this message
melanie witt (melwitt) wrote :

Adding a comment here to add context to the IRC discussion [1] about nova's use of other libraries that culminates in calls of eventlet spawn_n().

The example I ran into is described here [2] where I tried a patch to string find-and-replace of spawn_n with spawn everywhere in nova. On top of that patch I did a now abandoned poison patch to poison any use of spawn_n [3]. The poison patch failed pretty much everywhere in func tests because of an eventual indirect call of spawn_n() by oslo.messaging.

It might be easier to see what's going on if you reproduce what I saw locally and look at the failures:

  $ git review -d 818042

  $ tox -efunctional -r

(I needed the -r because we no longer use the mock library and the patch is old)

So, due to that result, I thought I had found that simply doing a string replacement of spawn_n to spawn would not be 100% guaranteed across the board to solve the problem.

To be clear, I'm not saying that the oslo.messaging call of spawn_n() is a problem. What I meant was to point out that it would be "possible" for other things to break if third-party calls of spawn_n() lead to a similar result as we have seen with fasteners, if we went with only a string replacement approach.

However, doing an actual patching of spawn_n to spawn globally might work [4]:

  import eventlet
  eventlet.spawn_n = eventlet.spawn
  eventlet.convenient.spawn_n = eventlet.spawn
  eventlet.monkey_patch()

In [4] the eventlet maintainer says they don't recommend doing ^ in production code. But I'm not sure we have a choice.

Another approach would be to patch only threading.current_thread with eventlet.getcurrent globally.

If we patched only fasteners, it might leave us open to other problems if any other library is using threading.current_thread in a similar manner.

Finally, we might be able to just do the string replacement thing to get out of the current situation. We could try it if we have a good reproducer.

I just did not feel certain whether it would fix the real world examples people are having like what is reported in this bug. At the time, I knew of no real world failure happening due to the fasteners issue and did not want to push for controversial changes in nova (it was hard enough to get even the test-only patch landed) without having any anecdotal proof. All I knew was that in theory it would cause problems in nova, not limited to only testing, but no operator had reported it yet.

I hope some part of this is helpful.

[1] https://meetings.opendev.org/irclogs/%23openstack-nova/%23openstack-nova.2022-09-02.log.html#t2022-09-02T16:50:51
[2] https://review.opendev.org/c/openstack/nova/+/813114/3//COMMIT_MSG
[3] https://review.opendev.org/c/openstack/nova/+/818042
[4] https://github.com/eventlet/eventlet/issues/731#issuecomment-968135262

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.concurrency (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/oslo.concurrency/+/855713

Changed in oslo.concurrency:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.concurrency (master)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/855717

Changed in nova:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/nova/+/855718

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to oslo.concurrency (master)

Reviewed: https://review.opendev.org/c/openstack/oslo.concurrency/+/855713
Committed: https://opendev.org/openstack/oslo.concurrency/commit/796203c94846d44237d869e3b20a6cd8a3602e36
Submitter: "Zuul (22348)"
Branch: master

commit 796203c94846d44237d869e3b20a6cd8a3602e36
Author: Balazs Gibizer <email address hidden>
Date: Sat Sep 3 13:16:56 2022 +0200

    Prove that spawn_n with fair lock is broken

    The fasteners lib in version 0.15.0 removed the
    threading.current_thread workaround for eventlet[1] because eventlet
    seemed to fixed the current_thread issues tracked in [2]. However the
    fix for [2] only fixed half of the problem. The threading.current_thread
    call works if it is called from thread created by eventlet.spawn.
    However if the thread is created with eventlet.spawn_n then
    threading.current_thread is still broken and returns the ID of the
    python native thread.

    The fasteners' ReaderWriterLock depends heavily on
    threading.current_thread to decide which thread holds a lock and to
    allow re-entry of that thread. This leads to the situation that
    multiple threads created from spawn_n could take the same
    ReaderWriterLock at the same time.

    The fair internal lock in oslo.concurrency uses ReaderWriterLock and
    as a result such lock is broken for threads created with spawn_n.

    Note that this issue was raised with eventlet in [3] when the nova team
    detected it via a direct usage of ReaderWriterLock in the nova test
    code. As [3] did not lead to a solution in eventlet nova implemented a
    nova local fix for the test code in [4].

    However now we detected that oslo.concurrency is affected by this issue
    as well.

    This patch adds tests to show the scope of the problem.

    Note that the coverage tox target is changed to explicitly enable native
    threading otherwise it runs eventlet specific tests in a native
    environment.

    Also note that [5] was opened to reintroduce the workaround[1] in fasteners.

    [1] https://github.com/harlowja/fasteners/commit/467ed75ee1e9465ebff8b5edf452770befb93913
    [2] https://github.com/eventlet/eventlet/issues/172
    [3] https://github.com/eventlet/eventlet/issues/731
    [4] https://review.opendev.org/c/openstack/nova/+/813114
    [5] https://github.com/harlowja/fasteners/issues/96

    Related-Bug: #1988311
    Change-Id: Ibc193c855b49b95b46ebd2aac82ea89e33f885f0

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.concurrency (master)

Reviewed: https://review.opendev.org/c/openstack/oslo.concurrency/+/855714
Committed: https://opendev.org/openstack/oslo.concurrency/commit/ee3f73a13379a79282325906787aae7da0f6ac27
Submitter: "Zuul (22348)"
Branch: master

commit ee3f73a13379a79282325906787aae7da0f6ac27
Author: Balazs Gibizer <email address hidden>
Date: Sat Sep 3 15:35:35 2022 +0200

    Fix fair internal lock used from eventlet.spawn_n

    The fasteners lib in version 0.15.0 removed the
    threading.current_thread workaround for eventlet[1] because eventlet
    seemed to fixed the current_thread issues tracked in [2]. However the
    fix for [2] only fixed half of the problem. The threading.current_thread
    call works if it is called from thread created by eventlet.spawn.
    However if the thread is created with eventlet.spawn_n then
    threading.current_thread is still broken and returns the ID of the
    python native thread.

    The fasteners' ReaderWriterLock depends heavily on
    threading.current_thread to decide which thread holds a lock and to
    allow re-entry of that thread. This leads to the situation that
    multiple threads created from spawn_n could take the same
    ReaderWriterLock at the same time.

    The fair internal lock in oslo.concurrency uses ReaderWriterLock and
    as a result such lock is broken for threads created with spawn_n.

    Note that this issue was raised with eventlet in [3] when the nova team
    detected it via a direct usage of ReaderWriterLock in the nova test
    code. As [3] did not lead to a solution in eventlet nova implemented a
    nova local fix for the test code in [4].

    However now we detected that oslo.concurrency is affected by this issue
    as well.

    This patch restores the workaround that was removed by [1].

    Note that a fasteners issue [5] also opened to restore the
    workaround[1].

    [1] https://github.com/harlowja/fasteners/commit/467ed75ee1e9465ebff8b5edf452770befb93913
    [2] https://github.com/eventlet/eventlet/issues/172
    [3] https://github.com/eventlet/eventlet/issues/731
    [4] https://review.opendev.org/c/openstack/nova/+/813114
    [5] https://github.com/harlowja/fasteners/issues/96

    Closes-Bug: #1988311
    Change-Id: Ia873bcc6b07121c9bd0b94c593567d537b4c1112

Changed in oslo.concurrency:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.concurrency 5.0.1

This issue was fixed in the openstack/oslo.concurrency 5.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.concurrency (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/oslo.concurrency/+/856120

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Setting to High as we need to bump our requirements on master to prevent older releases of oslo.concurrency.

Also, need to backport the patch into stable releases of oslo.concurrency for Yoga.

Changed in nova:
importance: Critical → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/857491

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by "Balazs Gibizer <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/855717
Reason: it is fixed in oslo so lets use that https://review.opendev.org/c/openstack/nova/+/857491

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/857491
Committed: https://opendev.org/openstack/nova/commit/8f5fd2ae4609c6d5b2ba2465e04c99e58be58e7a
Submitter: "Zuul (22348)"
Branch: master

commit 8f5fd2ae4609c6d5b2ba2465e04c99e58be58e7a
Author: Balazs Gibizer <email address hidden>
Date: Tue Sep 13 18:09:00 2022 +0200

    Bump min oslo.concurrencty to >= 5.0.1

    The oslo.concurrency 5.0.1 fixes the fair lock we use heavily in the
    ResourceTracker.

    Closes-Bug: #1988311
    Change-Id: I68914b2e21726138ee9a178fdf6a8bb6389c09be

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 26.0.0.0rc1

This issue was fixed in the openstack/nova 26.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/nova/+/859421

Revision history for this message
Brian Brookshire (bkbrookshire) wrote :

We are hitting this issue constantly and this issue seems to be addressed upstream. Is there a fix for Yoga?

Revision history for this message
Bence Romsics (bence-romsics) wrote :

Hi Brian,

The patches for Yoga were proposed:
https://review.opendev.org/q/topic:bug/1988311+branch:stable/yoga

But as you can see from the comments and votes they are not going to get merged upstream (because of a major version bump in a required library). If you're willing to tackle the consequences of the bump, you can take the patches of course.

Also note that this bug was fixed and until then worked around in multiple components, so technically likely you can find alternate solutions by bumping other components, see:
https://review.opendev.org/c/openstack/oslo.concurrency/+/855714 or
https://github.com/harlowja/fasteners/issues/96

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.