This is a genuine race that is still present in master, but for some reason I haven't managed to work out, I can only reproduce it on Rocky. I can reproduce it reliably on Rocky by adding a pdb breakpoint (a sleep doesn't work) to stub_check_num_instances_quota in test_bfv_quota_race_local_delete. The same does not reproduce on master, although the bug is still present there.
The issue is in conductor manager. When the check_num_instances_quota check raises, in _cleanup_build_artifacts we create the instance mapping before creating the BDMs and tags in the target cell, i.e. before the instance record in that cell is complete. When the functional test races, the fetch in _wait_for_state_change pulls the instance from the cell before it has been fully written, and therefore before the tags are present. See compute.API._get_instance().
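To make the ordering problem concrete, here is a minimal, deterministic sketch under stated assumptions. It is a toy model: `Cell`, `Mappings`, `get_instance`, `cleanup_build_artifacts`, `observe` and the `switch()` hook are hypothetical stand-ins, not Nova's objects, DB API or conductor code. `switch()` plays the role of the pdb breakpoint: it forces the reader to run at the point where another greenthread could be scheduled, between writing the instance and writing its tags.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Toy model only: these classes and functions are hypothetical stand-ins,
# not Nova's real objects, DB API or conductor code.


@dataclass
class Cell:
    instances: Dict[str, dict] = field(default_factory=dict)
    tags: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class Mappings:
    # instance uuid -> cell name. A reader only looks in a cell once the
    # instance is mapped there, so writing the mapping makes it visible.
    cell_of: Dict[str, str] = field(default_factory=dict)


def get_instance(mappings: Mappings, cells: Dict[str, Cell],
                 uuid: str) -> Optional[dict]:
    """Reader, loosely analogous to the API GET polled by the test."""
    cell_name = mappings.cell_of.get(uuid)
    if cell_name is None:
        return None  # not mapped yet; a poller would simply retry
    cell = cells[cell_name]
    inst = dict(cell.instances[uuid])
    inst["tags"] = list(cell.tags.get(uuid, []))
    return inst


def cleanup_build_artifacts(mappings: Mappings, cells: Dict[str, Cell],
                            uuid: str, cell_name: str, tags: List[str],
                            switch: Callable[[], None],
                            mapping_last: bool) -> None:
    """Writer: records a failed build in its target cell.

    switch() marks a point where another greenthread could be scheduled.
    """
    cell = cells[cell_name]
    cell.instances[uuid] = {"uuid": uuid, "vm_state": "error"}
    if not mapping_last:
        # Racy order: the instance is visible before its tags exist.
        mappings.cell_of[uuid] = cell_name
    switch()
    cell.tags[uuid] = list(tags)  # BDMs would also be created here
    if mapping_last:
        # Safer order: write the mapping only once the record is complete.
        mappings.cell_of[uuid] = cell_name


def observe(mapping_last: bool) -> List[Optional[dict]]:
    """Return what a reader sees mid-write and after the write completes."""
    cells = {"cell1": Cell()}
    mappings = Mappings()
    seen: List[Optional[dict]] = []

    def switch() -> None:
        seen.append(get_instance(mappings, cells, "abc"))

    cleanup_build_artifacts(mappings, cells, "abc", "cell1", ["foo"],
                            switch, mapping_last)
    seen.append(get_instance(mappings, cells, "abc"))
    return seen


racy = observe(mapping_last=False)
print(racy[0])  # {'uuid': 'abc', 'vm_state': 'error', 'tags': []}
safe = observe(mapping_last=True)
print(safe[0])  # None
```

In the racy order the reader sees an instance already in the error state but with no tags, which is the shape of the functional test failure. Writing the mapping after the BDMs and tags is one way to close the window in this model: a concurrent reader either sees nothing and retries, or sees the complete record.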
I don't understand why this race only reproduces on Rocky; on master the test's GET is just never scheduled in that window. This is most likely some subtle eventlet scheduling difference that isn't important: if the GET were scheduled there on master, it would fail in the same way.