Greetings,
a couple of days ago we upgraded Octavia to Yoga (10.1.0) in our test environment.
We also upgraded our octavia-tempest-plugin to 2.4.1 to get the new Prometheus listener tests.
Since those upgrades, tempest fails in its tearDownClass for `octavia_tempest_plugin.tests.api.v2.test_listener.ListenerAPITest.*`.
As this fails 'almost' every time for us, I tried to debug it, and it looks to me like there could be a race condition in the cascade delete.
The traceback showing why the cascade delete fails is the following:
Traceback (most recent call last):
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/taskflow/engines/action_engine/executor.py", line 53, in _execute_task
    result = task.execute(**arguments)
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/octavia/controller/worker/v2/tasks/network_tasks.py", line 704, in execute
    self.network_driver.update_vip(loadbalancer, for_delete=True)
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/octavia/network/drivers/neutron/allowed_address_pairs.py", line 644, in update_vip
    self._update_security_group_rules(load_balancer,
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/octavia/network/drivers/neutron/allowed_address_pairs.py", line 221, in _update_security_group_rules
    self._create_security_group_rule(sec_grp_id, port_protocol[1],
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/octavia/network/drivers/neutron/base.py", line 160, in _create_security_group_rule
    self.neutron_client.create_security_group_rule(rule)
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/neutronclient/v2_0/client.py", line 1049, in create_security_group_rule
    return self.post(self.security_group_rules_path, body=body)
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/neutronclient/v2_0/client.py", line 361, in post
    return self.do_request("POST", action, body=body,
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/neutronclient/v2_0/client.py", line 297, in do_request
    self._handle_fault_response(status_code, replybody, resp)
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/neutronclient/v2_0/client.py", line 272, in _handle_fault_response
    exception_handler_v20(status_code, error_body)
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/neutronclient/v2_0/client.py", line 90, in exception_handler_v20
    raise client_exc(message=error_message,
neutronclient.common.exceptions.Conflict: Security group rule already exists. Rule id is 08bedc57-cc6e-41bb-8a13-597887980dc5.
Neutron server returns request_ids: ['req-f1bdc5cc-bfda-412d-952a-98eb4e18dc81']
This is getting triggered from the following flow:
Task 'delete_update_vip_8beed3b6-b8e8-472b-a9a4-883a52675176' (33c5a41f-f3ab-4406-831e-4175d353d585) transitioned into state 'FAILURE' from state 'RUNNING'
After digging through the code, it seems the delete goes through the code at [1], which it should never hit on a delete task, should it?
If I downgrade the octavia-tempest-plugin to a version that does not include the Prometheus protocol, the delete always works without any issue, which makes me believe that there might be some race condition when the new Prometheus listener is configured on a load balancer.
The load balancer that ends up in provisioning_status ERROR after the failed cascade delete can be deleted correctly by executing a cascade delete on it a second time.
Does anyone have an idea what could be triggering this?
[1] https://github.com/openstack/octavia/blob/10.1.0/octavia/network/drivers/neutron/allowed_address_pairs.py#L220-L225
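For context, the region referenced in [1] roughly does the following. This is a hand-written paraphrase, not the actual Octavia source: the method and attribute names are taken from the traceback above, while the listener filtering and the keyword arguments are my assumptions.

# Rough paraphrase of [1] -- not the real Octavia code. Names such as
# _create_security_group_rule() and port_protocol come from the traceback;
# the listener filtering and the port_min/port_max kwargs are assumptions.
def _update_security_group_rules(driver, load_balancer, sec_grp_id):
    # Collect the (port, protocol) pairs the listeners still reference.
    desired_rules = {
        (listener.protocol_port, listener.protocol)
        for listener in load_balancer.listeners
    }

    # For every pair the driver believes is missing, it issues a POST to
    # Neutron. If the rule already exists (e.g. it was just created for the
    # new PROMETHEUS listener by a concurrent update), Neutron answers with
    # 409 Conflict -- the exception in the traceback above.
    for port_protocol in desired_rules:
        driver._create_security_group_rule(
            sec_grp_id,
            port_protocol[1],            # protocol
            port_min=port_protocol[0],   # assumed kwargs
            port_max=port_protocol[0])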
a similar issue was reported (by me) for the scenario tests:
https://storyboard.openstack.org/#!/story/2010338
IIRC I investigated this issue but didn't find anything.
@maximilian is it 100% reproducible in your env?
in the CI, we run the API tests in noop mode (meaning we have dummy neutron/amphora/etc. drivers), so maybe we don't see this issue because it's hidden by the dummy drivers. We can try to create a job with the default drivers to test it.
--
a few notes from the storyboard:
It's weird that Octavia sends a POST request to Neutron during the deletion of an LB. Maybe on (cascade) DELETE requests, we could:
* either avoid adding/creating resources
* or ignore conflicts (see the rough sketch after this list)
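For the second option, a minimal sketch of what ignoring the conflict could look like in the driver's _create_security_group_rule() wrapper (base.py in the traceback). This is not a proposed patch: the for_delete flag and the exact rule body fields are assumptions here, only the Conflict exception class is taken from the traceback.

from neutronclient.common import exceptions as neutron_exceptions

# Sketch only: the signature and the for_delete plumbing are hypothetical.
def _create_security_group_rule(self, sec_grp_id, protocol,
                                port_min=None, port_max=None,
                                for_delete=False):
    rule = {'security_group_rule': {
        'security_group_id': sec_grp_id,
        'direction': 'ingress',
        'protocol': protocol,
        'port_range_min': port_min,
        'port_range_max': port_max,
    }}
    try:
        self.neutron_client.create_security_group_rule(rule)
    except neutron_exceptions.Conflict:
        if not for_delete:
            raise
        # On a (cascade) delete, the rule already existing is harmless: the
        # whole security group is removed right afterwards, so just move on.

Something along those lines would keep the delete flow from transitioning to FAILURE when it races with a rule that was already created.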