Env: Focal/Ussuri
Client: python3-openstackclient 5.2.0-0ubuntu1.20.04.1
When a user submits a create request (e.g. subnet create) to the OpenStack API and the request is processed synchronously (unlike server create, for example), the client waits for the API response. If the request takes longer to complete than the client idle timeout configured in front of the API backend (e.g. the haproxy 'timeout client'), the server side closes the TCP connection to the openstackclient. In that case the client automatically repeats the POST request, which is the wrong default behaviour for a non-idempotent operation.
Below are logs from an 'openstack subnet create' run where execution of the initial request took over 100 s (keystone auth logs omitted for clarity):
# the command:
os subnet create \
--network cd275f37-deec-4f0f-9ad3-4816ca52678e \
--gateway 172.16.0.1 \
--allocation-pool start=172.16.0.100,end=172.16.0.200 \
--subnet-range 172.16.0.0/24 \
--dns-nameserver 8.8.8.8 \
--no-dhcp external-subnet-2 \
--debug
# the command output:
Network client initialized using OpenStack SDK: <openstack.network.v2._proxy.Proxy object at 0x7f7ec8f34490>
REQ: curl -g -i -X GET http://10.5.100.3:9696/v2.0/networks/cd275f37-deec-4f0f-9ad3-4816ca52678e -H "User-Agent: openstacksdk/0.46.0 keystoneauth1/4.0.0 python-requests/2.22.0 CPython/3.8.10" -H "X-Auth-Token: {SHA256}6d3a82e66bdeb642bcfef551c78dbc0063e1bad49431683e6a0023f28dbc1d7d"
Starting new HTTP connection (1): 10.5.100.3:9696
http://10.5.100.3:9696 "GET /v2.0/networks/cd275f37-deec-4f0f-9ad3-4816ca52678e HTTP/1.1" 200 726
RESP: [200] Connection: keep-alive Content-Length: 726 Content-Type: application/json Date: Thu, 16 Mar 2023 07:23:59 GMT X-Openstack-Request-Id: req-3abe5525-cde2-436d-bf39-34746d76f07a
RESP BODY: {"network":{"id":"cd275f37-deec-4f0f-9ad3-4816ca52678e","name":"ext_net","tenant_id":"d6761dacbb0649189a07a4a1a191a8c0","admin_state_up":true,"mtu":1500,"status":"ACTIVE","subnets":["040a6d57-b6fc-436a-b531-03baf4abb609"],"shared":false,"availability_zone_hints":[],"availability_zones":["nova"],"ipv4_address_scope":null,"ipv6_address_scope":null,"router:external":true,"description":"","port_security_enabled":true,"dns_domain":"","l2_adjacency":true,"is_default":false,"tags":[],"created_at":"2022-05-06T10:18:25Z","updated_at":"2023-03-16T06:35:33Z","revision_number":16,"project_id":"d6761dacbb0649189a07a4a1a191a8c0","provider:network_type":"flat","provider:physical_network":"physnet1","provider:segmentation_id":null}}
GET call to network for http://10.5.100.3:9696/v2.0/networks/cd275f37-deec-4f0f-9ad3-4816ca52678e used request id req-3abe5525-cde2-436d-bf39-34746d76f07a
REQ: curl -g -i -X POST http://10.5.100.3:9696/v2.0/subnets -H "Content-Type: application/json" -H "User-Agent: openstacksdk/0.46.0 keystoneauth1/4.0.0 python-requests/2.22.0 CPython/3.8.10" -H "X-Auth-Token: {SHA256}6d3a82e66bdeb642bcfef551c78dbc0063e1bad49431683e6a0023f28dbc1d7d" -d '{"subnet": {"allocation_pools": [{"start": "172.16.0.100", "end": "172.16.0.200"}], "ip_version": 4, "gateway_ip": "172.16.0.1", "name": "external-subnet-2", "dns_nameservers": ["8.8.8.8"], "enable_dhcp": false, "cidr": "172.16.0.0/24", "network_id": "cd275f37-deec-4f0f-9ad3-4816ca52678e"}}'
Failure: Unable to establish connection to http://10.5.100.3:9696/v2.0/subnets: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')). Retrying in 0.5s.
Starting new HTTP connection (2): 10.5.100.3:9696
http://10.5.100.3:9696 "POST /v2.0/subnets HTTP/1.1" 400 221
RESP: [400] Connection: keep-alive Content-Length: 221 Content-Type: application/json Date: Thu, 16 Mar 2023 07:25:30 GMT X-Openstack-Request-Id: req-d0632669-e1ce-4157-b5a5-e52be6e51873
RESP BODY: {"NeutronError": {"type": "InvalidInput", "message": "Invalid input for operation: Requested subnet with cidr: 172.16.0.0/24 for network: cd275f37-deec-4f0f-9ad3-4816ca52678e overlaps with another subnet.", "detail": ""}}
POST call to network for http://10.5.100.3:9696/v2.0/subnets used request id req-d0632669-e1ce-4157-b5a5-e52be6e51873
BadRequestException: 400: Client Error for url: http://10.5.100.3:9696/v2.0/subnets, Invalid input for operation: Requested subnet with cidr: 172.16.0.0/24 for network: cd275f37-deec-4f0f-9ad3-4816ca52678e overlaps with another subnet.
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/cliff/app.py", line 400, in run_subcommand
result = cmd.run(parsed_args)
File "/usr/lib/python3/dist-packages/osc_lib/command/command.py", line 41, in run
return super(Command, self).run(parsed_args)
File "/usr/lib/python3/dist-packages/cliff/display.py", line 117, in run
column_names, data = self.take_action(parsed_args)
File "/usr/lib/python3/dist-packages/openstackclient/network/v2/subnet.py", line 377, in take_action
obj = client.create_subnet(**attrs)
File "/usr/lib/python3/dist-packages/openstack/network/v2/_proxy.py", line 3491, in create_subnet
return self._create(_subnet.Subnet, **attrs)
File "/usr/lib/python3/dist-packages/openstack/proxy.py", line 417, in _create
return res.create(self, base_path=base_path)
File "/usr/lib/python3/dist-packages/openstack/resource.py", line 1291, in create
self._translate_response(response, has_body=has_body)
File "/usr/lib/python3/dist-packages/openstack/resource.py", line 1113, in _translate_response
exceptions.raise_from_response(response, error_message=error_message)
File "/usr/lib/python3/dist-packages/openstack/exceptions.py", line 234, in raise_from_response
raise cls(
openstack.exceptions.BadRequestException: BadRequestException: 400: Client Error for url: http://10.5.100.3:9696/v2.0/subnets, Invalid input for operation: Requested subnet with cidr: 172.16.0.0/24 for network: cd275f37-deec-4f0f-9ad3-4816ca52678e overlaps with another subnet.
clean_up CreateSubnet: BadRequestException: 400: Client Error for url: http://10.5.100.3:9696/v2.0/subnets, Invalid input for operation: Requested subnet with cidr: 172.16.0.0/24 for network: cd275f37-deec-4f0f-9ad3-4816ca52678e overlaps with another subnet.
END return value: 1
From neutron-server.log, the first request from the client came in:
2023-03-16 07:23:59.608 2384978 DEBUG neutron.api.v2.base [req-66884669-2253-4faf-95a4-ebbe353896a0 93eeeebc2920450faaaf2395443505fd d6761dacbb0649189a07a4a1a191a8c0 - 6a4439622e71431c8b96073e33c3b7e1 6a4439622e71431c8b96073e33c3b7e1] Request body: {'subnet': {'allocation_pools': [{'start': '172.16.0.100', 'end': '172.16.0.200'}], 'ip_version': 4, 'gateway_ip': '172.16.0.1', 'name': 'external-subnet-2', 'dns_nameservers': ['8.8.8.8'], 'enable_dhcp': False, 'cidr': '172.16.0.0/24', 'network_id': 'cd275f37-deec-4f0f-9ad3-4816ca52678e'}} prepare_request_body /usr/lib/python3/dist-packages/neutron/api/v2/base.py:719
and ended successfully after 131s:
2023-03-16 07:26:10.768 2384978 INFO neutron.wsgi [req-66884669-2253-4faf-95a4-ebbe353896a0 93eeeebc2920450faaaf2395443505fd d6761dacbb0649189a07a4a1a191a8c0 - 6a4439622e71431c8b96073e33c3b7e1 6a4439622e71431c8b96073e33c3b7e1] 10.5.2.174 "POST /v2.0/subnets HTTP/1.1" status: 201 len: 0 time: 131.1619370
But in the meantime, a second request arrived as a result of the client's retry:
2023-03-16 07:25:30.129 2384976 DEBUG neutron.api.v2.base [req-d0632669-e1ce-4157-b5a5-e52be6e51873 93eeeebc2920450faaaf2395443505fd d6761dacbb0649189a07a4a1a191a8c0 - 6a4439622e71431c8b96073e33c3b7e1 6a4439622e71431c8b96073e33c3b7e1] Request body: {'subnet': {'allocation_pools': [{'start': '172.16.0.100', 'end': '172.16.0.200'}], 'ip_version': 4, 'gateway_ip': '172.16.0.1', 'name': 'external-subnet-2', 'dns_nameservers': ['8.8.8.8'], 'enable_dhcp': False, 'cidr': '172.16.0.0/24', 'network_id': 'cd275f37-deec-4f0f-9ad3-4816ca52678e'}} prepare_request_body /usr/lib/python3/dist-packages/neutron/api/v2/base.py:719
And of course it failed validation:
2023-03-16 07:25:30.409 2384976 INFO neutron.wsgi [req-d0632669-e1ce-4157-b5a5-e52be6e51873 93eeeebc2920450faaaf2395443505fd d6761dacbb0649189a07a4a1a191a8c0 - 6a4439622e71431c8b96073e33c3b7e1 6a4439622e71431c8b96073e33c3b7e1] 10.5.2.174 "POST /v2.0/subnets HTTP/1.1" status: 400 len: 429 time: 0.2894270
The first request completed successfully, but the user would not know that. Instead, they got the misleading error shown above.
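Despite the error returned to the client, the subnet from the first request does exist; this can be confirmed with, for example:
# verify the subnet was actually created despite the client-side error:
openstack subnet list --network cd275f37-deec-4f0f-9ad3-4816ca52678e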
This problem comes from the keystoneauth1 library used by python-openstackclient:
https://github.com/openstack/keystoneauth/blob/stable/ussuri/keystoneauth1/session.py#L1019
https://github.com/openstack/keystoneauth/blob/stable/ussuri/keystoneauth1/session.py#L1031
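For illustration, the behaviour boils down to the pattern below. This is only a simplified sketch of a retry-on-connection-failure loop, not the actual keystoneauth1 code; the function and parameter names are mine, and keystoneauth1's real handling of connect_retries and status-based retries is more involved:

import time
import requests

def post_with_retry(url, json_body, retries=3, backoff=0.5):
    # A POST is not idempotent: if the proxy idle timeout closed the
    # connection while the backend was still processing the request,
    # this loop re-submits an operation that may already have succeeded.
    attempt = 0
    while True:
        try:
            return requests.post(url, json=json_body, timeout=600)
        except requests.exceptions.ConnectionError:
            attempt += 1
            if attempt > retries:
                raise
            # keystoneauth1 logs a similar "Retrying in Ns" message at this point
            time.sleep(backoff)
            backoff *= 2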
In order to reproduce the problem I set a Python debugger breakpoint in the neutron server code, here:
https://opendev.org/openstack/neutron/src/branch/master/neutron/api/rpc/handlers/resources_rpc.py#L237
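The same effect can also be reproduced without an interactive breakpoint by pausing that code path with a plain sleep, long enough to exceed the proxy idle timeout. A throwaway edit along these lines (the 120 s value is arbitrary, just above the default 90 s haproxy timeout; for a test deployment only):

# throwaway edit near neutron/api/rpc/handlers/resources_rpc.py:237
# simulates slow fan-out of resource-change notifications to agents
import time
time.sleep(120)  # keep the subnet-create POST busy longer than the haproxy idle timeout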
But the problem was originally observed in a production environment with 60+ compute nodes (DVR), where pushing the change notifications triggered by a subnet creation takes over 100 s.
Kind regards,
Marcin
Increasing the default 'timeout client' and 'timeout server' values on the neutron-api haproxy (from the default 90000 ms) is a workaround, but the client should still handle this type of situation differently.
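For reference, the workaround amounts to raising something like the following in the haproxy configuration in front of neutron-api (the exact section and values are deployment-specific; 300000 ms is just an example comfortably above the slowest observed request):

defaults
    timeout client  300000
    timeout server  300000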
Cheers,
Marcin