[tempest] "test_security_group_rules_create" unstable in "neutron-ovs-grenade-dvr-multinode"

Bug #2015065 reported by Rodolfo Alonso
Affects: neutron
Status: Confirmed
Importance: Critical
Assigned to: Lajos Katona
Milestone: (none)
Tags: gate-failure
Revision history for this message
Bence Romsics (bence-romsics) wrote :

Frequencies:

11 occurrences out of the last 100 job runs:
logsearch log --project openstack/neutron --job neutron-ovs-grenade-dvr-multinode --limit 100 --file controller/logs/grenade.sh_log.txt 'test_security_group_rules_create .* FAILED'

15 occurrences out of the last 20 failed job runs:
logsearch log --project openstack/neutron --job neutron-ovs-grenade-dvr-multinode --limit 20 --result FAILURE --file controller/logs/grenade.sh_log.txt 'test_security_group_rules_create .* FAILED'

tags: added: gate-failure
Changed in neutron:
status: New → Confirmed
importance: Undecided → Critical
Revision history for this message
Lajos Katona (lajos-katona) wrote :

I checked a few occurrences, and one interesting thing is that these tests (under tempest.api.compute.security_groups...) call the compute API os-security-group-rules; Nova actually proxies these calls to Neutron via neutronclient (see the sketch below).

In the logs there is no sign of an HTTP timeout (e.g.: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_5b6/880006/2/check/neutron-ovs-grenade-multinode/5b615cc/controller/logs/grenade.sh_log.txt)
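
For reference, the proxied operation roughly corresponds to the direct python-neutronclient call sketched below (placeholder auth values; this is a hedged sketch, not Nova's actual proxy code):

```
# Sketch with placeholder credentials/endpoints -- not Nova's proxy code,
# just the equivalent neutronclient call that such a compute API request
# ends up making against Neutron.
from keystoneauth1 import identity, session
from neutronclient.v2_0 import client as neutron_client

auth = identity.Password(auth_url="http://controller/identity",
                         username="demo", password="secret",
                         project_name="demo",
                         user_domain_id="default",
                         project_domain_id="default")
neutron = neutron_client.Client(session=session.Session(auth=auth))

# Equivalent of POST /v2.0/security-group-rules on the Neutron side.
neutron.create_security_group_rule({
    "security_group_rule": {
        "security_group_id": "3e848242-0f2c-4c47-8a97-fc4492ca00de",
        "direction": "ingress",
        "protocol": "icmp",
    },
})
```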

Changed in neutron:
assignee: nobody → Lajos Katona (lajos-katona)
Revision history for this message
Lajos Katona (lajos-katona) wrote :

Similar failures also appear in tempest jobs (please check the opensearch link in comment #4).

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

The same error is happening with "tempest.api.compute.servers.test_attach_interfaces.AttachInterfacesUnderV243Test.test_add_remove_fixed_ip"

Logs: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_50b/879827/24/gate/neutron-ovs-grenade-dvr-multinode/50b5e2d/controller/logs/grenade.sh_log.txt

Revision history for this message
yatin (yatinkarel) wrote :

Took a look at this and below are the findings:

- The issue is seen across multiple projects and jobs and is a random one; I found a related bug in nova created long back[1], but that too doesn't have an RCA.
- Both tls and non-tls jobs are impacted, across different jobs/projects/branches[2].
- Since the issue is seen mostly on stable/zed+, I suspected dbcounter was related; to rule it out I tested disabling it in https://review.opendev.org/c/openstack/neutron/+/883282, but the issue still reproduced[5].
- I checked a couple of job logs and saw 2 categories (stuck[3] vs. taking longer[4] than 60 seconds, or 180 seconds in the rally job). The stuck ones are seen only in nova while the non-stuck ones appear across projects, so these can be considered different issues and investigated separately; this bug can focus on the stuck case.
- In some cases oslo_messaging disconnections were seen, so I'm not sure if that's the issue and whether heartbeat_in_pthread=True would help here, or if it's an issue with eventlet/uwsgi thread monkey patching; I think someone from nova may have some idea about this.
- Next I would like to collect a GMR (Guru Meditation Report) if that can give some hint for the issue.

[1] https://bugs.launchpad.net/tempest/+bug/1999893
[2] HTTPSConnectionPool

By branch:
master 93.4%
stable/zed 2.5%
stable/2023.1 1.6%
stable/wallaby 0.8%
stable/victoria 0.8%

By job:
nova-ceph-multistore 49.6%
glance-multistore-cinder-import 13.1%
tempest-ipv6-only 4.9%
cinder-tempest-plugin-lvm-tgt-barbican 4.1%
nova-next 2.5%

By project:
openstack/nova 49.2%
openstack/glance 15.6%
openstack/cinder-tempest-plugin 15.2%
openstack/tempest 6.6%
openstack/neutron 5.7%

HTTPConnectionPool

By job:
nova-grenade-multinode 48.0%
grenade 12.0%
neutron-ovs-grenade-dvr-multinode 12.0%
grenade-skip-level-always 8.0%
neutron-ovs-grenade-multinode 8.0%

By branch:
master 72.0%
stable/2023.1 24.0%
stable/zed 4.0%

By project:
openstack/nova 56.0%
openstack/neutron 24.0%
openstack/cinder 8.0%
openstack/devstack 8.0%
openstack/tempest 4.0%

[3]
http://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_77c/877533/7/check/neutron-ovs-grenade-multinode/77c03cd/testr_results.html

May 09 11:41:36.941261 np0033988595 <email address hidden>[132073]: DEBUG nova.api.openstack.wsgi [None req-8656e772-7302-4b65-aa2e-515d84339e82 tempest-SecurityGroupRulesTestJSON-1968675211 tempest-SecurityGroupRulesTestJSON-1968675211-project-member] Action: 'create', calling method: <function Controller.__getattribute__.<locals>.version_select at 0x7f5b17edc940>, body: {"security_group_rule": {"parent_group_id": "3e848242-0f2c-4c47-8a97-fc4492ca00de", "ip_protocol": "icmp", "from_port": -1, "to_port": -1}} {{(pid=132073) _process_stack /opt/stack/new/nova/nova/api/openstack/wsgi.py:511}}

May 09 11:41:37.006089 np0033988595 neutron-server[126983]: INFO neutron.wsgi [req-8656e772-7302-4b65-aa2e-515d84339e82 req-ef556ac9-17c8-40b2-9569-56287f4e563e tempest-SecurityGroupRulesTestJSON-1968675211 tempest-SecurityGroupRulesTestJSON-1968675211-project-member] 10.209.38.241 "GET /v2.0/security-groups/3e848242-0f2c-4c47-8a97-fc4492ca00de HTTP/1.1" status: 200 len: 2415 time: 0.0520720

Only the security group GET request is seen on the Neutron side, no rule-create request; the nova worker is stuck.

https://657b14bfa6092f8bf722-e...

Revision history for this message
yatin (yatinkarel) wrote :

<<< - Next I would like to collect a GMR if that can give some hint for the issue.

OK, I was able to reproduce and collect it in [1][2].
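
For reference, a GMR is usually wired up via oslo.reports roughly as in the sketch below (version string and conf handling are placeholders; the exact trigger used in the gate run may differ):

```
# Hedged sketch: enabling oslo.reports' Guru Meditation Report in a service,
# so per-thread tracebacks like the one below can be dumped on demand
# (signal or trigger file, depending on configuration).
from oslo_config import cfg
from oslo_reports import guru_meditation_report as gmr
from oslo_reports import opts as gmr_opts

CONF = cfg.CONF
gmr_opts.set_defaults(CONF)

# Placeholder version string; real services pass their own version.
gmr.TextGuruMeditation.setup_autorun(version="example", conf=CONF)
```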

Also did multiple runs disabling dbcounter[3], but it didn't reproduce in the test patch. Since the issue is random, I'm not sure if it's just a coincidence or if dbcounter makes the issue appear more frequently. We can disable it in some jobs and see if it helps in reducing the occurrence, as disabling it won't harm.

Stuck thread traceback:
/opt/stack/new/nova/nova/api/openstack/urlmap.py:305 in __call__
    `return app(environ, start_response)`

/opt/stack/new/nova/nova/api/openstack/urlmap.py:202 in wrap
    `return app(environ, start_response)`

/usr/local/lib/python3.10/dist-packages/webob/dec.py:129 in __call__
    `resp = self.call_func(req, *args, **kw)`

/usr/local/lib/python3.10/dist-packages/webob/dec.py:193 in call_func
    `return self.func(req, *args, **kwargs)`

/usr/local/lib/python3.10/dist-packages/oslo_middleware/base.py:124 in __call__
    `response = req.get_response(self.application)`

/usr/local/lib/python3.10/dist-packages/webob/request.py:1313 in send
    `status, headers, app_iter = self.call_application(`

/usr/local/lib/python3.10/dist-packages/webob/request.py:1278 in call_application
    `app_iter = application(self.environ, start_response)`

/usr/local/lib/python3.10/dist-packages/webob/dec.py:129 in __call__
    `resp = self.call_func(req, *args, **kw)`

/usr/local/lib/python3.10/dist-packages/webob/dec.py:193 in call_func
    `return self.func(req, *args, **kwargs)`

/usr/local/lib/python3.10/dist-packages/oslo_middleware/base.py:124 in __call__
    `response = req.get_response(self.application)`

/usr/local/lib/python3.10/dist-packages/webob/request.py:1313 in send
    `status, headers, app_iter = self.call_application(`

/usr/local/lib/python3.10/dist-packages/webob/request.py:1278 in call_application
    `app_iter = application(self.environ, start_response)`

/usr/local/lib/python3.10/dist-packages/webob/dec.py:129 in __call__
    `resp = self.call_func(req, *args, **kw)`

/usr/local/lib/python3.10/dist-packages/webob/dec.py:193 in call_func
    `return self.func(req, *args, **kwargs)`

/usr/local/lib/python3.10/dist-packages/oslo_middleware/request_id.py:58 in __call__
    `response = req.get_response(self.application)`

/usr/local/lib/python3.10/dist-packages/webob/request.py:1313 in send
    `status, headers, app_iter = self.call_application(`

/usr/local/lib/python3.10/dist-packages/webob/request.py:1278 in call_application
    `app_iter = application(self.environ, start_response)`

/usr/local/lib/python3.10/dist-packages/webob/dec.py:129 in __call__
    `resp = self.call_func(req, *args, **kw)`

/usr/local/lib/python3.10/dist-packages/webob/dec.py:193 in call_func
    `return self.func(req, *args, **kwargs)`

/opt/stack/new/nova/nova/api/openstack/__init__.py:95 in __call__
    `return req.get_response(self.application)`

/usr/local/lib/python3.10/dist-packages/webob/request.py:1313 in send
    `status, headers, app_iter = self.call_application(`

/usr/local/lib/python3.10/dist-packages/webob/request.py:1278 in call_application
    `app_iter = application(self.environ, start_resp...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/883648

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/883648
Committed: https://opendev.org/openstack/neutron/commit/1d0335810d89ede47cf3b54614382cb25d1986ae
Submitter: "Zuul (22348)"
Branch: master

commit 1d0335810d89ede47cf3b54614382cb25d1986ae
Author: yatinkarel <email address hidden>
Date: Fri May 19 14:59:25 2023 +0530

    Disable mysql gather performance in jobs

    We seeing random issue in CI as mentioned
    in the related bug. As per the tests done
    in [1] seems disabling it make the issue
    appear less frequent. Let's try it atleast
    until the root cause is fixed.

    [1] https://review.opendev.org/c/openstack/neutron/+/883282

    Related-Bug: #2015065
    Change-Id: I2738d161d828e8ab0524281d72ed1930e08e194b

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I looked at the stack trace of the blocked thread from https://bugs.launchpad.net/neutron/+bug/2015065/comments/8 (thanks Yatin for collecting the trace!)

Based on https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_bdb/882413/2/check/grenade/bdb653c/job-output.txt the environment uses eventlet 0.33.1 and urllib3 1.26.12.

The first interesting step in the stack trace: /usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py:28 in is_connection_dropped

Which is https://github.com/urllib3/urllib3/blob/a5b29ac1025f9bb30f2c9b756f3b171389c2c039/src/urllib3/connectionpool.py#L272

So urllib3 tries to check whether the existing client connection is still usable or has been disconnected:
https://github.com/urllib3/urllib3/blob/a5b29ac1025f9bb30f2c9b756f3b171389c2c039/src/urllib3/util/connection.py#L28

It calls wait_for_read(sock, timeout=0.0), i.e. it checks whether it can read from the socket with a 0.0 timeout.
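
In other words (paraphrasing the linked urllib3 1.26 code, slightly simplified):

```
# Paraphrase of urllib3 1.26's is_connection_dropped (slightly simplified).
from urllib3.util.wait import wait_for_read

def is_connection_dropped(conn):
    sock = getattr(conn, "sock", False)
    if sock is False:   # platform without a usable sock attribute
        return False
    if sock is None:    # connection already closed
        return True
    # Poll only: "is there anything to read right now?" -- must not block.
    return wait_for_read(sock, timeout=0.0)
```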

https://github.com/urllib3/urllib3/blob/a5b29ac1025f9bb30f2c9b756f3b171389c2c039/src/urllib3/util/wait.py#L84-L85

That 0.0 timeout is passed to Python's select.select:
https://docs.python.org/3.10/library/select.html#select.select
"The optional timeout argument specifies a time-out as a floating point number in seconds. When the timeout argument is omitted the function blocks until at least one file descriptor is ready. A time-out value of zero specifies a poll and never blocks."
So select.select called with a 0.0 timeout should never block.
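
As a quick illustration, with stock CPython this is indeed a pure poll that returns immediately:

```
# No monkey patching: select.select with a 0.0 timeout polls and returns
# at once even though nothing is readable on the socket pair.
import select
import socket

a, b = socket.socketpair()
print(select.select([a], [], [], 0.0))   # -> ([], [], [])
```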

BUT

In our environment the eventlet monkey patching replaces Python's select.select, hence the stack trace points to /usr/local/lib/python3.10/dist-packages/eventlet/green/select.py:80 in select

https://github.com/eventlet/eventlet/blob/88ec603404b2ed25c610dead75d4693c7b3e8072/eventlet/green/select.py#L30-L80C32

Looking at that code, it seems eventlet sets a timer with the timeout value via hub.schedule_call_global. Here I'm getting lost in the eventlet code, but I assume scheduling a timer with a 0.0 timeout in eventlet can be racy, based on the comment in https://github.com/eventlet/eventlet/blob/88ec603404b2ed25c610dead75d4693c7b3e8072/eventlet/green/select.py#L62-L69

So one could argue that what we see is an eventlet bug: select.select with timeout=0.0 should never block, yet it does block in our case.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote (last edit ):

I tried to create a pure reproducer, but the code below does not hang with eventlet 0.33.1 on py3.10:
```
import eventlet

# Monkey patch first so that socket and select imported below are the
# green (eventlet) versions.
eventlet.monkey_patch()

import socket
import select

def main():
    s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s1.connect(("127.0.0.1", 8080))
    # timeout=0.0 should be a pure poll and return immediately.
    print(select.select([s1], [], [], 0.0))

if __name__ == "__main__":
    main()
```
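
To actually run that snippet something must be listening on 127.0.0.1:8080; a throwaway listener like the one below (my addition, not part of the original reproducer) is enough:

```
# Helper: minimal listener so the connect() in the reproducer succeeds.
# Run it in a separate terminal (or thread) before starting the reproducer.
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 8080))
srv.listen(1)
srv.accept()  # blocks, keeping the connection open while the client polls
```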

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote (last edit ):

Forcing a sleep just before https://github.com/eventlet/eventlet/blob/88ec603404b2ed25c610dead75d4693c7b3e8072/eventlet/green/select.py#L80 causes the above reproducer to hang forever. However, I'm not sure that forcing a sleep in that eventlet code validly reproduces the behaviour of a loaded system.

Revision history for this message
yatin (yatinkarel) wrote :

While checking another issue, https://bugs.launchpad.net/grenade/+bug/2020643, which is triggered by this stuck issue, I saw that one request in neutron also timed out after 900 seconds[1].

Tempest triggered a security group delete, but it timed out; the retried request succeeded on the second attempt:

2023-05-22 04:28:48.568 69207 WARNING urllib3.connectionpool [-] Retrying (Retry(total=9, connect=None, read=None, redirect=5, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='173.231.255.73', port=80): Read timed out. (read timeout=60)")': /compute/v2.1/os-security-groups/76a08f39-c1e7-47e5-b52f-c4de53a46146
2023-05-22 04:28:48.707 69207 INFO tempest.lib.common.rest_client [req-1fd0ae50-5f36-40de-9d48-243f74c59335 req-1fd0ae50-5f36-40de-9d48-243f74c59335 ] Request (SecurityGroupRulesTestJSON:tearDownClass): 202 DELETE http://173.231.255.73/compute/v2.1/os-security-groups/76a08f39-c1e7-47e5-b52f-c4de53a46146 60.179s
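
For context, the client-side behaviour in those two lines matches a urllib3 configuration roughly like the sketch below (host, URL and retry counts are illustrative, not taken from the job config):

```
# Sketch: settings that produce a retry after
# "Read timed out. (read timeout=60)" as in the tempest log above.
import urllib3

http = urllib3.PoolManager(
    timeout=urllib3.Timeout(connect=10.0, read=60.0),  # 60s read timeout
    retries=urllib3.Retry(total=10, redirect=5),       # retry on ReadTimeoutError
)
# resp = http.request(
#     "DELETE",
#     "http://<controller>/compute/v2.1/os-security-groups/<sg-uuid>")
```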

The original request is stuck in nova/neutron and times out after 900s (client_socket_timeout).

nova:-
May 22 04:27:48.532171 np0034092361 <email address hidden>[51094]: DEBUG nova.api.openstack.wsgi [None req-fc94b238-88f2-4e4b-8a25-9ed456ac652a tempest-SecurityGroupRulesTestJSON-54591405 tempest-SecurityGroupRulesTestJSON-54591405-project-member] Calling method '<function Controller.__getattribute__.<locals>.version_select at 0x7f9129c5dea0>' {{(pid=51094) _process_stack /opt/stack/old/nova/nova/api/openstack/wsgi.py:513}}
May 22 04:42:48.780841 np0034092361 <email address hidden>[51094]: DEBUG neutronclient.v2_0.client [None req-fc94b238-88f2-4e4b-8a25-9ed456ac652a tempest-SecurityGroupRulesTestJSON-54591405 tempest-SecurityGroupRulesTestJSON-54591405-project-member] Error message: {"error": {"code": 401, "title": "Unauthorized", "message": "The request you have made requires authentication."}} {{(pid=51094) _handle_fault_response /usr/local/lib/python3.10/dist-packages/neutronclient/v2_0/client.py:262}}
May 22 04:42:48.781975 np0034092361 <email address hidden>[51094]: INFO nova.api.openstack.requestlog [None req-fc94b238-88f2-4e4b-8a25-9ed456ac652a tempest-SecurityGroupRulesTestJSON-54591405 tempest-SecurityGroupRulesTestJSON-54591405-project-member] 173.231.255.73 "DELETE /compute/v2.1/os-security-groups/76a08f39-c1e7-47e5-b52f-c4de53a46146" status: 500 len: 0 microversion: 2.1 time: 900.252509
May 22 04:42:48.792104 np0034092361 <email address hidden>[51094]: Mon May 22 04:42:48 2023 - SIGPIPE: writing to a closed pipe/socket/fd (probably the client disconnected) on request /compute/v2.1/os-security-groups/76a08f39-c1e7-47e5-b52f-c4de53a46146 (ip 173.231.255.73) !!!
May 22 04:42:48.792104 np0034092361 <email address hidden>[51094]: Mon May 22 04:42:48 2023 - uwsgi_response_writev_headers_and_body_do(): Broken pipe [core/writer.c line 306] during DELETE /compute/v2.1/os-security-groups/76a08f39-c1e7-47e5-b52f-c4de53a46146 (173.231.255.73)
May 22 04:42:48.792104 np0034092361 <email address hidden>[51094]: CRITICAL nova [None req-fc94b238-88f2-4e4b-8a25-9ed456ac652a tempest-SecurityGroupRulesTestJSON-54591405 tempest-SecurityGroupRulesTestJSON-54591405-project-member] Unhandled error: OSError: write error...

