[tempest] "test_security_group_rules_create" unstable in "neutron-ovs-grenade-dvr-multinode"
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Confirmed | Critical | Lajos Katona (lajos-katona) |
Bence Romsics (bence-romsics) wrote: #1
tags: added: gate-failure
Changed in neutron:
status: New → Confirmed
importance: Undecided → Critical
Lajos Katona (lajos-katona) wrote: #2
Lajos Katona (lajos-katona) wrote: #3
I checked a few occurrences, and one interesting thing is that these are (under tempest.
In the logs there is no sign of an issue with an HTTP timeout (i.e.: https:/
Changed in neutron:
assignee: nobody → Lajos Katona (lajos-katona)
Lajos Katona (lajos-katona) wrote: #4
After checking some examples (another filter in opensearch:
The issue with urllib3.
or with server API: https:/
but the problem appears mostly with create_
Lajos Katona (lajos-katona) wrote: #5
Similar failures also appear in tempest jobs (please check the opensearch link in comment #4).
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote: #6
The same error is happening with "tempest.
yatin (yatinkarel) wrote: #7
Took a look at this; below are the findings:-
- The issue is seen across multiple projects and jobs and is a random one. I found a related bug in nova created long back [1], but that too doesn't have an RCA.
- Both TLS and non-TLS jobs are impacted across different jobs/project/
- Since the issue is seen mostly in stable/zed+, I suspected dbcounter is related; to rule it out I tested disabling it in https:/
- I checked a couple of job logs and saw two categories (stuck [3] vs. taking longer [4] than 60 seconds, or 180 seconds in the rally job). The stuck ones are seen only in nova while the non-stuck ones appear across projects, so these can be considered a different issue and investigated separately; this bug can focus on the stuck case.
- In some cases oslo_messaging disconnections were seen, so I am not sure if that's the issue and if heartbeat_
- Next I would like to collect a GMR if that can give some hint for the issue.
[1] https:/
[2] HTTPSConnectionPool:
By branch:
master 93.4%
stable/zed 2.5%
stable/2023.1 1.6%
stable/wallaby 0.8%
stable/victoria 0.8%
By job:
nova-ceph-
glance-
tempest-ipv6-only 4.9%
cinder-
nova-next 2.5%
By project:
openstack/nova 49.2%
openstack/glance 15.6%
openstack/
openstack/tempest 6.6%
openstack/neutron 5.7%
HTTPConnectionPool:
By job:
nova-grenade-
grenade 12.0%
neutron-
grenade-
neutron-
By branch:
master 72.0%
stable/2023.1 24.0%
stable/zed 4.0%
By project:
openstack/nova 56.0%
openstack/neutron 24.0%
openstack/cinder 8.0%
openstack/devstack 8.0%
openstack/tempest 4.0%
May 09 11:41:36.941261 np0033988595 <email address hidden>[132073]: DEBUG nova.api.
May 09 11:41:37.006089 np0033988595 neutron-
Only the security group check request is seen on the neutron side, no rule create request; the nova worker is stuck.
yatin (yatinkarel) wrote: #8
<<< - Next I would like to collect a GMR if that can give some hint for the issue.
OK, I was able to reproduce and collect it in [1][2] (a sketch of the mechanism follows the traceback below).
I also did multiple runs with dbcounter disabled [3], but the issue didn't reproduce in the test patch. Since the issue is random, I am not sure if that's just a coincidence or if dbcounter makes the issue appear more frequently. We can disable it in some jobs and see if it helps in reducing the occurrences, as disabling it won't harm anything.
Stuck Thread Traceback:-
/opt/stack/
`return app(environ, start_response)`
/opt/stack/
`return app(environ, start_response)`
/usr/local/
`resp = self.call_func(req, *args, **kw)`
/usr/local/
`return self.func(req, *args, **kwargs)`
/usr/local/
`response = req.get_
/usr/local/
`status, headers, app_iter = self.call_
/usr/local/
`app_iter = application(
/usr/local/
`resp = self.call_func(req, *args, **kw)`
/usr/local/
`return self.func(req, *args, **kwargs)`
/usr/local/
`response = req.get_
/usr/local/
`status, headers, app_iter = self.call_
/usr/local/
`app_iter = application(
/usr/local/
`resp = self.call_func(req, *args, **kw)`
/usr/local/
`return self.func(req, *args, **kwargs)`
/usr/local/
`response = req.get_
/usr/local/
`status, headers, app_iter = self.call_
/usr/local/
`app_iter = application(
/usr/local/
`resp = self.call_func(req, *args, **kw)`
/usr/local/
`return self.func(req, *args, **kwargs)`
/opt/stack/
`return req.get_
/usr/local/
`status, headers, app_iter = self.call_
/usr/local/
`app_iter = application(
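The traceback above was collected via a Guru Meditation Report (comment #8). For readers unfamiliar with the mechanism, here is a rough stdlib-only sketch of the same idea, a signal handler that dumps every thread's current stack on demand; this is only an illustration, OpenStack services use oslo.reports for this, and none of the names below come from the bug report or the linked patches:
```
import signal
import sys
import threading
import traceback


def _dump_all_stacks(signum, frame):
    # Print a traceback for every thread in the process to stderr.
    names = {t.ident: t.name for t in threading.enumerate()}
    for thread_id, stack in sys._current_frames().items():
        print("Thread %s (%s)" % (names.get(thread_id, "?"), thread_id), file=sys.stderr)
        traceback.print_stack(stack, file=sys.stderr)


def install_stack_dumper():
    # After this, `kill -USR2 <pid>` dumps all thread stacks, which is how a
    # stuck request handler like the one above can be located.
    signal.signal(signal.SIGUSR2, _dump_all_stacks)
```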
OpenStack Infra (hudson-openstack) wrote: Related fix proposed to neutron (master) #9
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote: Related fix merged to neutron (master) #10
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit 1d0335810d89ede
Author: yatinkarel <email address hidden>
Date: Fri May 19 14:59:25 2023 +0530
Disable mysql gather performance in jobs
We are seeing a random issue in CI as mentioned
in the related bug. As per the tests done
in [1], disabling it seems to make the issue
appear less frequent. Let's try it at least
until the root cause is fixed.
[1] https:/
Related-Bug: #2015065
Change-Id: I2738d161d828e8
Balazs Gibizer (balazs-gibizer) wrote: #11
I looked at the stack trace of the blocked thread from https:/
Based on https:/
The first interesting step at the stacktrace: /usr/local/
So urllib3 tries to check whether the existing client connection is still usable or got disconnected:
https:/
It calls wait_for_read(sock, timeout=0.0)
So it checks whether it can read from the socket with a 0.0 timeout.
That 0.0 timeout is passed to Python's select.select:
https:/
"The optional timeout argument specifies a time-out as a floating point number in seconds. When the timeout argument is omitted the function blocks until at least one file descriptor is ready. A time-out value of zero specifies a poll and never blocks."
So that select.select call with 0.0 should never block.
BUT
In our env the eventlet monkey patching replaces Python's select.select, hence the stack trace points to /usr/local/
Looking at that code, it seems eventlet sets a timer with the timeout value via hub.schedule_
So one could argue that what we see is an eventlet bug, as select.select with timeout=0.0 should never block, but it does block in our case.
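To illustrate the substitution described above, here is a minimal sketch (an illustration assuming eventlet's standard monkey patching of the select module; nothing in it is taken from the bug report): after monkey_patch(), the select.select that urllib3's wait_for_read() ends up calling is eventlet's green implementation, which schedules the wait through the hub's timers instead of calling the libc select() directly.
```
import eventlet
eventlet.monkey_patch()

# Import *after* monkey patching so we observe the patched module.
import select
from eventlet.green import select as green_select

# The stdlib name now resolves to the green implementation, so even a
# "non-blocking" select.select(..., 0.0) goes through the eventlet hub.
print(select.select is green_select.select)  # expected: True
```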
Balazs Gibizer (balazs-gibizer) wrote (last edit): #12
I tried to create a pure reproducer, but the code below does not hang with eventlet 0.33.1 on py3.10:
```
import eventlet
eventlet.monkey_patch()  # assumed: the original call was truncated

import socket
import select


def main():
    # The socket setup was truncated in the original snippet; a plain
    # non-blocking TCP socket is assumed here, just to have an fd to select on.
    s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s1.setblocking(False)
    # With timeout=0.0 this should poll and return immediately, even under
    # eventlet monkey patching.
    print(select.select([s1], [], [], 0.0))


if __name__ == "__main__":
    main()
```
Balazs Gibizer (balazs-gibizer) wrote (last edit): #13
Forcing a sleep just before https:/
yatin (yatinkarel) wrote: #14
While checking another issue https:/
Tempest triggered a security group delete, but it timed out; the retry succeeded on the second attempt (a sketch of this timeout-and-retry pattern follows the log excerpts below):-
2023-05-22 04:28:48.568 69207 WARNING urllib3.
2023-05-22 04:28:48.707 69207 INFO tempest.
The original request is stuck in nova/neutron and times out after 900s (client_
nova:-
May 22 04:27:48.532171 np0034092361 <email address hidden>[51094]: DEBUG nova.api.
May 22 04:42:48.780841 np0034092361 <email address hidden>[51094]: DEBUG neutronclient.
May 22 04:42:48.781975 np0034092361 <email address hidden>[51094]: INFO nova.api.
May 22 04:42:48.792104 np0034092361 <email address hidden>[51094]: Mon May 22 04:42:48 2023 - SIGPIPE: writing to a closed pipe/socket/fd (probably the client disconnected) on request /compute/
May 22 04:42:48.792104 np0034092361 <email address hidden>[51094]: Mon May 22 04:42:48 2023 - uwsgi_response_
May 22 04:42:48.792104 np0034092361 <email address hidden>[51094]: CRITICAL nova [None req-fc94b238-
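For readers less familiar with the client side of this, the timeout-and-retry behaviour described above can be sketched roughly with urllib3; the function name, URL argument, timeout and retry values below are illustrative placeholders, not the actual tempest configuration:
```
import urllib3


def delete_with_retry(url):
    # Roughly mirrors the tempest client behaviour seen in the log: a 60 s read
    # timeout plus one automatic retry, so if no response arrives in time a
    # second identical DELETE is sent while the first request may still be
    # executing on the server side.
    http = urllib3.PoolManager(
        timeout=urllib3.Timeout(connect=10.0, read=60.0),
        retries=urllib3.Retry(total=1, read=1),
    )
    return http.request("DELETE", url)
```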
Frequencies:

11 occurrences out of the last 100 job runs:
logsearch log --project openstack/neutron --job neutron-ovs-grenade-dvr-multinode --limit 100 --file controller/logs/grenade.sh_log.txt 'test_security_group_rules_create .* FAILED'

15 occurrences out of the last 20 failed job runs:
logsearch log --project openstack/neutron --job neutron-ovs-grenade-dvr-multinode --limit 20 --result FAILURE --file controller/logs/grenade.sh_log.txt 'test_security_group_rules_create .* FAILED'