periodic featureset 35 wallaby times out running tempest (2 hours)
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
tripleo | Fix Released | Critical | Unassigned | | |
Bug Description
At [1][2][3][4] the periodic-
2021-08-05 00:28:05.041522 | primary | TASK [os_tempest : Execute tempest tests] *******
2021-08-05 00:28:05.041528 | primary | Thursday 05 August 2021 00:28:05 +0000 (0:00:00.048) 1:50:45.136 *******
2021-08-05 02:22:27.036537 | RUN END RESULT_TIMED_OUT: [untrusted : opendev.
I can't quickly see anything useful in the tempest run logs [5], and tempestconf looks to have completed OK [6].
[1] https:/
[2] https:/
[3] https:/
[4] https:/
[5] https:/
[6] https:/
Marios Andreou (marios-b) wrote : | #1 |
Marios Andreou (marios-b) wrote : | #2 |
I can't see any major difference in the nodes between a good log [1] and a timed-out one [2], except that the timed-out one has less free memory (but the same total):
[1] * MemTotal: 8150828 kB
MemFree: 1240728 kB
[2] * MemTotal: 8150828 kB
MemFree: 293704 kB
Similarly, the cpuinfo log looks the same: good at [3], bad at [4].
In the errors log I see an issue reaching rabbit on controller-1, with retries. I don't know if that is directly related:
2021-08-05 18:18:42.101 ERROR /var/log/
[1] https:/
[2] https:/
[3] https:/
[4] https:/
[5] https:/
Marios Andreou (marios-b) wrote : | #3 |
It also seems to be inconsistent :/
Among the TIMED_OUT results we also have a couple of successes from Saturday 7th:
* https:/
* 3 hrs 49 mins 3 secs 2021-08-07 22:11:54 SUCCESS
* 3 hrs 48 mins 3 secs 2021-08-07 16:36:46 SUCCESS
but they are taking close to 4 hours so pretty close to the timeout which is 4 hours (inherited from https:/
So why is it taking almost 2 hours to run tempest? That seems excessive;
it used to take closer to 1 hour.
Marios Andreou (marios-b) wrote : | #4 |
Based on comment #1 and the attached screenshot, this started around 3rd August. I compared two 'good runs': one from 2/3 August that took close to 3 hours [1], and a recent one from yesterday, 9th August [2].
From [1]
3 hrs 9 mins 1 sec 2021-08-02 22:22:27
Ran: 1416 tests in 3475.6904 sec.
- Passed: 1295
- Skipped: 121
- Expected Fail: 0
- Unexpected Success: 0
- Failed: 0
Sum of execute time for each test: 8074.3680 sec.
- Worker 0 (428 tests) => 0:57:48.876694
- Worker 1 (334 tests) => 0:50:33.366087
- Worker 2 (367 tests) => 0:39:59.810439
- Worker 3 (287 tests) => 0:49:15.021685
From [2]
Ran: 1416 tests in 7587.3411 sec.
- Passed: 1295
- Skipped: 121
- Expected Fail: 0
- Unexpected Success: 0
- Failed: 0
Sum of execute time for each test: 16297.8125 sec.
- Worker 0 (362 tests) => 1:59:08.312166
- Worker 1 (356 tests) => 1:08:44.041797
- Worker 2 (426 tests) => 2:06:16.196664
- Worker 3 (272 tests) => 1:21:51.781254
As can be seen in [2], the same tests take about twice as long to complete. You can see more about the timings in the stackviz logs [3] ('good', ~1 hour tempest run) and [4] ('bad', ~2 hours tempest run).
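To put a number on "twice as long", here is a rough Python sketch that compares the two tempest summaries quoted above. The figures are copied from the [1] and [2] totals in this comment; the dictionary layout and the helper names (`parse_duration`, `slowdown`) are only illustrative and not part of any tempest or stestr tooling.

```python
from datetime import timedelta


def parse_duration(text: str) -> float:
    """Convert an 'H:MM:SS.ffffff' worker duration into seconds."""
    hours, minutes, seconds = text.split(":")
    delta = timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))
    return delta.total_seconds()


# Values copied from the "From [1]" (good) and "From [2]" (bad) summaries.
GOOD = {
    "wall": 3475.6904,
    "sum": 8074.3680,
    "workers": ["0:57:48.876694", "0:50:33.366087", "0:39:59.810439", "0:49:15.021685"],
}
BAD = {
    "wall": 7587.3411,
    "sum": 16297.8125,
    "workers": ["1:59:08.312166", "1:08:44.041797", "2:06:16.196664", "1:21:51.781254"],
}


def slowdown(good: dict, bad: dict) -> dict:
    """Return bad/good ratios for wall time, summed test time and slowest worker."""
    return {
        "wall": bad["wall"] / good["wall"],
        "sum": bad["sum"] / good["sum"],
        "slowest_worker": max(map(parse_duration, bad["workers"]))
        / max(map(parse_duration, good["workers"])),
    }


if __name__ == "__main__":
    for name, ratio in slowdown(GOOD, BAD).items():
        print(f"{name}: {ratio:.2f}x")
```

With these inputs the wall-clock time and the slowest worker are both roughly 2.2x slower, and the summed per-test time is roughly 2.0x slower, so the slowdown is spread across the run rather than caused by one unlucky worker.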
Sagi (Sergey) Shnaidman (sshnaidm) wrote : | #5 |
Ronelle Landy (rlandy) wrote : | #6 |
From https:/
Also noticed a difference in the openvswitch versions from August 3rd:
network-
openvswitch-
openvswitch2.
-------
network-
openvswitch-
openvswitch2.
https:/
matching that new build.
Maybe we downgrade openvswitch and see if we do better?
chandan kumar (chkumar246) wrote : | #7 |
Since openvswitch2.
So the OVS update might not be the culprit.
Martin Kopec (mkopec) wrote : | #8 |
Many tests just take longer, e.g.:
test_dhcp6_
116.8 seconds -> 206.8 seconds
test_dualnet_
134.2 seconds -> 245.8 seconds
test_dualnet_
161.2 seconds -> 254.8 seconds
It seems like all requests, especially GET ones, are taking much longer. Comparison of requests within test_dualnet_
$ cut -d" " -f12- good_r
200 POST https://[2001:db8:
201 POST https://[2001:db8:
201 POST https://[2001:db8:
201 POST https://[2001:db8:
201 POST https://[2001:db8:
201 POST https://[2001:db8:
201 POST https://[2001:db8:
201 POST https://[2001:db8:
201 POST https://[2001:db8:
201 POST https://[2001:db8:
200 GET https://[2001:db8:
200 GET https://[2001:db8:
200 GET https://[2001:db8:
201 POST https://[2001:db8:
201 POST https://[2001:db8:
200 GET https://[2001:db8:
200 GET https://[2001:db8:
200 GET https://[2001:db8:
201 POST https://[2001:db8:
200 GET https://[2001:db8:
200 GET https://[2001:db8:
200 GET https://[2001:db8:
200 GET https://[2001:db8:
200 GET https://[2001:db8:
200 GET https://[2001:db8:
201 POST https://[2001:db8:
201 POST https://[2001:db8:
202 POST https://[2001:db8:
200 GET https://[2001:db8:
200 GET https://[2001:db8:
yatin (yatinkarel) wrote : | #9 |
So it's not just wallaby, xena is also impacted. Since https:/
For example:-
PASSING JOB:-
$ grep -nr "GET /v2.1/os-
0.273092
0.259492
0.238887
$ grep -nr "DELETE /v2.1/os-
0.368750
0.063184
0.254873
0.391544
0.331399
0.244026
0.425756
0.453150
0.236778
0.270068
0.586434
0.037038
0.045581
0.648366
0.236613
0.221590
FAILING JOB:-
$ grep -nr "GET /v2.1/os-
3.625276
2.823670
4.931641
2.061295
$ grep -nr "DELETE /v2.1/os-
3.348766
2.595817
2.260266
2.017600
1.669031
3.165000
3.684208
2.686081
1.148234
2.810574
1.752639
2.078275
2.003472
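For a quick summary of the numbers above, the following Python sketch averages the values printed for the passing and failing jobs. It assumes the values are per-request durations in seconds; the grep commands are truncated here, so that reading is a guess. The variable names are illustrative only.

```python
from statistics import mean

# Values copied from the grep output above (assumed to be seconds per request).
PASSING_GET = [0.273092, 0.259492, 0.238887]
PASSING_DELETE = [
    0.368750, 0.063184, 0.254873, 0.391544, 0.331399, 0.244026, 0.425756, 0.453150,
    0.236778, 0.270068, 0.586434, 0.037038, 0.045581, 0.648366, 0.236613, 0.221590,
]
FAILING_GET = [3.625276, 2.823670, 4.931641, 2.061295]
FAILING_DELETE = [
    3.348766, 2.595817, 2.260266, 2.017600, 1.669031, 3.165000, 3.684208,
    2.686081, 1.148234, 2.810574, 1.752639, 2.078275, 2.003472,
]


def ratio(failing: list, passing: list) -> float:
    """Return how many times larger the failing mean is than the passing mean."""
    return mean(failing) / mean(passing)


if __name__ == "__main__":
    print(f"GET:    {mean(PASSING_GET):.3f}s -> {mean(FAILING_GET):.3f}s "
          f"({ratio(FAILING_GET, PASSING_GET):.1f}x)")
    print(f"DELETE: {mean(PASSING_DELETE):.3f}s -> {mean(FAILING_DELETE):.3f}s "
          f"({ratio(FAILING_DELETE, PASSING_DELETE):.1f}x)")
```

On these samples the failing job is roughly an order of magnitude slower per request, far more than the ~2x seen in the overall tempest run time, presumably because only part of each test is spent in these API calls.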
Ronelle Landy (rlandy) wrote : | #10 |
https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-quickstart (master) | #11 |
Fix proposed to branch: master
Review: https:/
Changed in tripleo: | |
status: | Triaged → In Progress |
Grzegorz Grasza (xek) wrote : | #12 |
I tested a different way of disabling FQDNs in memcache server list configuration here:
https:/
The first successful run is without any change and the second one switches to IPs without reverting the large patch.
The second run finished faster by 22 minutes.
Slawek Kaplonski (slaweq) wrote : | #13 |
I looked at logs from the job https:/
In nsswitch.conf file there is:
hosts: files dns myhostname
So name resolution should first be done using the /etc/hosts file, and in this file there are entries for controllers like overcloud-
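One way to check whether lookups are actually slow on a node is to time `getaddrinfo` for a name versus an IP literal. This is only a sketch: `mean_resolution_time` is a hypothetical helper, `localhost` and `127.0.0.1` are stand-ins for the real controller FQDN and address, and 11211 is just the default memcached port. On Linux, Python's `socket.getaddrinfo` goes through the system resolver, so it should follow the `hosts:` order from nsswitch.conf.

```python
import socket
import time


def mean_resolution_time(host: str, port: int = 11211, repeats: int = 5) -> float:
    """Return the mean time in seconds of one getaddrinfo() call for host."""
    start = time.perf_counter()
    for _ in range(repeats):
        # Resolves names via the system resolver; IP literals need no lookup.
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return (time.perf_counter() - start) / repeats


if __name__ == "__main__":
    # Replace these stand-ins with the controller FQDN and its IP address.
    for target in ("localhost", "127.0.0.1"):
        print(f"{target}: {mean_resolution_time(target) * 1000:.3f} ms per lookup")
```

If the FQDN takes noticeably longer than the IP literal even though it is listed in /etc/hosts, that would point at resolution rather than at memcached itself.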
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master) | #14 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit 1ce490716d3ff0a
Author: Grzegorz Grasza <email address hidden>
Date: Wed Aug 25 09:20:06 2021 +0200
Environment for switching to using IPs for memcached
Related-Bug: #1939023
Change-Id: Iaadee6be4e1eaf
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/wallaby) | #15 |
Related fix proposed to branch: stable/wallaby
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/victoria) | #16 |
Related fix proposed to branch: stable/victoria
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/ussuri) | #17 |
Related fix proposed to branch: stable/ussuri
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/train) | #18 |
Related fix proposed to branch: stable/train
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/wallaby) | #19 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/wallaby
commit 2456e5930119030
Author: Grzegorz Grasza <email address hidden>
Date: Wed Aug 25 09:20:06 2021 +0200
Environment for switching to using IPs for memcached
Related-Bug: #1939023
Change-Id: Iaadee6be4e1eaf
(cherry picked from commit 1ce490716d3ff0a
tags: | added: in-stable-wallaby |
tags: | added: in-stable-victoria |
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/victoria) | #20 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/victoria
commit 4cbc970d15066aa
Author: Grzegorz Grasza <email address hidden>
Date: Wed Aug 25 09:20:06 2021 +0200
Environment for switching to using IPs for memcached
Related-Bug: #1939023
Change-Id: Iaadee6be4e1eaf
(cherry picked from commit 1ce490716d3ff0a
Slawek Kaplonski (slaweq) wrote : | #21 |
Today I checked logs from the job https:/
I found that there are some tests which run for a very long time, e.g. tempest.
I compared this with u/s job and the same test took about 18 seconds.
Now, I checked in tempest logs, what took so long in that test and here is what I found:
zgrep test_associate_
2021-09-07 12:11:47.017 321761 INFO tempest.
2021-09-07 12:11:47.019 321761 INFO tempest.
2021-09-07 12:11:53.102 321761 INFO tempest.
2021-09-07 12:12:01.016 321761 INFO tempest.
2021-09-07 12:12:05.347 321761 INFO tempest.
2021-09-07 12:12:09.281 321761 INFO tempest.lib...
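To see where the time goes between those requests, the gaps between consecutive log timestamps can be computed with a short Python sketch. The timestamps are copied from the log lines above; the rest of each line is truncated here, so only the time column is used. `gaps_in_seconds` is an illustrative helper name.

```python
from datetime import datetime

# Timestamps copied from the tempest log lines above.
TIMESTAMPS = [
    "2021-09-07 12:11:47.017",
    "2021-09-07 12:11:47.019",
    "2021-09-07 12:11:53.102",
    "2021-09-07 12:12:01.016",
    "2021-09-07 12:12:05.347",
    "2021-09-07 12:12:09.281",
]


def gaps_in_seconds(stamps: list) -> list:
    """Return the delay in seconds between each pair of consecutive timestamps."""
    parsed = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f") for s in stamps]
    return [(later - earlier).total_seconds() for earlier, later in zip(parsed, parsed[1:])]


if __name__ == "__main__":
    for stamp, gap in zip(TIMESTAMPS[1:], gaps_in_seconds(TIMESTAMPS)):
        print(f"{stamp}  +{gap:.3f}s")
```

Apart from the first pair, each step in this excerpt takes between about 4 and 8 seconds.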
Bogdan Dobrelya (bogdando) wrote : | #22 |
@Slawek, did your testing show different results to what was brought in https:/
Bogdan Dobrelya (bogdando) wrote : | #23 |
I can't see the extra env file to switch memcached to use IPs there https:/
could you please adjust the job and retry it with environments/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/train) | #24 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/train
commit 3d637e176178e94
Author: Grzegorz Grasza <email address hidden>
Date: Wed Aug 25 09:20:06 2021 +0200
Environment for switching to using IPs for memcached
Related-Bug: #1939023
Change-Id: Iaadee6be4e1eaf
(cherry picked from commit 1ce490716d3ff0a
tags: | added: in-stable-train |
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/ussuri) | #25 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/ussuri
commit 750877f25e27b0f
Author: Grzegorz Grasza <email address hidden>
Date: Wed Aug 25 09:20:06 2021 +0200
Environment for switching to using IPs for memcached
Related-Bug: #1939023
Change-Id: Iaadee6be4e1eaf
(cherry picked from commit 1ce490716d3ff0a
tags: | added: in-stable-ussuri |
yatin (yatinkarel) wrote : | #26 |
<< could you please adjust the job and retry it with environments/
https:/
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-quickstart (master) | #27 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit 4b2454350289fe2
Author: Grzegorz Grasza <email address hidden>
Date: Wed Aug 25 09:23:52 2021 +0200
Use IPs instead of FQDNs in memcached with IPv6
Change-Id: I34c6a4d9e64e13
Resolves-Bug: #1939023
Depends-On: https:/
Changed in tripleo: | |
status: | In Progress → Fix Released |
Attila Fazekas (afazekas) wrote : | #28 |
Probably you want to switch to a different memcached library:
https:/
The current one does not handle the case where a name resolves to an ipv6 address,
but works with ipv6: addresses when configured by ip.
Slawek Kaplonski (slaweq) wrote : | #29 |
I tried to reproduce that issue today but wasn't able to, so I couldn't investigate it. When I ran it on a test patch, tempest finished for me in about 4300 seconds:
======
Totals
======
Ran: 1425 tests in 4297.8985 sec.
- Passed: 1303
- Skipped: 121
- Expected Fail: 0
- Unexpected Success: 0
- Failed: 1
Sum of execute time for each test: 8342.5592 sec.
I also checked builds history https:/
Next I compared time of the test execution in the fast (https:/
Lee Yarwood (lyarwood) wrote : | #30 |
As discussed downstream this appears to be the result of the environments/
Ultimately we either need to remove this environment *or* if that's not possible, increase timeouts for the individual test and overall test run.
Bogdan Dobrelya (bogdando) wrote : | #31 |
Related to this issue, we should switch the environments/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to puppet-tripleo (master) | #32 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master) | #33 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (master) | #34 |
Change abandoned by "Bogdan Dobrelya <email address hidden>" on branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master) | #35 |
Change abandoned by "Bogdan Dobrelya <email address hidden>" on branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master) | #36 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart (master) | #37 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (stable/wallaby) | #38 |
Change abandoned by "Takashi Kajinami <email address hidden>" on branch: stable/wallaby
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (master) | #39 |
Change abandoned by "chandan kumar <email address hidden>" on branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-ci (master) | #40 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-ci (master) | #41 |
Change abandoned by "Bogdan Dobrelya <email address hidden>" on branch: master
Review: https:/
Reason: I don't think this is needed, let's tweak on a fs/job basis
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart-extras (master) | #42 |
Change abandoned by "Bogdan Dobrelya <email address hidden>" on branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-ci (master) | #43 |
Change abandoned by "Bogdan Dobrelya <email address hidden>" on branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart (master) | #44 |
Change abandoned by "Bogdan Dobrelya <email address hidden>" on branch: master
Review: https:/
As can be seen in the attached screenshot from [1], the successful runs on this job usually take closer to ~3 hours. The timeouts started on 3rd August.
We *are* running a lot of tempest tests here [2], but that list of tests has not been altered recently and used to complete well within the timeout.
Comparing to a green run at [3], the tempest tests usually take ~1 hour to run:
* 2021-08-03 00:18:37.735157 | primary | TASK [os_tempest : Execute tempest tests] *******
2021-08-03 00:18:37.735168 | primary | Tuesday 03 August 2021 00:18:37 +0000 (0:00:00.041) 1:39:33.632 ********
2021-08-03 01:16:39.798815 | primary | ok: [undercloud]
but as per this bug they are now timing out after 2 hours.
[1] https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-wallaby
[2] https://github.com/openstack/tripleo-quickstart/blob/444fcff6b17b77778382cd0be5a45f7b85a7b7ca/config/general_config/featureset035.yml#L175-L179
[3] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-wallaby/d385cc6/job-output.txt