tempest jobs and tests are more frequent timeout since 2023.1 release

Bug #2004780 reported by Ghanshyam Mann
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
tempest | Confirmed | Critical | Ghanshyam Mann |

Bug Description

In the 2023.1 cycle (I mention the release cycle because we moved our testing from Ubuntu 20.04 to 22.04 in this cycle), we are seeing more test timeouts, which sometimes end up as job timeouts, in various jobs. A few examples are

- multinode job - https://zuul.openstack.org/build/084729c5a1fe46e091a17a50441e99d5/log/job-output.txt

- stable/yoga job - https://zuul.opendev.org/t/openstack/build/9b79f54d3f6c4e05ae6619fdac7ad95a

One suspect test that is taking more time is test_minimum_basic_instance_hard_reboot_after_vol_snap_deletion

taking 258.358054s - https://zuul.openstack.org/build/084729c5a1fe46e091a17a50441e99d5/log/job-output.txt#34173

taking 187.411266s - https://zuul.opendev.org/t/openstack/build/9b79f54d3f6c4e05ae6619fdac7ad95a/log/job-output.txt#25738

But there might be other tests causing the timeout.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tempest (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tempest/+/872691

Changed in tempest:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Ghanshyam Mann (ghanshyammann)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tempest/+/873055

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tempest (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tempest/+/873163

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tempest/+/873441

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tempest/+/873442

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tempest (master)

Reviewed: https://review.opendev.org/c/openstack/tempest/+/872691
Committed: https://opendev.org/openstack/tempest/commit/b3da2e19cd04c7cc463cc16e058e506d49a3ba3d
Submitter: "Zuul (22348)"
Branch: master

commit b3da2e19cd04c7cc463cc16e058e506d49a3ba3d
Author: Ghanshyam Mann <email address hidden>
Date: Fri Feb 3 13:20:48 2023 -0600

    Mark test_minimum_basic_instance_hard_reboot_after_vol_snap_deletion as slow test

    We are seeing more test timeouts, which sometimes end up as
    job timeouts, in various jobs. A few examples are

    - multinode job
      - https://zuul.openstack.org/build/084729c5a1fe46e091a17a50441e99d5/log/job-output.txt

    - stable/yoga job
      - https://zuul.opendev.org/t/openstack/build/9b79f54d3f6c4e05ae6619fdac7ad95a

    Two suspect tests that are taking more time are
    test_minimum_basic_instance_hard_reboot_after_vol_snap_deletion
    - taking 258.358054s
      - https://zuul.openstack.org/build/084729c5a1fe46e091a17a50441e99d5/log/job-output.txt#34173
    - taking 187.411266s
      - https://zuul.opendev.org/t/openstack/build/9b79f54d3f6c4e05ae6619fdac7ad95a/log/job-output.txt#25738

    test_minimum_basic_scenario
    - taking 309.109043s
      - https://zuul.opendev.org/t/openstack/build/d068cb494d234fe7b79dc5ae6fd6ae69/log/job-output.txt#24052

    Marking these tests as slow tests; we will monitor whether there
    are other slow tests we can find.

    Related-Bug: #2004780
    Change-Id: I0aff3507b3bf3498ab0ecd548bb57cdcd97ec11a
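
As a rough illustration of what marking a test as slow achieves, here is a minimal sketch. The `attr` decorator and `select_fast` helper below are hypothetical stand-ins written for this example, not tempest's real implementation, and `test_quick_example` is an invented untagged test. The idea is only that a tag recorded on a test lets a job leave that test out of its run.

```python
def attr(**kwargs):
    """Hypothetical stand-in for a test-attribute decorator.

    Records the value(s) passed as ``type`` in a ``_tags`` set on the function.
    """
    def decorator(func):
        tags = set(getattr(func, "_tags", set()))
        value = kwargs.get("type")
        if isinstance(value, str):
            tags.add(value)
        elif value:
            tags.update(value)
        func._tags = tags
        return func
    return decorator


@attr(type="slow")
def test_minimum_basic_instance_hard_reboot_after_vol_snap_deletion():
    pass


@attr(type="slow")
def test_minimum_basic_scenario():
    pass


def test_quick_example():  # hypothetical test with no tags
    pass


def select_fast(tests, exclude="slow"):
    """Return the names of tests that do not carry the excluded tag."""
    return [t.__name__ for t in tests if exclude not in getattr(t, "_tags", set())]
```

With this stand-in, a job that excludes the `slow` tag would run only the untagged test, while a dedicated slow job could select the other two.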

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tempest (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tempest/+/873472

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tempest (master)

Reviewed: https://review.opendev.org/c/openstack/tempest/+/873163
Committed: https://opendev.org/openstack/tempest/commit/a9bad0051255327e0a0456a0d46c34f1a6ed4c79
Submitter: "Zuul (22348)"
Branch: master

commit a9bad0051255327e0a0456a0d46c34f1a6ed4c79
Author: Ghanshyam Mann <email address hidden>
Date: Wed Feb 8 14:13:48 2023 -0600

    Move a few jobs to periodic

    We have a few jobs that run in the gate even though they are
    non-voting and also run in periodic. They do not need to run on
    every change; running and checking them in the periodic run
    is enough coverage. Below are the jobs moving to periodic:

    * tempest-full-py3-ipv6
    We run the tempest-ipv6-only job as voting in the gate to
    cover the ipv6 run, and the tempest-full-py3-ipv6 job can
    run periodically to test the full tempest on ipv6.

    * tempest-full-centos-9-stream
    We already discussed and agreed in the TC that centos stream
    testing is best effort and can be periodic or non-voting.

    * tempest-full-test-account-no-admin-py3
    Checking in periodic, and not on every change, whether tempest
    can be run without admin is enough.

    * tempest-full-yoga
    We run all supported stable branch jobs periodically, and running only
    the latest and oldest supported in the check pipeline should be enough
    to catch any breaking change on stable branches.

    Relavant-Bug: #2004780
    Change-Id: I8a2da7288e3f2264ce3cc39115c1d807b21fff95

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/tempest/+/873441
Committed: https://opendev.org/openstack/tempest/commit/6bb98c2aa478f7ad32838fec4b59c4acb73ccf21
Submitter: "Zuul (22348)"
Branch: master

commit 6bb98c2aa478f7ad32838fec4b59c4acb73ccf21
Author: Ghanshyam Mann <email address hidden>
Date: Fri Feb 10 18:22:02 2023 -0600

    Prepare tempest-slow-parallel job and run periodically

    The tempest-slow-py3 job runs all the slow tests serially, which
    takes a lot of time and ends up in a job timeout. This prepares a
    tempest-slow-parallel job which will run slow tests in parallel
    in the periodic run. Based on the results, we can later make the
    tempest-slow-py3 job run tests in parallel.

    Also, run tempest-full-parallel in periodic; based on the
    results, we can run the tempest-full-py3 job scenario tests in parallel.

    Relavant-Bug: #2004780

    Change-Id: I876dacb40daa384cddc8faae3200cd3d39506ddc
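
To see why running slow tests in parallel can shorten a job, here is a small back-of-the-envelope sketch. It reuses the three durations quoted earlier in this report purely as sample numbers (they come from different builds), and the greedy "longest test to the least-loaded worker" assignment is an assumption made for the estimate; it does not model how the real test runner partitions tests.

```python
import heapq


def serial_time(durations):
    """Wall-clock time when tests run one after another."""
    return sum(durations)


def parallel_time(durations, workers):
    """Estimated wall-clock time with a greedy assignment.

    Each test, longest first, goes to the worker with the least
    accumulated time; the job finishes when the busiest worker does.
    """
    loads = [0.0] * workers
    heapq.heapify(loads)
    for duration in sorted(durations, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + duration)
    return max(loads)


# Durations in seconds, taken from the examples in this report.
SAMPLE_DURATIONS = [258.358054, 187.411266, 309.109043]
```

For these three samples, `serial_time` is about 755 seconds, while `parallel_time` with two workers is about 446 seconds, because the longest test runs alone on one worker.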

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/tempest/+/873442
Committed: https://opendev.org/openstack/tempest/commit/e2183ca8f6562675ac0c31583be8316e4ffec161
Submitter: "Zuul (22348)"
Branch: master

commit e2183ca8f6562675ac0c31583be8316e4ffec161
Author: Ghanshyam Mann <email address hidden>
Date: Fri Feb 10 19:31:52 2023 -0600

    Minimize the tests footprint in multinode job

    The multinode job runs all the tests, including multinode and
    non-multinode tests. But we do not need to run all the
    non-multinode tests in this job; smoke tests along
    with multinode tests should be enough. This makes
    multinode jobs run only smoke and multinode tests. For
    that, we need to tag the multinode tests with the 'multinode' attr.

    Relavant-Bug: #2004780
    Change-Id: I7e87d1db3ef3a00b3d27f0904d0af6a270e03837
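
A minimal sketch of selecting only smoke and multinode tests by tag follows. It assumes that each test ID ends with a bracketed, comma-separated list of its attributes; the test IDs, the regex, and the helper name are all hypothetical and are not copied from the actual job definition.

```python
import re

# Hypothetical test IDs; the bracketed suffix lists each test's attributes.
TEST_IDS = [
    "tempest.api.compute.test_a.ATest.test_list[id-1,smoke]",
    "tempest.api.compute.test_b.BTest.test_live_migration[id-2,multinode]",
    "tempest.api.compute.test_c.CTest.test_resize[id-3]",
    "tempest.scenario.test_d.DTest.test_boot[id-4,slow]",
]

# Match IDs whose bracketed attribute list contains smoke or multinode.
SMOKE_OR_MULTINODE = re.compile(r"\[.*\b(smoke|multinode)\b.*\]")


def select_for_multinode_job(test_ids):
    """Keep only tests tagged smoke or multinode."""
    return [t for t in test_ids if SMOKE_OR_MULTINODE.search(t)]
```

Applied to the sample list, only the first two IDs are kept; the untagged and slow-only tests are left to other jobs.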

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tempest (master)

Reviewed: https://review.opendev.org/c/openstack/tempest/+/873055
Committed: https://opendev.org/openstack/tempest/commit/518e426ab4ff28db11654f8309241ab215b0e42b
Submitter: "Zuul (22348)"
Branch: master

commit 518e426ab4ff28db11654f8309241ab215b0e42b
Author: Ghanshyam Mann <email address hidden>
Date: Fri Feb 10 19:57:36 2023 -0600

    Separate the extra tests to run in a separate job

    Recently we are seeing a lot of job timeouts (bug #2004780),
    many tests are taking a long time, and the number of tests
    is increasing over time. This prepares the list of extra tests
    (here extra tests means the tests which are covered by other
    API or scenario tests) which we do not need to run in every
    integrated job. Instead, we can run them in separate job(s).

    Currently I am adding admin (except keystone) and negative tests
    to the 'extra tests' list, but we can add more tests here which
    we think are covered by some other tests.

    As negative tests are important for interop, this adds that extra
    test coverage to the stable branch jobs also, but runs them in
    the periodic run only.

    Related-Bug: #2004780
    Change-Id: Id02221df0d6180519751c63e890851bd59fdafa0

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tempest (master)

Change abandoned by "Ghanshyam <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tempest/+/873472

Revision history for this message
Ghanshyam Mann (ghanshyammann) wrote : Re: tempest jobs and tests are more frequent timeout in 2023.1 cycle

The changes below improved the tempest-slow-py3 job time, which is now reduced to almost half.

https://review.opendev.org/c/openstack/tempest/+/887237/4

https://review.opendev.org/c/openstack/tempest/+/889129/2

Revision history for this message
Ghanshyam Mann (ghanshyammann) wrote :

Another attempt to improve the job execution time, by increasing the default concurrency:

https://review.opendev.org/c/openstack/tempest/+/887220/9

But the test execution balance among workers is still not the best.

Changed in tempest:
importance: High → Critical
Revision history for this message
Ghanshyam Mann (ghanshyammann) wrote :

Changing it to Critical because almost every change in tempest is hitting it, which makes it difficult to merge changes now.

Revision history for this message
Ghanshyam Mann (ghanshyammann) wrote :

The changes below refactor tests to execute resource creation and the wait for status in parallel. That improves the test execution time.

- https://review.opendev.org/q/topic:split-sat
- https://review.opendev.org/c/openstack/tempest/+/889207

summary: - tempest jobs and tests are more frequent timeout in 2023.1 cycle
+ tempest jobs and tests are more frequent timeout since 2023.1 release
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tempest (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tempest/+/890469

Revision history for this message
Ghanshyam Mann (ghanshyammann) wrote :

These are a few more changes addressing this issue:
- https://review.opendev.org/q/topic:tempest-job-timeout

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tempest/+/890573

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tempest (master)

Reviewed: https://review.opendev.org/c/openstack/tempest/+/890469
Committed: https://opendev.org/openstack/tempest/commit/fd90dacc8e1edfa407eaede4a83147d3f87ef424
Submitter: "Zuul (22348)"
Branch: master

commit fd90dacc8e1edfa407eaede4a83147d3f87ef424
Author: Ghanshyam Mann <email address hidden>
Date: Thu Aug 3 16:26:44 2023 -0700

    Skip test early to improve memory footprint and time

    When we skip the test class using skip_checks(), it checks
    the conditions and skips the test class at the first step
    without creating any keystone credentials. But when
    tests are skipped with another decorator at the test level,
    it does create keystone credentials and set up network resources
    and service clients.

    When all the tests in a test class are skipped based on a common
    condition, it is better to skip them using skip_checks()
    so that we do not create any keystone or network resources. This
    reduces the DB queries that keystone and neutron make and also
    speeds up the test skip.

    Related-Bug: #2004780
    Change-Id: Id5e6ddcb83aaa6133c28ef188183d98e26e4925b
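
The difference between the two ways of skipping can be sketched with plain `unittest`. The base class below is a simplified, hypothetical stand-in for the ordering described in the commit message (skip check first, credentials afterwards); `FEATURE_ENABLED`, the class names, and the counter are invented for the example and are not tempest code.

```python
import unittest

FEATURE_ENABLED = False  # hypothetical condition shared by all tests in a class


class BaseTest(unittest.TestCase):
    """Simplified stand-in: run the skip check before any expensive setup."""

    credentials_created = 0

    @classmethod
    def setUpClass(cls):
        cls.skip_checks()        # may raise SkipTest before anything is built
        cls.setup_credentials()  # only reached when the class is not skipped

    @classmethod
    def skip_checks(cls):
        pass

    @classmethod
    def setup_credentials(cls):
        BaseTest.credentials_created += 1


class LateSkip(BaseTest):
    # Skipped at test level: class setup has already run by then.
    @unittest.skipUnless(FEATURE_ENABLED, "feature disabled")
    def test_feature(self):
        pass


class EarlySkip(BaseTest):
    # Skipped at class level, before credentials are created.
    @classmethod
    def skip_checks(cls):
        super().skip_checks()
        if not FEATURE_ENABLED:
            raise unittest.SkipTest("feature disabled")

    def test_feature(self):
        pass


def credentials_created_by(test_class):
    """Run one test class and report how many credential setups it caused."""
    BaseTest.credentials_created = 0
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_class)
    suite.run(unittest.TestResult())
    return BaseTest.credentials_created
```

Both classes report their single test as skipped, but only `LateSkip` pays for the credential setup first.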

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/tempest/+/890573
Committed: https://opendev.org/openstack/tempest/commit/2803b57d6cbdf8281a0a495595c071f49e915042
Submitter: "Zuul (22348)"
Branch: master

commit 2803b57d6cbdf8281a0a495595c071f49e915042
Author: Ghanshyam Mann <email address hidden>
Date: Fri Aug 4 12:11:59 2023 -0700

    Skip scenario tests early to avoid unnecessary setup

    This change makes the volume scenario tests
    skip early.

    When we skip the test class using skip_checks(), it checks
    the conditions and skips the test class at the first step
    without creating any keystone credentials. But when
    tests are skipped with another decorator at the test level,
    it does create keystone credentials and set up network resources
    and service clients.

    This will mostly help the neutron gate, where these volume
    tests will be skipped at the initial stage and will
    not create the keystone and network resources.

    One good example is TestEncryptedCinderVolumes, which is skipped
    - https://zuul.openstack.org/build/babcc06f24764a408ed77702365b4c5b/log/job-output.txt#28695

    But it still does the resource setup
    - https://zuul.openstack.org/build/babcc06f24764a408ed77702365b4c5b/log/controller/logs/tempest_log.txt#6374-6450

    Related-Bug: #2004780
    Change-Id: I59cd39c20b995bf2ed2f58f4522743c3ca51b516

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tempest (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tempest/+/890689

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tempest (master)

Reviewed: https://review.opendev.org/c/openstack/tempest/+/890689
Committed: https://opendev.org/openstack/tempest/commit/68a25ef77be9734e5675d50efd9fe4050477c054
Submitter: "Zuul (22348)"
Branch: master

commit 68a25ef77be9734e5675d50efd9fe4050477c054
Author: Ghanshyam Mann <email address hidden>
Date: Mon Aug 7 10:07:12 2023 -0700

    Setting Tempest run concurrency to 4 for a few jobs

    We recently changed the default concurrency to a higher
    value (number of CPUs - 2), which ends up as 6 in upstream CI.
    Higher concurrency means more parallel requests to services
    and can cause more OOM issues. To avoid the OOM issue, set
    the concurrency to 4 in a few of the jobs which run more
    parallel tests.

    Related-Bug: #2004780
    Change-Id: Ifa2ac35453e17ca01378ebebb310a4719b704fef
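
The arithmetic in this commit message can be written out as a short sketch. The function names are hypothetical, the floor of one worker is an assumption added so the result is never zero or negative, and the figure of 8 CPUs is inferred from the statement that the default ends up as 6.

```python
import os


def default_concurrency(cpu_count=None):
    """Default described above: number of CPUs minus 2 (assumed floor of 1)."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    return max(cpu_count - 2, 1)


def job_concurrency(cpu_count, override=None):
    """Use a per-job override when one is set, else the default."""
    if override is not None:
        return override
    return default_concurrency(cpu_count)
```

On an 8-CPU node, `job_concurrency(8)` gives 6, and a job that pins the value with `job_concurrency(8, override=4)` runs with 4 workers, which trades some speed for fewer parallel requests to the services.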
