[Master] CI jobs failing randomly as pcs resource operations/actions time out

Bug #1938283 reported by yatin
This bug affects 2 people
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: yatin

Bug Description

The jobs fail randomly at different places: sometimes during deployment and sometimes while running tempest.

pcs resource status is unhealthy on the jobs that fail this way:-
Failed Resource Actions:
  * haproxy-bundle-podman-0_stop_0 on standalone 'error' (1): call=87, status='Timed Out', exitreason='', last-rc-change='2021-07-28 06:50:54Z', queued=0ms, exec=20002ms
  * galera-bundle-podman-0_stop_0 on standalone 'error' (1): call=76, status='Timed Out', exitreason='', last-rc-change='2021-07-28 06:50:06Z', queued=0ms, exec=20006ms
  * rabbitmq-bundle-podman-0_stop_0 on standalone 'error' (1): call=72, status='Timed Out', exitreason='', last-rc-change='2021-07-28 06:49:34Z', queued=0ms, exec=20005ms
  * redis-bundle-podman-0_stop_0 on standalone 'error' (1): call=78, status='Timed Out', exitreason='', last-rc-change='2021-07-28 06:50:06Z', queued=0ms, exec=20001ms
  * ovn-dbs-bundle-podman-0_stop_0 on standalone 'error' (1): call=88, status='Timed Out', exitreason='', last-rc-change='2021-07-28 06:50:54Z', queued=0ms, exec=20002ms
  * openstack-cinder-backup-podman-0_stop_0 on standalone 'error' (1): call=74, status='Timed Out', exitreason='', last-rc-change='2021-07-28 06:49:34Z', queued=0ms, exec=20004ms
  * openstack-cinder-volume-podman-0_stop_0 on standalone 'error' (1): call=82, status='Timed Out', exitreason='', last-rc-change='2021-07-28 06:50:28Z', queued=0ms, exec=20799ms

From pacemaker logs:-
Jul 28 06:49:48 standalone.localdomain pacemaker-controld [389429] (throttle_check_thresholds) info: Moderate CPU load detected: 11.400000
Jul 28 06:49:48 standalone.localdomain pacemaker-controld [389429] (throttle_send_command) info: New throttle mode: medium load (was negligible)
Jul 28 06:49:54 standalone.localdomain pacemaker-execd [389426] (child_timeout_callback) warning: rabbitmq-bundle-podman-0_stop_0 process (PID 567787) timed out
Jul 28 06:49:54 standalone.localdomain pacemaker-execd [389426] (operation_finished) warning: rabbitmq-bundle-podman-0_stop_0[567787] timed out after 20000ms
Jul 28 06:49:54 standalone.localdomain pacemaker-execd [389426] (log_finished) info: rabbitmq-bundle-podman-0 stop (call 72, PID 567787) exited with status 1

Jul 28 06:50:18 standalone.localdomain pacemaker-controld [389429] (throttle_check_thresholds) info: Moderate CPU load detected: 11.430000
Jul 28 06:50:23 podman(galera-bundle-podman-0)[569777]: INFO: 67d41dc0f29c51917d9fb5553925205f58fada6c72f4fedd267e867fdd28221c
Jul 28 06:50:24 podman(galera-bundle-podman-0)[569777]: NOTICE: Cleaning up inactive container, galera-bundle-podman-0.
Jul 28 06:50:26 standalone.localdomain pacemaker-execd [389426] (child_timeout_callback) warning: galera-bundle-podman-0_stop_0 process (PID 569777) timed out
Jul 28 06:50:26 standalone.localdomain pacemaker-execd [389426] (operation_finished) warning: galera-bundle-podman-0_stop_0[569777] timed out after 20000ms
Jul 28 06:50:26 standalone.localdomain pacemaker-execd [389426] (log_finished) info: galera-bundle-podman-0 stop (call 76, PID 569777) exited with status 1

Example logs:-
https://logserver.rdoproject.org/01/34701/1/check/rdoinfo-tripleo-master-testing-centos-8-scenario001-standalone/5ba0b6f/logs/undercloud/var/log/extra/pcs.txt.gz
https://logserver.rdoproject.org/01/34701/1/check/rdoinfo-tripleo-master-testing-centos-8-scenario001-standalone/205cfe2/logs/undercloud/var/log/extra/pcs.txt.gz
https://logserver.rdoproject.org/01/34701/1/check/rdoinfo-tripleo-master-testing-centos-8-scenario001-standalone/7fffd3e/logs/undercloud/var/log/extra/pcs.txt.gz
https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/3a8026d/logs/overcloud-controller-0/var/log/extra/pcs.txt.gz
https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/d6f41d6/logs/overcloud-controller-0/var/log/extra/pcs.txt.gz
https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario001-standalone-master/e7a09e6/logs/undercloud/var/log/extra/pcs.txt.gz

https://review.opendev.org/c/openstack/tripleo-heat-templates/+/791416 triggered this by changing the timeout setting: before that patch the timeout was 120s, and now it falls back to the default of 20s, which leads to failures when the system is under load.
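
To check whether another job hit the same problem, the "Failed Resource Actions" section of pcs.txt can be scanned for operations that timed out right around the 20s default. The following is only a rough sketch: the helper name timed_out_near, the regular expression and the 1000ms slack are assumptions made for illustration, and the sample lines are copied from the output above.

```python
import re

# Matches one entry of the "Failed Resource Actions" list shown in pcs.txt.
ACTION_RE = re.compile(
    r"\*\s+(?P<op>\S+) on (?P<node>\S+) .*?"
    r"status='(?P<status>[^']*)'.*?exec=(?P<exec_ms>\d+)ms"
)


def timed_out_near(text, timeout_ms=20000, slack_ms=1000):
    """Return (operation, exec_ms) for timed-out actions that ran about timeout_ms.

    slack_ms is an arbitrary tolerance chosen for this sketch.
    """
    hits = []
    for m in ACTION_RE.finditer(text):
        exec_ms = int(m.group("exec_ms"))
        over = exec_ms - timeout_ms
        if m.group("status") == "Timed Out" and 0 <= over <= slack_ms:
            hits.append((m.group("op"), exec_ms))
    return hits


SAMPLE = """\
  * haproxy-bundle-podman-0_stop_0 on standalone 'error' (1): call=87, status='Timed Out', exitreason='', last-rc-change='2021-07-28 06:50:54Z', queued=0ms, exec=20002ms
  * openstack-cinder-volume-podman-0_stop_0 on standalone 'error' (1): call=82, status='Timed Out', exitreason='', last-rc-change='2021-07-28 06:50:28Z', queued=0ms, exec=20799ms
"""

for op, ms in timed_out_near(SAMPLE):
    print(op, ms)
```

If every failed stop action lands just above 20000ms, as in the output above, the job was most likely running with the 20s default rather than the intended 120s.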

yatin (yatinkarel)
Changed in tripleo:
milestone: none → xena-3
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)
Changed in tripleo:
status: Triaged → In Progress
yatin (yatinkarel)
Changed in tripleo:
assignee: nobody → yatin (yatinkarel)
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/802696
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/74ae036024b83f198c24a863dce51c948f2364eb
Submitter: "Zuul (22348)"
Branch: master

commit 74ae036024b83f198c24a863dce51c948f2364eb
Author: yatinkarel <email address hidden>
Date: Wed Jul 28 18:14:58 2021 +0530

    Fix condition for pacemaker resource_op_defaults

    After [1] pacemaker resource_op_defaults timeout
    was not set correctly, update default value for
    PacemakerBundleOperationTimeout to '120s' and
    remove unnecessary conditions for timeout set
    and podman enabled as currently only podman
    is supported as container_cli.

    Also update regex to not allow empty value for
    PacemakerBundleOperationTimeout.

    [1] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/791416

    Closes-Bug: #1938283
    Change-Id: I97ab65eb5b5fa478d323be6f9f981a1e2a875f86
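
The exact pattern used for PacemakerBundleOperationTimeout is not quoted in this report, so the snippet below is only an illustration of the idea in the commit message: a validation regex whose whole body is optional also accepts an empty string, while a stricter one rejects it. Both patterns and the helper is_valid_timeout are hypothetical, not taken from tripleo-heat-templates.

```python
import re

# Hypothetical patterns, NOT the ones from the template.
# LENIENT makes the whole value optional, so "" passes validation.
LENIENT = re.compile(r"([1-9][0-9]*s)?")
# STRICT requires at least one digit followed by "s", so "" is rejected.
STRICT = re.compile(r"[1-9][0-9]*s")


def is_valid_timeout(value, pattern=STRICT):
    """Return True if the whole value matches the given pattern."""
    return pattern.fullmatch(value) is not None


for candidate in ("120s", "", "s"):
    print(repr(candidate), is_valid_timeout(candidate, LENIENT), is_valid_timeout(candidate))
```

With an empty value accepted, the parameter can end up unset and the operation timeout falls back to the 20s default described in this bug.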

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/802825

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/802825
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/c4097a028e0b446eae7e0fb4a75c3a327ac4c55d
Submitter: "Zuul (22348)"
Branch: master

commit c4097a028e0b446eae7e0fb4a75c3a327ac4c55d
Author: yatinkarel <email address hidden>
Date: Thu Jul 29 10:43:09 2021 +0530

    Make indentation consistent in pacemaker config_settings

    Follow up of [1].

    [1] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/802696

    Related-Bug: #1938283
    Change-Id: I2b6bfa9afe36c3ea489648aa9481924fb4bb2acb

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 15.1.0

This issue was fixed in the openstack/tripleo-heat-templates 15.1.0 release.
