tripleo

[Master] CI jobs failing randomly as pcs resource operations/actions time out

Bug #1938283 reported by yatin on 2021-07-28

This bug affects 2 people

Affects	Status	Importance	Assigned to	Milestone
tripleo	Fix Released	Critical	yatin	tripleo xena-3

Bug Description

The jobs fail randomly at different stages, sometimes during deployment and sometimes while running tempest.

pcs resource status is unhealthy on the jobs that fail this way:
Failed Resource Actions:
  * haproxy-bundle-podman-0_stop_0 on standalone 'error' (1): call=87, status='Timed Out', exitreason='', last-rc-change='2021-07-28 06:50:54Z', queued=0ms, exec=20002ms
  * galera-bundle-podman-0_stop_0 on standalone 'error' (1): call=76, status='Timed Out', exitreason='', last-rc-change='2021-07-28 06:50:06Z', queued=0ms, exec=20006ms
  * rabbitmq-bundle-podman-0_stop_0 on standalone 'error' (1): call=72, status='Timed Out', exitreason='', last-rc-change='2021-07-28 06:49:34Z', queued=0ms, exec=20005ms
  * redis-bundle-podman-0_stop_0 on standalone 'error' (1): call=78, status='Timed Out', exitreason='', last-rc-change='2021-07-28 06:50:06Z', queued=0ms, exec=20001ms
  * ovn-dbs-bundle-podman-0_stop_0 on standalone 'error' (1): call=88, status='Timed Out', exitreason='', last-rc-change='2021-07-28 06:50:54Z', queued=0ms, exec=20002ms
  * openstack-cinder-backup-podman-0_stop_0 on standalone 'error' (1): call=74, status='Timed Out', exitreason='', last-rc-change='2021-07-28 06:49:34Z', queued=0ms, exec=20004ms
  * openstack-cinder-volume-podman-0_stop_0 on standalone 'error' (1): call=82, status='Timed Out', exitreason='', last-rc-change='2021-07-28 06:50:28Z', queued=0ms, exec=20799ms
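
The common pattern in the listing above is that every failed stop action ran for just over 20000 ms. A minimal Python sketch for pulling that out of a saved pcs status dump follows. The regular expression is written against the line format shown in this report only, and the function name and the `timeout_ms` parameter are illustrative assumptions, not part of pcs.

```python
import re

# Matches one entry of the "Failed Resource Actions" listing as it appears in
# this report. Other pcs/pacemaker versions may format these lines differently.
FAILED_ACTION = re.compile(
    r"\*\s+(?P<action>\S+) on (?P<node>\S+) .*?"
    r"status='(?P<status>[^']*)'.*?exec=(?P<exec_ms>\d+)ms"
)


def timed_out_actions(text, timeout_ms=20000):
    """Return (action, exec_ms) pairs that timed out at or beyond timeout_ms."""
    hits = []
    for line in text.splitlines():
        match = FAILED_ACTION.search(line)
        if not match:
            continue
        exec_ms = int(match.group("exec_ms"))
        if match.group("status") == "Timed Out" and exec_ms >= timeout_ms:
            hits.append((match.group("action"), exec_ms))
    return hits


# One line copied from the listing above.
sample = (
    "  * haproxy-bundle-podman-0_stop_0 on standalone 'error' (1): call=87, "
    "status='Timed Out', exitreason='', "
    "last-rc-change='2021-07-28 06:50:54Z', queued=0ms, exec=20002ms"
)
print(timed_out_actions(sample))
```

Running this over a whole pcs.txt file (for example, one of the logs linked below) would list every action that hit the 20s limit.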

From the pacemaker logs:
Jul 28 06:49:48 standalone.localdomain pacemaker-controld [389429] (throttle_check_thresholds) info: Moderate CPU load detected: 11.400000
Jul 28 06:49:48 standalone.localdomain pacemaker-controld [389429] (throttle_send_command) info: New throttle mode: medium load (was negligible)
Jul 28 06:49:54 standalone.localdomain pacemaker-execd [389426] (child_timeout_callback) warning: rabbitmq-bundle-podman-0_stop_0 process (PID 567787) timed out
Jul 28 06:49:54 standalone.localdomain pacemaker-execd [389426] (operation_finished) warning: rabbitmq-bundle-podman-0_stop_0[567787] timed out after 20000ms
Jul 28 06:49:54 standalone.localdomain pacemaker-execd [389426] (log_finished) info: rabbitmq-bundle-podman-0 stop (call 72, PID 567787) exited with status 1

Jul 28 06:50:18 standalone.localdomain pacemaker-controld [389429] (throttle_check_thresholds) info: Moderate CPU load detected: 11.430000
Jul 28 06:50:23 podman(galera-bundle-podman-0)[569777]: INFO: 67d41dc0f29c51917d9fb5553925205f58fada6c72f4fedd267e867fdd28221c
Jul 28 06:50:24 podman(galera-bundle-podman-0)[569777]: NOTICE: Cleaning up inactive container, galera-bundle-podman-0.
Jul 28 06:50:26 standalone.localdomain pacemaker-execd [389426] (child_timeout_callback) warning: galera-bundle-podman-0_stop_0 process (PID 569777) timed out
Jul 28 06:50:26 standalone.localdomain pacemaker-execd [389426] (operation_finished) warning: galera-bundle-podman-0_stop_0[569777] timed out after 20000ms
Jul 28 06:50:26 standalone.localdomain pacemaker-execd [389426] (log_finished) info: galera-bundle-podman-0 stop (call 76, PID 569777) exited with status 1

Example logs:
https://logserver.rdoproject.org/01/34701/1/check/rdoinfo-tripleo-master-testing-centos-8-scenario001-standalone/5ba0b6f/logs/undercloud/var/log/extra/pcs.txt.gz
https://logserver.rdoproject.org/01/34701/1/check/rdoinfo-tripleo-master-testing-centos-8-scenario001-standalone/205cfe2/logs/undercloud/var/log/extra/pcs.txt.gz
https://logserver.rdoproject.org/01/34701/1/check/rdoinfo-tripleo-master-testing-centos-8-scenario001-standalone/7fffd3e/logs/undercloud/var/log/extra/pcs.txt.gz
https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/3a8026d/logs/overcloud-controller-0/var/log/extra/pcs.txt.gz
https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/d6f41d6/logs/overcloud-controller-0/var/log/extra/pcs.txt.gz
https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario001-standalone-master/e7a09e6/logs/undercloud/var/log/extra/pcs.txt.gz

https://review.opendev.org/c/openstack/tripleo-heat-templates/+/791416 triggered this by changing the timeout setting: before that patch the operation timeout was 120s, and now it falls back to the default of 20s, which leads to failures when the system is under load.
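
The fallback described above can be modelled roughly as follows. This is an illustrative sketch, not the tripleo-heat-templates logic: the function name, the use of an empty string for "no op default set", and the 30-second stop duration are assumptions. The 20s and 120s figures are the ones quoted in this report.

```python
PACEMAKER_DEFAULT_OP_TIMEOUT_S = 20  # default quoted in this report


def effective_op_timeout(configured):
    """Model which timeout (in seconds) applies to a resource operation.

    `configured` is a string such as '120s'. An empty string stands for
    "no op default was set", which is an assumption of this sketch.
    """
    if not configured:
        return PACEMAKER_DEFAULT_OP_TIMEOUT_S
    if not configured.endswith("s") or not configured[:-1].isdigit():
        raise ValueError(f"unsupported timeout value: {configured!r}")
    return int(configured[:-1])


# Hypothetical duration of a container stop on a loaded CI host.
needed_ms = 30000
for setting in ("", "120s"):
    limit_ms = effective_op_timeout(setting) * 1000
    verdict = "times out" if needed_ms >= limit_ms else "completes"
    print(f"timeout={setting or 'unset'}: stop {verdict}")
```

With the timeout unset, any stop that needs more than 20 seconds is reported as 'Timed Out', which matches the exec values of about 20000ms in the failed actions above.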

yatin (yatinkarel) on 2021-07-28

Changed in tripleo:
milestone:	none → xena-3

OpenStack Infra (hudson-openstack) wrote on 2021-07-28: Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/802696

Changed in tripleo:
status:	Triaged → In Progress

yatin (yatinkarel) on 2021-07-28

Changed in tripleo:
assignee:	nobody → yatin (yatinkarel)

OpenStack Infra (hudson-openstack) wrote on 2021-07-29: Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/802696
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/74ae036024b83f198c24a863dce51c948f2364eb
Submitter: "Zuul (22348)"
Branch: master

commit 74ae036024b83f198c24a863dce51c948f2364eb
Author: yatinkarel <email address hidden>
Date: Wed Jul 28 18:14:58 2021 +0530

Fix condition for pacemaker resource_op_defaults

After [1] pacemaker resource_op_defaults timeout
was not set correctly, update default value for
PacemakerBundleOperationTimeout to '120s' and
remove unnecessary conditions for timeout set
and podman enabled as currently only podman
is supported as container_cli.

Also update regex to not allow empty value for
PacemakerBundleOperationTimeout.

[1] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/791416

Closes-Bug: #1938283
Change-Id: I97ab65eb5b5fa478d323be6f9f981a1e2a875f86
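
The last point of the commit message, rejecting an empty value for PacemakerBundleOperationTimeout, can be illustrated with a small Python sketch. Both patterns below are hypothetical stand-ins chosen for illustration. The actual constraint is in the linked review and may differ.

```python
import re

# Hypothetical patterns: the optional group lets the empty string through,
# while the strict form requires a positive number of seconds such as '120s'.
OPTIONAL_TIMEOUT = re.compile(r"([1-9][0-9]*s)?")
REQUIRED_TIMEOUT = re.compile(r"[1-9][0-9]*s")


def is_valid(pattern, value):
    """Return True when the whole value matches the pattern."""
    return pattern.fullmatch(value) is not None


for value in ("", "120s", "0s", "120"):
    print(
        repr(value),
        is_valid(OPTIONAL_TIMEOUT, value),
        is_valid(REQUIRED_TIMEOUT, value),
    )
```

Under these assumed patterns, only the optional form accepts the empty string, so tightening the pattern prevents the timeout from silently ending up unset.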

Changed in tripleo:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2021-07-29: Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/802825

OpenStack Infra (hudson-openstack) wrote on 2021-07-29: Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/802825
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/c4097a028e0b446eae7e0fb4a75c3a327ac4c55d
Submitter: "Zuul (22348)"
Branch: master

commit c4097a028e0b446eae7e0fb4a75c3a327ac4c55d
Author: yatinkarel <email address hidden>
Date: Thu Jul 29 10:43:09 2021 +0530

Make indentation consistent in pacemaker config_settings

Follow up of [1].

[1] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/802696

Related-Bug: #1938283
Change-Id: I2b6bfa9afe36c3ea489648aa9481924fb4bb2acb

OpenStack Infra (hudson-openstack) wrote on 2021-10-18: Fix included in openstack/tripleo-heat-templates 15.1.0

This issue was fixed in the openstack/tripleo-heat-templates 15.1.0 release.
