tripleo

M/N upgrades - relax pre-upgrade check for failed actions

Bug #1628653 reported by Michele Baldessari on 2016-09-28

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	tripleo	Fix Released	Critical	Michele Baldessari	newton-rc2

Bug Description

So I'd like to start a discussion about potentially relaxing the pre-upgrade check for failed actions. The reason is the following bug: https://bugs.launchpad.net/tripleo/+bug/1628632

On Mitaka with Ceph, the following failed actions are present right after the deployment:
Failed Actions:
* openstack-gnocchi-metricd_monitor_60000 on overcloud-controller-1 'not running' (7): call=358, status=complete, exitreason='none',
    last-rc-change='Wed Sep 28 19:04:55 2016', queued=0ms, exec=0ms
* openstack-gnocchi-statsd_start_0 on overcloud-controller-1 'not running' (7): call=277, status=complete, exitreason='none',
    last-rc-change='Wed Sep 28 19:01:49 2016', queued=1ms, exec=2125ms
* openstack-gnocchi-metricd_monitor_60000 on overcloud-controller-0 'not running' (7): call=364, status=complete, exitreason='none',
    last-rc-change='Wed Sep 28 19:04:55 2016', queued=0ms, exec=0ms
* openstack-gnocchi-statsd_start_0 on overcloud-controller-0 'not running' (7): call=280, status=complete, exitreason='none',
    last-rc-change='Wed Sep 28 19:01:49 2016', queued=0ms, exec=2138ms
* openstack-gnocchi-metricd_monitor_60000 on overcloud-controller-2 'not running' (7): call=353, status=complete, exitreason='none',
    last-rc-change='Wed Sep 28 19:04:55 2016', queued=0ms, exec=0ms
* openstack-gnocchi-statsd_start_0 on overcloud-controller-2 'not running' (7): call=272, status=complete, exitreason='none',
    last-rc-change='Wed Sep 28 19:01:49 2016', queued=1ms, exec=2152ms

If the operator takes no action (for example, because they are not using Gnocchi and related services), the upgrade will fail at the pre-upgrade check for failed actions.

Should we care about this situation, or should we simply fix the above bug and require that the operator *must* make sure there are no failed actions?

On the one hand, I'd prefer a clean fix where it belongs (that is, in Gnocchi on Mitaka). On the other hand, a failed action might have happened in the distant past while all resources are currently up and running, so stopping an upgrade because of it is a bit of a big hammer.
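
To make the "distant past" point concrete, here is a minimal sketch that extracts the node, return code and last-rc-change timestamp from a "Failed Actions:" listing like the one above. The function name, the regular expression and the dict layout are illustrative assumptions, not part of pcs or tripleo-heat-templates.

```python
import re
from datetime import datetime

# Matches one entry of the "Failed Actions:" listing, whether the
# last-rc-change field sits on the same line or on a continuation line.
FAILED_ACTION_RE = re.compile(
    r"\*\s+(?P<operation>\S+)\s+on\s+(?P<node>\S+)\s+'(?P<reason>[^']*)'\s+"
    r"\((?P<rc>\d+)\):.*?last-rc-change='(?P<changed>[^']*)'",
    re.DOTALL,
)


def parse_failed_actions(status_text):
    """Return one dict per failed action found in the status text."""
    actions = []
    for match in FAILED_ACTION_RE.finditer(status_text):
        actions.append({
            "operation": match.group("operation"),
            "node": match.group("node"),
            "reason": match.group("reason"),
            "rc": int(match.group("rc")),
            # Assumes English day and month names, as in the listing above.
            "changed": datetime.strptime(
                match.group("changed"), "%a %b %d %H:%M:%S %Y"
            ),
        })
    return actions


# Two entries copied from the listing in this report.
sample = """Failed Actions:
* openstack-gnocchi-metricd_monitor_60000 on overcloud-controller-1 'not running' (7): call=358, status=complete, exitreason='none',
    last-rc-change='Wed Sep 28 19:04:55 2016', queued=0ms, exec=0ms
* openstack-gnocchi-statsd_start_0 on overcloud-controller-1 'not running' (7): call=277, status=complete, exitreason='none',
    last-rc-change='Wed Sep 28 19:01:49 2016', queued=1ms, exec=2125ms
"""
actions = parse_failed_actions(sample)
```

Each entry only records when the action failed; it says nothing about whether the resource is running now, which is why a check based purely on this listing can block an otherwise healthy cluster.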

Michele Baldessari (michele) wrote on 2016-09-28:

So I definitely think we should tweak this. I had at least one upgrade job fail because of failed actions:
Failed Actions:
* memcached_monitor_60000 on overcloud-controller-1 'not running' (7): call=41, status=complete, exitreason='none', last-rc-change='Wed Sep 28 18:58:44 2016', queued=0ms, exec=0ms
* mongod_monitor_60000 on overcloud-controller-1 'not running' (7): call=82, status=complete, exitreason='none',last-rc-change='Wed Sep 28 18:58:06 2016', queued=0ms, exec=0ms
....

But we actually had all the resources running:
[root@overcloud-controller-0 ~]# pcs status |grep -i stopped
[root@overcloud-controller-0 ~]#
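
The difference between the two checks can be sketched as follows. This is only an illustration operating on captured `pcs status` text: both function names are hypothetical, and treating the presence of a "Failed Actions:" header as the strict check is an assumption about how such a check might be written, not the actual template code.

```python
def has_failed_actions(status_text):
    """Strict check (assumed form): any "Failed Actions:" section counts."""
    return "Failed Actions:" in status_text


def stopped_lines(status_text):
    """Relaxed check: the lines `grep -i stopped` would print."""
    return [
        line for line in status_text.splitlines()
        if "stopped" in line.lower()
    ]


# Failed actions copied from the comment above; no resource is stopped.
status_text = """Failed Actions:
* memcached_monitor_60000 on overcloud-controller-1 'not running' (7): call=41, status=complete, exitreason='none', last-rc-change='Wed Sep 28 18:58:44 2016', queued=0ms, exec=0ms
* mongod_monitor_60000 on overcloud-controller-1 'not running' (7): call=82, status=complete, exitreason='none',last-rc-change='Wed Sep 28 18:58:06 2016', queued=0ms, exec=0ms
"""

strict_blocks_upgrade = has_failed_actions(status_text)
relaxed_blocks_upgrade = bool(stopped_lines(status_text))
```

With this input the strict check would block the upgrade while the relaxed one would let it proceed, matching the empty `grep` result shown above.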

OpenStack Infra (hudson-openstack) wrote on 2016-09-28: Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/378998

Changed in tripleo:
assignee:	nobody → Michele Baldessari (michele)
status:	New → In Progress

Michele Baldessari (michele) on 2016-09-29

Changed in tripleo:
importance:	Medium → High
milestone:	none → newton-rc2

Michele Baldessari (michele) wrote on 2016-09-29:

Raising the priority because there is actually a race in there.

Changed in tripleo:
importance:	High → Critical

OpenStack Infra (hudson-openstack) wrote on 2016-09-29: Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/378998
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=32c54304f489405ea2e3ab67f5de236ab6f2e5ec
Submitter: Jenkins
Branch: master

commit 32c54304f489405ea2e3ab67f5de236ab6f2e5ec
Author: Michele Baldessari <email address hidden>
Date: Wed Sep 28 22:55:25 2016 +0200

    Relax pre-upgrade check for failed actions

    Before this change we checked the cluster for any failed actions and
    we stopped the upgrade process if there were any.
    This is likely excessive as a failed action could have happened in the
    past and the cluster is now fully functional.

    Better to check if any of the resources are in Stopped state and break
    the upgrade process if any of them are.

    We also need to restrict this check to the bootstrap node because
    otherwise the following might happen:
    1) Bootstrap node does the check, it is successful and it starts
       the full HA -> HA NG migration which *will* create failed actions
       and will start stopping resources
    2) If the check now starts on a non-bootstrap node while 1) is ongoing,
       it will find either failed actions or stopped resources so it will
       fail.

    Change-Id: Ib091f6dd8884025d2e23bf2fa700169e2dec778f
    Closes-Bug: #1628653
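
The logic described in the commit message can be sketched roughly as below. This is not the code from the review: the function name, its parameters and the idea of passing in a callable that returns the `pcs status` text are all assumptions made so the example runs without a cluster.

```python
def pre_upgrade_check(this_node, bootstrap_node, get_status_text):
    """Return True if this node may proceed with the upgrade.

    get_status_text is a hypothetical callable returning the cluster
    status text; it is only invoked on the bootstrap node.
    """
    if this_node != bootstrap_node:
        # Non-bootstrap nodes skip the check. Otherwise they could observe
        # resources being stopped by the migration the bootstrap node has
        # already started, and fail for no good reason.
        return True
    # Relaxed check: only resources currently in Stopped state block the
    # upgrade; failed actions recorded in the past are ignored.
    return "stopped" not in get_status_text().lower()


# Stub standing in for a cluster that is mid-migration (made-up text).
def status_during_migration():
    return "example-resource (made-up line): Stopped"


bootstrap_ok = pre_upgrade_check(
    "overcloud-controller-0", "overcloud-controller-0", status_during_migration
)
other_node_ok = pre_upgrade_check(
    "overcloud-controller-1", "overcloud-controller-0", status_during_migration
)
```

In this sketch the non-bootstrap node never queries the cluster at all, which is one simple way to avoid the race in step 2).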

Changed in tripleo:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2016-09-29: Fix included in openstack/tripleo-heat-templates 5.0.0.0rc2

This issue was fixed in the openstack/tripleo-heat-templates 5.0.0.0rc2 release candidate.
