M/N upgrades - relax pre-upgrade check for failed actions

Bug #1628653 reported by Michele Baldessari
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Michele Baldessari

Bug Description

So I'd like to start a discussion about potentially relaxing the pre-upgrade check for failed actions. The reason is the following bug: https://bugs.launchpad.net/tripleo/+bug/1628632

Basically, on Mitaka with Ceph the following failed actions are present right after the deployment:
Failed Actions:
* openstack-gnocchi-metricd_monitor_60000 on overcloud-controller-1 'not running' (7): call=358, status=complete, exitreason='none',
    last-rc-change='Wed Sep 28 19:04:55 2016', queued=0ms, exec=0ms
* openstack-gnocchi-statsd_start_0 on overcloud-controller-1 'not running' (7): call=277, status=complete, exitreason='none',
    last-rc-change='Wed Sep 28 19:01:49 2016', queued=1ms, exec=2125ms
* openstack-gnocchi-metricd_monitor_60000 on overcloud-controller-0 'not running' (7): call=364, status=complete, exitreason='none',
    last-rc-change='Wed Sep 28 19:04:55 2016', queued=0ms, exec=0ms
* openstack-gnocchi-statsd_start_0 on overcloud-controller-0 'not running' (7): call=280, status=complete, exitreason='none',
    last-rc-change='Wed Sep 28 19:01:49 2016', queued=0ms, exec=2138ms
* openstack-gnocchi-metricd_monitor_60000 on overcloud-controller-2 'not running' (7): call=353, status=complete, exitreason='none',
    last-rc-change='Wed Sep 28 19:04:55 2016', queued=0ms, exec=0ms
* openstack-gnocchi-statsd_start_0 on overcloud-controller-2 'not running' (7): call=272, status=complete, exitreason='none',
    last-rc-change='Wed Sep 28 19:01:49 2016', queued=1ms, exec=2152ms

If the operator takes no action (maybe because they were not using gnocchi and related services), the upgrade will fail in the pre-upgrade check for failed actions.

Should we care about this situation, or do we simply need to fix the above bug and require that the operator *must* make sure there are no failed actions?

On one hand I'd prefer a clean fix where it belongs (aka gnocchi/mitaka). On the other hand, a failed action might have happened in the distant past while all resources are currently up and running, so stopping an upgrade because of it is a bit of a big hammer.

Tags: upgrade
Michele Baldessari (michele) wrote :

So I definitely think we should tweak this. I had at least one upgrade job failing because of failed resources:
Failed Actions:
* memcached_monitor_60000 on overcloud-controller-1 'not running' (7): call=41, status=complete, exitreason='none', last-rc-change='Wed Sep 28 18:58:44 2016', queued=0ms, exec=0ms
* mongod_monitor_60000 on overcloud-controller-1 'not running' (7): call=82, status=complete, exitreason='none',last-rc-change='Wed Sep 28 18:58:06 2016', queued=0ms, exec=0ms
....

But we actually had all the resources running:
[root@overcloud-controller-0 ~]# pcs status |grep -i stopped
[root@overcloud-controller-0 ~]#
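
The manual check above can be turned into a small scripted test. This is only a minimal sketch, not the check that was merged: the helper name has_stopped_resources is made up for illustration, and it assumes that a stopped resource shows up in the `pcs status` output as the word "Stopped".

```shell
# Hypothetical helper (name is an assumption, not from tripleo-heat-templates).
# Reads `pcs status` text on stdin and returns 0 if any line contains the
# whole word "stopped" (case-insensitive), non-zero otherwise.
has_stopped_resources() {
    grep -qiw 'stopped'
}

# Usage on a controller (assumes pcs is installed and the cluster is running):
#   if pcs status | has_stopped_resources; then
#       echo "ERROR: some cluster resources are stopped" >&2
#       exit 1
#   fi
```

Unlike a check on the "Failed Actions" section, this only fails when a resource is stopped at the time of the check, so historical failures such as the memcached and mongod monitor entries above do not block the upgrade.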

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/378998

Changed in tripleo:
assignee: nobody → Michele Baldessari (michele)
status: New → In Progress
Changed in tripleo:
importance: Medium → High
milestone: none → newton-rc2
Michele Baldessari (michele) wrote :

Changing the priority because there is actually a race in there.

Changed in tripleo:
importance: High → Critical
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/378998
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=32c54304f489405ea2e3ab67f5de236ab6f2e5ec
Submitter: Jenkins
Branch: master

commit 32c54304f489405ea2e3ab67f5de236ab6f2e5ec
Author: Michele Baldessari <email address hidden>
Date: Wed Sep 28 22:55:25 2016 +0200

    Relax pre-upgrade check for failed actions

    Before this change we checked the cluster for any failed actions and
    we stopped the upgrade process if there were any.
    This is likely excessive as a failed action could have happened in the
    past and the cluster is now fully functional.

    Better to check if any of the resources are in Stopped state and break
    the upgrade process if any of them are.

    We also need to restrict this check to the bootstrap node because
    otherwise the following might happen:
    1) Bootstrap node does the check, it is successful and it starts
       the full HA -> HA NG migration which *will* create failed actions
       and will start stopping resources
    2) If the check now starts on a non-bootstrap node while 1) is ongoing,
       it will find either failed actions or stopped resources so it will
       fail.

    Change-Id: Ib091f6dd8884025d2e23bf2fa700169e2dec778f
    Closes-Bug: #1628653
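
The logic described in the commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged code: the function name pre_upgrade_check and its arguments are hypothetical, and how a node learns the bootstrap node's name is deployment-specific and left to the caller.

```shell
# Hypothetical sketch (names and arguments are assumptions).
#   $1    = short hostname of the node running the check
#   $2    = name of the bootstrap node (lookup left to the caller)
#   stdin = output of `pcs status`
pre_upgrade_check() {
    this_node="$1"
    bootstrap_node="$2"

    # Non-bootstrap nodes skip the check. Once the bootstrap node has started
    # the migration, resources are being stopped on purpose, so a check that
    # runs later on another node would fail spuriously.
    if [ "$this_node" != "$bootstrap_node" ]; then
        echo "skipped: not the bootstrap node"
        return 0
    fi

    # Fail only on resources that are stopped right now; entries under
    # "Failed Actions" are deliberately not inspected.
    if grep -qiw 'stopped'; then
        echo "failed: stopped resources found"
        return 1
    fi

    echo "ok"
    return 0
}

# Usage (BOOTSTRAP_NODE is a placeholder the caller must set):
#   pcs status | pre_upgrade_check "$(hostname -s)" "$BOOTSTRAP_NODE" || exit 1
```

Passing the node names as arguments keeps the sketch independent of any particular lookup mechanism and makes it possible to exercise the logic with canned `pcs status` output.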

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 5.0.0.0rc2

This issue was fixed in the openstack/tripleo-heat-templates 5.0.0.0rc2 release candidate.
