conductor: Recreate volume attachments during a reschedule
When an instance with attached volumes fails to spawn, cleanup code
within the compute manager (_shutdown_instance called from
_build_resources) will delete the volume attachments referenced by
the bdms in Cinder. As a result we should check and if necessary
recreate these volume attachments when rescheduling an instance.
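To make the failure mode concrete, here is a heavily simplified, hypothetical sketch of that cleanup step, assuming volume_api is Nova's cinder API wrapper and the bdms are BlockDeviceMapping objects; the real code in nova.compute.manager does considerably more than this:

    def delete_build_attachments(context, volume_api, bdms):
        # Hypothetical helper approximating the cleanup done when a build
        # fails: delete the Cinder attachments referenced by the bdms.
        for bdm in bdms:
            if bdm.is_volume and bdm.attachment_id:
                # The volume is unreserved in Cinder, but the bdm still
                # records the now-deleted attachment_id, which a later
                # reschedule would otherwise try to reuse.
                volume_api.attachment_delete(context, bdm.attachment_id)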
Note that there are a few different ways to fix this bug by
making changes to the compute manager code, either by not deleting
the volume attachment on failure before rescheduling [1] or by
performing the get/create check during each build after the
reschedule [2].
The problem with *not* cleaning up the attachments is that if we don't
reschedule, then we've left orphaned "reserved" volumes in Cinder
(or we have to add special logic to tell compute when to clean up
attachments).
The problem with checking the existence of the attachment on every
new host we build on is that we'd be needlessly performing that check
for initial creates even when we never need to reschedule, unless,
again, we add special logic to avoid it (like checking whether we've
rescheduled at all).
Also, either of those options involves changes to the compute service,
which means that older computes might not have the fix.
So ultimately it seems that the best way to handle this is:
1. Only deal with this on reschedules.
2. Let the cell conductor orchestrate it since it's already dealing
with the reschedule. Then the compute logic doesn't need to change.
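As a rough illustration of what that conductor-side step could look like, below is a minimal, hedged sketch; the helper name is hypothetical, and only the attachment_get/attachment_create calls are modelled on Nova's cinder volume API (the actual change is the one linked in the review below):

    from nova import exception

    def recreate_missing_attachments(context, volume_api, instance, bdms):
        # Hypothetical helper run only from the conductor's reschedule
        # path: recreate any volume attachment the failed host deleted.
        for bdm in bdms:
            if not (bdm.is_volume and bdm.attachment_id):
                continue
            try:
                # If the attachment still exists there is nothing to do.
                volume_api.attachment_get(context, bdm.attachment_id)
            except exception.VolumeAttachmentNotFound:
                # The failed build deleted it; create a fresh attachment
                # and store its id so the next host can use it.
                attachment = volume_api.attachment_create(
                    context, bdm.volume_id, instance.uuid)
                bdm.attachment_id = attachment['id']
                bdm.save()

Because this only runs on the conductor's reschedule path, first-time builds pay no extra Cinder calls and older compute nodes need no changes.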
Reviewed:  https://review.openstack.org/612496
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3a0f26c822912dbaad7c8a1d6a3c599025afab37
Submitter: Zuul
Branch: stable/queens
commit 3a0f26c822912dbaad7c8a1d6a3c599025afab37
Author: Lee Yarwood <email address hidden>
Date: Mon Jul 30 13:41:35 2018 +0100
conductor: Recreate volume attachments during a reschedule
When an instance with attached volumes fails to spawn, cleanup code
within the compute manager (_shutdown_instance called from
_build_resources) will delete the volume attachments referenced by
the bdms in Cinder. As a result we should check and if necessary
recreate these volume attachments when rescheduling an instance.
Note that there are a few different ways to fix this bug by
making changes to the compute manager code, either by not deleting
the volume attachment on failure before rescheduling [1] or by
performing the get/create check during each build after the
reschedule [2].
The problem with *not* cleaning up the attachments is that if we don't
reschedule, then we've left orphaned "reserved" volumes in Cinder
(or we have to add special logic to tell compute when to clean up
attachments).
The problem with checking the existence of the attachment on every
new host we build on is that we'd be needlessly performing that check
for initial creates even when we never need to reschedule, unless,
again, we add special logic to avoid it (like checking whether we've
rescheduled at all).
Also, either of those options involves changes to the compute service,
which means that older computes might not have the fix.
So ultimately it seems that the best way to handle this is:
1. Only deal with this on reschedules.
2. Let the cell conductor orchestrate it since it's already dealing
with the reschedule. Then the compute logic doesn't need to change.
[1] https://review.openstack.org/#/c/587071/3/nova/compute/manager.py@1631
[2] https://review.openstack.org/#/c/587071/4/nova/compute/manager.py@1667
Conflicts:
      nova/tests/unit/conductor/test_conductor.py
NOTE(mriedem): There was a minor conflict due to not having change
I56fb1fd984f06a58c3a7e8c2596471991950433a in Queens.

Change-Id: I739c06bd02336bf720cddacb21f48e7857378487
Closes-bug: #1784353
(cherry picked from commit 41452a5c6adb8cae54eef24803f4adc468131b34)
(cherry picked from commit d3397788fe2d9267c34698d9459b0abe3f215046)