mnaser reported a weird case where an instance was found
in both cell0 (deleted there) and in cell1 (not deleted
there but in error state from a failed build). It's unclear
how this could happen besides some weird clustered rabbitmq
issue where maybe the schedule and build request to conductor
happens twice for the same instance and one picks a host and
tries to build and the other fails during scheduling and is
buried in cell0.
To avoid a split brain situation like this, we add a sanity
check in _bury_in_cell0 to make sure the instance mapping is
not pointing at a cell when we go to update it to cell0.
Similarly a check is added in the schedule_and_build_instances
flow (the code is moved to a private method to make it easier
to test).
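The guard described above can be sketched roughly as follows. This is a minimal illustration and not Nova's actual code: the `InstanceMapping` dataclass, its `cell_name` field, and the `claim_for_cell` helper are hypothetical stand-ins for the real instance mapping object and conductor methods. The assumption is that a mapping with no cell set has not been claimed yet, and that both the bury-in-cell0 path and the schedule-and-build path perform the same check before updating it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstanceMapping:
    """Hypothetical stand-in for the record mapping an instance to a cell."""
    instance_uuid: str
    cell_name: Optional[str] = None  # None: no cell has claimed the instance


def claim_for_cell(mapping: InstanceMapping, cell_name: str) -> bool:
    """Point the mapping at cell_name unless it already points at a cell.

    Returns True if the mapping was updated, False if it was left alone.
    """
    if mapping.cell_name is not None:
        # Another request already mapped this instance to a cell, so
        # overwriting it here would leave the instance in two cells.
        return False
    mapping.cell_name = cell_name
    return True


# A failed scheduling attempt buries an unmapped instance in cell0.
unmapped = InstanceMapping("uuid-1")
buried = claim_for_cell(unmapped, "cell0")

# A duplicate request must not move an instance already mapped to cell1.
already_mapped = InstanceMapping("uuid-2", cell_name="cell1")
moved = claim_for_cell(already_mapped, "cell0")
```

In this sketch the second caller simply skips the update; a real implementation would presumably also log the conflict so that operators can investigate.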
Worst case is this is unnecessary but doesn't hurt anything,
best case is this helps avoid split brain clustered rabbit
issues.
Closes-Bug: #1775934
Change-Id: I335113f0ec59516cb337d34b6fc9078ea202130f
(cherry picked from commit 5b552518e1abdc63fb33c633661e30e4b2fe775e)
(cherry picked from commit efc35b1c5293c7c6c85f8cf9fd9d8cd8de71d1d5)
Reviewed: https://review.opendev.org/756404
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c895d3e6bca562225d70e8f81255f38970f7fcda
Submitter: Zuul
Branch: stable/stein
commit c895d3e6bca562225d70e8f81255f38970f7fcda
Author: Matt Riedemann <email address hidden>
Date: Fri Sep 20 17:07:35 2019 -0400
Sanity check instance mapping during scheduling
mnaser reported a weird case where an instance was found
in both cell0 (deleted there) and in cell1 (not deleted
there but in error state from a failed build). It's unclear
how this could happen besides some weird clustered rabbitmq
issue where maybe the schedule and build request to conductor
happens twice for the same instance and one picks a host and
tries to build and the other fails during scheduling and is
buried in cell0.
To avoid a split brain situation like this, we add a sanity
check in _bury_in_cell0 to make sure the instance mapping is
not pointing at a cell when we go to update it to cell0.
Similarly a check is added in the schedule_and_build_instances
flow (the code is moved to a private method to make it easier
to test).
Worst case is this is unnecessary but doesn't hurt anything,
best case is this helps avoid split brain clustered rabbit
issues.
Closes-Bug: #1775934
Change-Id: I335113f0ec59516cb337d34b6fc9078ea202130f
(cherry picked from commit 5b552518e1abdc63fb33c633661e30e4b2fe775e)
(cherry picked from commit efc35b1c5293c7c6c85f8cf9fd9d8cd8de71d1d5)