Evacuation still doesn't respect anti-affinity rules after adding late group check

Bug #1823825 reported by Boxiang Zhu
This bug affects 3 people
Affects: OpenStack Compute (nova) — Status: In Progress — Importance: Undecided — Assigned to: Boxiang Zhu

Bug Description

Description
===========
When we evacuate multiple servers simultaneously, the anti-affinity rules of the server group are violated.

Environment
===========
Prepare three nodes and deploy a multi-node cluster with kolla-ansible.

Steps to reproduce
==================
1. we have 3 compute nodes
2. we created a server group with anti-affinity policy
=> nova server-group-create anti anti-affinity
3. we create 2 servers with this server group
=> nova boot --image cirros --flavor m1.tiny --nic net-id=ae01e08e-98af-402f-b057-0481fd80a874 --hint group=16b0d654-4cdb-45dc-b2cf-cac68e30c79e vm01
=> nova boot --image cirros --flavor m1.tiny --nic net-id=ae01e08e-98af-402f-b057-0481fd80a874 --hint group=16b0d654-4cdb-45dc-b2cf-cac68e30c79e vm02
4. we stop the nova-compute service on the nodes where the 2 VMs are running
=> nova show vm01 | grep "hypervisor"
=> nova show vm02 | grep "hypervisor"
=> docker stop nova_compute (on both compute nodes)
5. we evacuate 2 VMs at once
=> nova evacuate vm01
=> nova evacuate vm02
6. we check where 2 VMs are running now
=> nova show vm01 | grep "hypervisor"
=> nova show vm02 | grep "hypervisor"

Expected result
===============
Either:
1. both of them fail to evacuate, or
2. one of them evacuates to the last remaining node and the other fails to evacuate

Actual result
=============
1. both of them evacuate to the last remaining node, violating the anti-affinity policy

Tags: evacuate
Revision history for this message
Boxiang Zhu (bxzhu-5355) wrote :

The same bug was reported at https://bugs.launchpad.net/mos/+bug/1735407.
It has been marked 'Fix Released', and the patch is at https://review.openstack.org/#/c/525242/

However, after testing evacuation again, I found that it still violates the group's anti-affinity policy even though the late group check is now performed.

The relevant code is as follows:
instances_uuids = objects.InstanceList.get_uuids_by_host(
    context, self.host)
ins_on_host = set(instances_uuids)
members = set(group.members)
# Determine the set of instance group members on this host
# which are not the instance in question. This is used to
# determine how many other members from the same anti-affinity
# group can be on this host.
members_on_host = ins_on_host & members - set([instance.uuid])

The check only considers instances that are already on the host. If we evacuate 2 instances simultaneously, neither has landed on the target host yet, so both compute an empty 'members_on_host' and both pass the check.
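The race can be illustrated with a minimal, self-contained sketch of the same set arithmetic (the names here are illustrative, not nova's actual objects). Note that in the nova snippet, '-' binds tighter than '&', so the expression is `ins_on_host & (members - {instance.uuid})`:

```python
# Minimal illustration of the race in the late anti-affinity check.
# Neither evacuated instance has been recorded on the target host yet,
# so both concurrent checks compute an empty members_on_host.

group_members = {"uuid-vm01", "uuid-vm02"}   # anti-affinity group
instances_on_host = set()                    # target host is still empty

def late_group_check(instance_uuid):
    # Same set logic as the nova snippet above ('-' binds tighter than '&'):
    members_on_host = instances_on_host & (group_members - {instance_uuid})
    return len(members_on_host) == 0  # True means the check passes

# Both concurrent evacuations pass the check, violating anti-affinity:
assert late_group_check("uuid-vm01") is True
assert late_group_check("uuid-vm02") is True
```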

Therefore the late group check must also consider in-progress migrations when the action is a move operation.
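A sketch of that direction, assuming a list of instance UUIDs with in-progress migrations targeting this host (in the real fix this would come from querying nova's Migration records for the destination host; the function and parameter names here are hypothetical):

```python
# Sketch: extend the late group check so that in-flight migrations
# targeting this host count as group members on the host, closing the
# race between two concurrent evacuations. 'migrating_to_host' stands
# in for a query against nova's Migration records.

def late_group_check(instance_uuid, instances_on_host, migrating_to_host,
                     group_members):
    # Members either already on the host or currently migrating to it.
    on_host = set(instances_on_host) | set(migrating_to_host)
    members_on_host = on_host & (set(group_members) - {instance_uuid})
    return len(members_on_host) == 0  # True means the check passes

group = {"uuid-vm01", "uuid-vm02"}

# vm01's evacuation has already claimed the host via a migration record,
# so vm02's check now fails as expected:
assert late_group_check("uuid-vm01", [], [], group) is True
assert late_group_check("uuid-vm02", [], ["uuid-vm01"], group) is False
```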

Changed in nova:
assignee: nobody → Boxiang Zhu (bxzhu-5355)
status: New → In Progress
Revision history for this message
Boxiang Zhu (bxzhu-5355) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Boxiang Zhu (zhu.boxiang@99cloud.net) on branch: master
Review: https://review.openstack.org/649953

Revision history for this message
Boxiang Zhu (bxzhu-5355) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Boxiang Zhu <zhu.boxiang@99cloud.net>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/649963

Revision history for this message
Walid Moghrabi (walid-fdj) wrote :

Same issue here ...

We have a number of instances grouped in a server group with a soft-anti-affinity policy.
Evacuating a host with "nova host-evacuate-live <host>" live-migrates all the instances to the same node instead of distributing them per the anti-affinity policy (we do have available hosts to receive them, just to be precise).

Moreover, "nova evacuate" as a whole is broken, because all my instances are moved to a single host instead of being equally distributed over the other available hosts in the AZ, whether or not a server group exists.
This is bad: it leads to over-committed receiving hosts plus colocated instances that should not be colocated.
