race condition on port binding vs instance being resumed for live-migrations
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| OpenStack Compute (nova) | Fix Released | High | sean mooney | |
| Stein | New | Undecided | Unassigned | |
| Train | Fix Released | Undecided | Unassigned | |
| Ussuri | Fix Released | Undecided | Unassigned | |
| Victoria | Fix Released | Undecided | Unassigned | |
| neutron | Fix Released | Undecided | Unassigned | |
Bug Description
This bug is split out from the discussion in https:/
The comment https:/
there details the flow on a Train deployment using neutron 15.1.0 (controller), neutron 15.3.0 (compute) and nova 20.4.0.
There is a race condition: nova's live migration waits for neutron to send the network-vif-plugged event, but by the time nova receives that event the migration can complete faster than the OVS L2 agent can finish wiring the port on the destination compute node.
As a result, the RARP frames sent out to update the switches' forwarding tables are lost, leaving the instance completely inaccessible after a live migration unless those RARP frames are re-sent or the instance itself initiates egress traffic.
See Sean's comments below for the view from the nova side. The correct behaviour would be for the port to be ready for use when nova gets the external event, but maybe that is not possible from the neutron side; again, see the comments in the other bug.
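To make the ordering concrete, here is a minimal, self-contained Python sketch of the wait-for-vif-plugged pattern described above. This is not the actual nova code; every name in it (on_network_vif_plugged, live_migrate, and so on) is a hypothetical placeholder for illustration only.

```python
# Hedged sketch of the race described above, NOT nova's implementation.
import threading

vif_plugged = threading.Event()

def on_network_vif_plugged(port_id: str) -> None:
    """Stand-in for nova receiving neutron's network-vif-plugged event.

    The race: neutron can emit this event when the port binding is
    updated, before the destination OVS L2 agent has finished wiring
    the port's flows.
    """
    print(f"received network-vif-plugged for {port_id}")
    vif_plugged.set()

def live_migrate(timeout: float = 300.0) -> None:
    """Stand-in for the nova-side wait when
    live_migration_wait_for_vif_plug is enabled."""
    if not vif_plugged.wait(timeout):
        raise TimeoutError("timed out waiting for network-vif-plugged")
    # Nova now assumes the destination port is usable and the guest is
    # resumed. QEMU emits RARP frames at this point; if the L2 agent has
    # not actually finished plumbing the port, those frames are lost and
    # the physical switches never learn the instance's new location.
    print("guest resumed on destination")

# Simulate neutron delivering the event shortly after the wait starts.
threading.Timer(0.1, on_network_vif_plugged, args=("port-uuid",)).start()
live_migrate()
```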
Changed in neutron:
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
Changed in neutron:
status: Incomplete → In Progress
tags: added: neutron-proactive-backport-potential
Changed in neutron:
status: In Progress → Fix Released
tags: removed: neutron-proactive-backport-potential
Changed in nova:
status: In Progress → Fix Committed
status: Fix Committed → Fix Released
I started digesting the linked bug and the material referred from there, and I find it surprisingly complex. Could we re-state the piece of the problem you want to separate out here, to help somebody take this bug? I don't want to be dense, but I usually find that re-stating a problem in a shorter, simpler way helps solve it.
Is this problem present on master?
Is this dependent on nova using the multiple port bindings feature? (I guess yes, because the nova side of that was merged in Rocky.)
Is this specific to who plugs the port on the destination host: libvirt and/or os-vif? If yes, which one?
Could we have steps to reproduce this? I get that this is a race, so the reproduction probably won't be 100% reliable. I also get that firewall_driver=iptables_hybrid and live_migration_wait_for_vif_plug=true (the default value) are needed; a hedged config sketch follows below. Is there anything else needed to reproduce this bug?
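For reference, here is my assumption of the minimal configuration needed to hit this; the file names and option groups below reflect a typical deployment layout and may differ in yours:

```ini
# nova.conf on the compute nodes (assumed [compute] group;
# true is the default per the comment above)
[compute]
live_migration_wait_for_vif_plug = true

# openvswitch_agent.ini (or ml2_conf.ini) on the compute nodes
[securitygroup]
firewall_driver = iptables_hybrid
```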
For what it's worth, these are the current triggers for neutron to send os-server-external-events to nova: https://opendev.org/openstack/neutron/src/commit/cbaa328f2ba80ba0af33f43887a040cdd08e508b/neutron/notifiers/nova.py#L102-L103
https:/
I believe the first (and currently only) notification neutron sends is needed and used, so we should not change whether or when that is sent. Is this understanding correct?
Do you believe there should be a 2nd notification sent from neutron to nova? If yes, at what time (triggered by what) should it be sent?
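For context on what such a notification looks like on the wire, here is a hedged sketch of the neutron-to-nova direction using python-novaclient's server_external_events API. This is not neutron's notifier code; send_vif_plugged is a hypothetical helper, and the session argument is assumed to be a pre-built keystoneauth1 session.

```python
# Hedged sketch: the shape of a network-vif-plugged external event sent
# from neutron to nova, NOT the actual neutron notifier implementation.
from novaclient import client as nova_client

def send_vif_plugged(session, instance_uuid: str, port_id: str) -> None:
    nova = nova_client.Client("2.1", session=session)
    # One event per port; the real notifier batches these. A hypothetical
    # second notification, as asked about above, would be another event of
    # this shape emitted only once the L2 agent reports the wiring as done.
    nova.server_external_events.create([{
        "server_uuid": instance_uuid,
        "name": "network-vif-plugged",
        "tag": port_id,
    }])
```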