Bug #1901707 “race condition on port binding vs instance being r...” : Series stein : Bugs : OpenStack Compute (nova)

Bence Romsics (bence-romsics) wrote on 2020-11-03:

#1

I started digesting the linked bug and the material referred to from there, and I find it surprisingly complex. Could we re-state the piece of the problem you want to separate here, to help somebody take this bug? I don't want to be dense, but I usually find that re-stating a problem in a shorter, simpler way helps in solving it.

Is this problem present on master?

Is this dependent on nova using the multiple bindings feature? (I guess yes, because the nova side of that was merged in rocky.)

Is this specific to who plugs the port on the destination host: libvirt and/or os-vif? If yes, which one?

Could we have steps to reproduce this? I get that this is a race, so the reproduction probably won't be 100%. I also get that firewall_driver=iptables_hybrid and live_migration_wait_for_vif_plug=true (default value) are needed. Is there anything else needed to reproduce this bug?

For what it's worth these are the current triggers for neutron to send os-server-external-events to nova:
https://opendev.org/openstack/neutron/src/commit/cbaa328f2ba80ba0af33f43887a040cdd08e508b/neutron/notifiers/nova.py#L102-L103
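
For readers unfamiliar with the mechanism: the notification is a POST to Nova's os-server-external-events API. Below is a minimal sketch of the request body for a single port, assuming the documented event fields (name, server_uuid, tag, status). The helper name and both UUIDs are made up for illustration and do not come from this bug.

```python
import json


def build_vif_plugged_event(server_uuid, port_id, status="completed"):
    """Build a body for POST /os-server-external-events (illustrative helper)."""
    return {
        "events": [
            {
                "name": "network-vif-plugged",
                "server_uuid": server_uuid,
                "tag": port_id,  # for network-vif-* events the tag carries the port ID
                "status": status,
            }
        ]
    }


# Both UUIDs are placeholders, not values taken from the bug report.
body = build_vif_plugged_event(
    "3df201cf-2451-44f2-8d25-a4ca826fc1f3",
    "ce531f90-199f-48c0-816c-13e38010b442",
)
print(json.dumps(body, indent=2))
```

Note that the body identifies the instance and the port but says nothing about which host or agent caused the event.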

I believe the first (and currently only) notification neutron sends is needed and used, so we should not change whether or when that is sent. Is this understanding correct?

Do you believe there should be a 2nd notification sent from neutron to nova? If yes, at what time (triggered by what) should it be sent?

Changed in neutron:
status:	New → Incomplete

Tobias Urdin (tobias-urdin) wrote on 2020-11-03:

#2

I don't have a way to test this with any version other than Train right now. This was not an issue on CentOS 7 with Train, but when we moved to CentOS 8 with Train it started happening.

What I understand from Sean's input is that the behavior has changed in Neutron. Before, Neutron would allow two ports to be active, so the new port on the compute node would already be ready; now, with the multiple bindings feature, that is not the case anymore.

It's the plugging in openvswitch that is the issue, i.e. the port managed by neutron's openvswitch-agent.

IMO there should be an event sent to Nova when the port is fully ready, so that Nova could do the live migration after that. But given that the behavior has changed in Neutron, maybe it's no longer possible or allowed to have two ports configured and active.

I can reproduce this 100% of the time with the versions mentioned. The other bug is primarily about a different problem that occurs when the openvswitch firewall driver is used; this one occurs when iptables_hybrid is used, but the firewall driver doesn't seem to be the cause of the issue either way.

I don't have a good way to go about it: if Sean's comment is right that this is a behavior change in Neutron that may not be possible to work around, there isn't much Nova can do. This pretty much breaks the whole purpose of live migration, since we need to carry a custom patch in Nova that makes the VM send out new RARP frames AFTER the live migration (the data plane is therefore dependent on the timing of the control plane running the post_live_migration action in Nova), so we take a hit of some extra second(s) of downtime.

Rodolfo Alonso (rodolfo-alonso-hernandez) on 2020-12-11

Changed in neutron:
assignee:	nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote on 2020-12-11:

#3

Hi:

I detected this problem too. The main problem we have in Neutron is that the "network-vif-plugged" event is sent in many situations: when a port is provisioned by the DHCP agent, when the port is bound by the L2 agent or when the port passes from status DOWN to ACTIVE.

For example, when a port is detected by an OVS agent, the agent binds it to this host and then sends to the server (via RPC) an "update_device_list". The Neutron server receives this list and updates the port status, calling "update_device_up". That calls "update_port_status_to_active" [1], which triggers the port provisioning. This is caught by [2], which updates the port status to ACTIVE. That triggers the Nova notification.
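
The chain above can be pictured with a toy model. This is not Neutron code: the class, the entity names ("L2", "DHCP") and the callback are assumptions made only for illustration. The port goes ACTIVE once every component that is blocking its provisioning has reported completion, and that DOWN to ACTIVE transition is what fires the Nova notification.

```python
class ToyPort:
    """Toy stand-in for a Neutron port with provisioning blocks (hypothetical)."""

    def __init__(self, port_id, blocking_entities, notify_nova):
        self.port_id = port_id
        self.status = "DOWN"
        self._pending = set(blocking_entities)  # e.g. {"L2", "DHCP"}
        self._notify_nova = notify_nova

    def provisioning_complete(self, entity):
        """Called when one component (L2 agent, DHCP agent, ...) is done."""
        self._pending.discard(entity)
        if not self._pending and self.status != "ACTIVE":
            self.status = "ACTIVE"
            # The DOWN -> ACTIVE transition triggers the Nova notification.
            self._notify_nova("network-vif-plugged", self.port_id)


sent = []
port = ToyPort("port-1", {"L2", "DHCP"}, lambda name, pid: sent.append((name, pid)))
port.provisioning_complete("DHCP")  # still DOWN, the L2 agent is pending
port.provisioning_complete("L2")    # now ACTIVE, one event is sent
print(port.status, sent)
```

In this model any component that completes last is the one that "sends" the event, which is why the origin of the notification is hard to pin down.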

When the port is live migrated, since [3] (live migration with multiple port bindings), the port can have two port binding definitions: the source host (SOURCE) and the destination host (DEST).

The SOURCE is, until the migration finishes, active. In the profile (a dictionary field), a new key is added: "migration_to", with the name of DEST host.

The DEST is disabled. It is activated when the SOURCE binding is deleted from the port.
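
A small sketch of the two-binding state described above, as a plain data model. The dataclass, the helper functions and the host names are hypothetical; the "migration_to" profile key is spelled as in this comment and may not match the exact key used in the code.

```python
from dataclasses import dataclass, field


@dataclass
class Binding:
    """Hypothetical, simplified view of one port binding."""
    host: str
    active: bool
    profile: dict = field(default_factory=dict)


def start_migration(bindings, source, dest):
    """SOURCE stays active and records the target; DEST is added inactive."""
    bindings[source].profile["migration_to"] = dest
    bindings[dest] = Binding(host=dest, active=False)


def delete_source_binding(bindings, source, dest):
    """Deleting the SOURCE binding is what activates DEST."""
    del bindings[source]
    bindings[dest].active = True


bindings = {"src-host": Binding("src-host", active=True)}
start_migration(bindings, "src-host", "dst-host")
during = {host: b.active for host, b in bindings.items()}
delete_source_binding(bindings, "src-host", "dst-host")
after = {host: b.active for host, b in bindings.items()}
print("during migration:", during)
print("after migration:", after)
```

At no point in this model are both bindings active at the same time, which is the behavior change discussed earlier in the thread.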

A) EXPLANATION OF THE CONNECTIVITY BREAKDOWN
Now, the DEST port is bound to the host when the DEST binding is enabled (as defined in [3]). The problem is that this moment is too late. Nova has already set the ofport of the port (in case of hybrid_plugin=False) because it has unpaused the VM in DEST. That means that between the moment the VM is unpaused and the moment the OVS agent binds the port to the host (sets the OpenFlow rules in OVS), there is

B) EXPLANATION OF THE EVENTS RACE CONDITION
As commented, we are sending the "network-vif-plugged" event on many occasions. But this Nova event, at least during the live migration, is meant to be sent only when the DEST port is bound to the host. That means when the OVS agent in DEST creates the OpenFlow rules and leaves the port ready to be used. **This happens now by pure chance**: when the port is migrated, the port bindings are first deleted and then updated [4]. That means the port is set to DOWN and then activated again (--> that triggers the first "network-vif-plugged" event). Nova reads this event and unpauses the VM in DEST. This is just the opposite of what it should be.
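
To make the ordering problem concrete, here is a toy timeline. All timestamps and origin labels are invented for illustration; nothing here is measured data. The sketch only shows that acting on the first "network-vif-plugged" event, whatever its origin, can unpause the VM before the DEST OVS agent has installed its flows.

```python
def first_plugged_event(timeline):
    """Return (time, origin) of the earliest 'network-vif-plugged' entry, or None."""
    for t, name, origin in sorted(timeline):
        if name == "network-vif-plugged":
            return t, origin
    return None


DEST_READY = "DEST OVS agent installed flows"

# (time in seconds, event name, what caused it) -- all values are made up.
timeline = [
    (1.0, "network-vif-unplugged", "bindings cleared"),
    (1.2, "network-vif-plugged", "bindings set again"),
    (3.5, "network-vif-plugged", DEST_READY),
]

t_unpause, origin = first_plugged_event(timeline)
t_ready = next(t for t, _name, o in timeline if o == DEST_READY)
gap = t_ready - t_unpause
print(f"unpaused at t={t_unpause} because of: {origin}")
print(f"DEST flows ready at t={t_ready}; window without flows: {gap:.1f}s")
```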

There are also other triggers that can send the "network-vif-plugged" event, in any order.
1) When the port binding is updated (with the two hosts, SOURCE and DEST), the port is provisioned again by the DHCP agent. This can send this event.
2) When the port binding is updated (first cleared and then set again), the SOURCE OVS agent can read both changes in different polling cycles. That will first unbind the port, sending an update to the server, which will send a "network-vif-unplugged" event. Then the port is bound again, which will trigger a "network-vif-plugged" event.

During the live-migration:
1) We need to catch those events not generated by the OVS SOURCE agent and dismiss them.
2) We need to bind the port to SOURCE **before** the port activation (please read B). Nova is activating the port because other processes are sending the plugged event, but the SOURCE binding process should be the only one sending it.
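
The first point can be sketched as a filter. This is a sketch under a strong assumption: each event is tagged with the host whose agent produced it, via a hypothetical "host" field that the external event body does not carry today. Which host counts as the expected one is left as a parameter, and the host and port names are placeholders.

```python
def filter_events(events, port_id, expected_host):
    """Split events into (accepted, dismissed) for one port during a migration."""
    accepted, dismissed = [], []
    for ev in events:
        wanted = (
            ev["name"] == "network-vif-plugged"
            and ev["tag"] == port_id
            and ev.get("host") == expected_host  # hypothetical field
        )
        (accepted if wanted else dismissed).append(ev)
    return accepted, dismissed


events = [
    {"name": "network-vif-plugged", "tag": "port-1", "host": "host-a"},
    {"name": "network-vif-unplugged", "tag": "port-1", "host": "host-a"},
    {"name": "network-vif-plugged", "tag": "port-1", "host": "host-b"},
    {"name": "network-vif-plugged", "tag": "port-1"},  # e.g. DHCP provisioning, no host
]
accepted, dismissed = filter_events(events, "port-1", expected_host="host-b")
print(len(accepted), "accepted,", len(dismissed), "dismissed")
```

Only the event attributed to the expected host is acted on; everything else is dropped instead of unpausing the VM.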

I'm pushing https://review...