Live migration doesn't work with ml2/OVN. After spending some time debugging this issue, I found that it is potentially more complicated and not related to OVN itself.
Here is the full story behind live migration not working with OVN on the latest upstream master.
To speed up live migration, double port binding was introduced in Neutron [1] and Nova [2], implementing this blueprint [3]. In short, it creates two bindings (ACTIVE on the source host and INACTIVE on the destination host) to verify that the port can be bound on the destination host before the live migration is started, so that no time is wasted in case of a rollback.
This mechanism became the default in Stein [4]. So before the actual qemu live migration, Neutron is expected to send 'network-vif-plugged' to Nova, and only then is the migration run.
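As an illustration, the flow Nova drives against the Neutron port bindings API looks roughly like this (a minimal sketch; the endpoint, token handling and function names are assumptions, only the /bindings URLs come from the API added for this blueprint):
------------------------------------------------------------
import requests

NEUTRON = "http://controller:9696/v2.0"   # assumed Neutron endpoint
HEADERS = {"X-Auth-Token": "<token>"}     # assumed auth token


def pre_live_migration(port_id, dest_host):
    # Create an INACTIVE binding on the destination host; the ACTIVE
    # binding on the source host stays in place.
    resp = requests.post(
        f"{NEUTRON}/ports/{port_id}/bindings",
        json={"binding": {"host": dest_host, "vnic_type": "normal"}},
        headers=HEADERS)
    resp.raise_for_status()
    # Nova then waits for the 'network-vif-plugged' external event from
    # Neutron before starting the actual qemu migration.


def post_live_migration(port_id, dest_host):
    # After the migration succeeded, activate the destination binding,
    # which also deactivates the old source binding.
    resp = requests.put(
        f"{NEUTRON}/ports/{port_id}/bindings/{dest_host}/activate",
        headers=HEADERS)
    resp.raise_for_status()
------------------------------------------------------------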
With OVN this mechanism doesn't work. The 'network-vif-plugged' notification is never sent, so the live migration is stuck at the very beginning.
Let's check how those notifications are sent. On every change of the 'status' field of a neutron.ports row (via a SQLAlchemy event) [5], function [6] is executed, and it is responsible for sending the 'network-vif-unplugged' and 'network-vif-plugged' notifications.
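That mechanism boils down to a SQLAlchemy attribute event. A simplified, stand-alone illustration of the pattern (not Neutron's actual code; the Port model below is just a stand-in for the real one):
------------------------------------------------------------
from sqlalchemy import Column, String, event
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Port(Base):
    # Stand-in for the ORM model mapped to the neutron.ports table.
    __tablename__ = 'ports'
    id = Column(String(36), primary_key=True)
    status = Column(String(16))


def notify_nova_on_status_change(target, value, oldvalue, initiator):
    # Runs on every assignment to Port.status; the real notifier [6] decides
    # here whether to send 'network-vif-plugged' or 'network-vif-unplugged'.
    if value != oldvalue:
        print(f'port {target.id}: {oldvalue} -> {value}, would notify nova')


event.listen(Port.status, 'set', notify_nova_on_status_change)
------------------------------------------------------------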
During the pre_live_migration tasks, the two bindings and their binding levels are created. At the end of this process, commit_port_binding() is executed [7]. At this point the Neutron port status in the DB is DOWN.
I found that at the end of commit_port_binding() [8], after the neutron_lib.callbacks.registry notification is sent, the port status moves to UP. For ml2/OVN it stays DOWN. This is the first difference I found between ml2/ovs and ml2/OVN.
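For reference, the callback mechanism involved here is the neutron_lib callbacks registry; a minimal sketch of how a handler subscribes to port AFTER_UPDATE events (simplified, not the ml2 code itself):
------------------------------------------------------------
from neutron_lib.callbacks import events, registry, resources


def handle_port_after_update(resource, event, trigger, **kwargs):
    # In ml2/ovs this role is played by ovo_rpc._ObjectChangeHandler.handle_event,
    # which pushes the updated Port OVO to the agents over RPC (see step 2 below).
    pass


registry.subscribe(handle_port_after_update, resources.PORT, events.AFTER_UPDATE)
------------------------------------------------------------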
After a bit of digging I figured out how 'network-vif-plugged' is triggered in ml2/ovs.
Let's see how this is done.
1. In the list of registered callbacks in ml2/ovs [8] there is a callback from the class ovo_rpc._ObjectChangeHandler [9], and at the end of commit_port_binding() this callback is invoked:
-------------------------------------------------------------
neutron.plugins.ml2.ovo_rpc._ObjectChangeHandler.handle_event
-------------------------------------------------------------
2. It is responsible for pushing new port object revisions to the agents, for example:
----------------------------------------------------------------------------
Jun 24 10:01:01 test-migrate-1 neutron-server[3685]: DEBUG neutron.api.rpc.handlers.resources_rpc [None req-1430f349-d644-4d33-8833-90fad0124dcd service neutron] Pushing event updated for resources: {'Port': ['ID=3704a567-ef4c-4f6d-9557-a1191de07c4a,revision_number=10']} {{(pid=3697) push /opt/stack/neutron/neutron/api/rpc/handlers/resources_rpc.py:243}}
----------------------------------------------------------------------------
3. The OVS agent consumes it and sends an RPC back to the Neutron server reporting that the port is actually UP (on the source node!):
------------------------------------------------------------------------------------------------------------
Jun 24 10:01:01 test-migrate-1 neutron-openvswitch-agent[18660]: DEBUG neutron.agent.resource_cache [None req-1430f349-d644-4d33-8833-90fad0124dcd service neutron] Resource Port 3704a567-ef4c-4f6d-9557-a1191de07c4a updated (revision_number 8->10). Old fields: {'status': u'ACTIVE', 'bindings': [PortBinding(host='test-migrate-1',port_id=3704a567-ef4c-4f6d-9557-a1191de07c4a,profile={},status='INACTIVE',vif_details={"port_filter": true, "bridge_name": "br-int", "datapath_type": "system", "ovs_hybrid_plug": false},vif_type='ovs',vnic_type='normal'), PortBinding(host='test-migrate-2',port_id=3704a567-ef4c-4f6d-9557-a1191de07c4a,profile={"migrating_to": "test-migrate-1"},status='ACTIVE',vif_details={"port_filter": true, "bridge_name": "br-int", "datapath_type": "system", "ovs_hybrid_plug": false},vif_type='ovs',vnic_type='normal')], 'binding_levels': [PortBindingLevel(driver='openvswitch',host='test-migrate-1',level=0,port_id=3704a567-ef4c-4f6d-9557-a1191de07c4a,segment=NetworkSegment(c6866834-4577-497f-a6c8-ff9724a82e59),segment_id=c6866834-4577-497f-a6c8-ff9724a82e59), PortBindingLevel(driver='openvswitch',host='test-migrate-2',level=0,port_id=3704a567-ef4c-4f6d-9557-a1191de07c4a,segment=NetworkSegment(c6866834-4577-497f-a6c8-ff9724a82e59),segment_id=c6866834-4577-497f-a6c8-ff9724a82e59)]} New fields: {'status': u'DOWN', 'bindings': [PortBinding(host='test-migrate-1',port_id=3704a567-ef4c-4f6d-9557-a1191de07c4a,profile={},status='ACTIVE',vif_details={"port_filter": true, "bridge_name": "br-int", "datapath_type": "system", "ovs_hybrid_plug": false},vif_type='ovs',vnic_type='normal'), PortBinding(host='test-migrate-2',port_id=3704a567-ef4c-4f6d-9557-a1191de07c4a,profile={"migrating_to": "test-migrate-1"},status='INACTIVE',vif_details=None,vif_type='unbound',vnic_type='normal')], 'binding_levels': [PortBindingLevel(driver='openvswitch',host='test-migrate-1',level=0,port_id=3704a567-ef4c-4f6d-9557-a1191de07c4a,segment=NetworkSegment(c6866834-4577-497f-a6c8-ff9724a82e59),segment_id=c6866834-4577-497f-a6c8-ff9724a82e59)]} {{(pi
Jun 24 10:01:01 test-migrate-1 neutron-openvswitch-agent[18660]: d=18660) record_resource_update /opt/stack/neutron/neutron/agent/resource_cache.py:186}}
...
Jun 24 10:01:02 test-migrate-1 neutron-openvswitch-agent[18660]: DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-9daaf112-57f4-49bb-8390-4b65a5c5e674 None None] Setting status for 3704a567-ef4c-4f6d-9557-a1191de07c4a to UP {{(pid=18660) _bind_devices /opt/stack/neutron/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1088}}
------------------------------------------------------------------------------------------------------------
4. The Neutron server consumes it:
------------------------------------------------------------------------------------------------------------
Jun 24 10:01:02 test-migrate-1 neutron-server[3685]: DEBUG neutron.plugins.ml2.rpc [None req-62e69669-fa7e-4f70-9e38-38cb3e2c30a7 None None] Device 3704a567-ef4c-4f6d-9557-a1191de07c4a up at agent ovs-agent-test-migrate-1 {{(pid=3698) update_device_up /opt/stack/neutron/neutron/plugins/ml2/rpc.py:269}}
...
Jun 24 10:01:02 test-migrate-1 neutron-server[3685]: DEBUG neutron.db.provisioning_blocks [None req-62e69669-fa7e-4f70-9e38-38cb3e2c30a7 None None] Provisioning for port 3704a567-ef4c-4f6d-9557-a1191de07c4a completed by entity L2. {{(pid=3698) provisioning_complete /opt/stack/neutron/neutron/db/provisioning_blocks.py:133}}
...
Jun 24 10:01:02 test-migrate-1 neutron-server[3685]: DEBUG neutron.db.provisioning_blocks [None req-62e69669-fa7e-4f70-9e38-38cb3e2c30a7 None None] Provisioning complete for port 3704a567-ef4c-4f6d-9557-a1191de07c4a triggered by entity L2. {{(pid=3698) provisioning_complete /opt/stack/neutron/neutron/db/provisioning_blocks.py:140}}
------------------------------------------------------------------------------------------------------------
and then generates the internal event PROVISIONING_COMPLETE [10]. This event is consumed by [11], and port_provisioned() updates the port status in the DB to UP [12]. At the end, the 'network-vif-plugged' notification is emitted and Nova continues the migration.
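The provisioning-blocks pattern behind steps 3 and 4 can be summarized like this (a simplified sketch of how the module from [10]-[12] is used, not the exact ml2 code):
------------------------------------------------------------
from neutron.db import provisioning_blocks
from neutron_lib.callbacks import resources


def add_l2_block(context, port_id):
    # Added around commit_port_binding(): the port will not be moved to
    # ACTIVE/UP until the L2 entity reports that provisioning is complete.
    provisioning_blocks.add_provisioning_component(
        context, port_id, resources.PORT,
        provisioning_blocks.L2_AGENT_ENTITY)


def l2_agent_reported_device_up(context, port_id):
    # Called when the agent reports the device UP over RPC (step 4 above).
    # Once all blocks for the port are cleared, PROVISIONING_COMPLETE is
    # published [10], port_provisioned() moves the port to UP [12], and the
    # status change triggers 'network-vif-plugged' towards Nova [6].
    provisioning_blocks.provisioning_complete(
        context, port_id, resources.PORT,
        provisioning_blocks.L2_AGENT_ENTITY)
------------------------------------------------------------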
In ml2/OVN we don't have agents, so we don't use ovo_rpc. That's why live migration doesn't work for ml2/OVN.
It looks like a general bug somewhere between Nova and Neutron. Neutron shouldn't send the 'network-vif-plugged' notification during configuration of the double binding from the source host, as it does now (see step 3 above).
Maybe we could consider using a more precise notification name, like 'neutron-vif-inactive-binding-set'?
Or maybe Nova could watch for the inactive binding being created [13] and then start the live migration, instead of waiting for a Neutron notification?
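As a rough sketch of that idea (the endpoint, token and function name are made up; the bindings list call is the one described in [13]):
------------------------------------------------------------
import time

import requests

NEUTRON = "http://controller:9696/v2.0"   # assumed Neutron endpoint
HEADERS = {"X-Auth-Token": "<token>"}     # assumed auth token


def wait_for_inactive_binding(port_id, dest_host, timeout=60):
    # Poll GET /ports/{port_id}/bindings [13] until a binding for the
    # destination host shows up as INACTIVE; Nova could then start the
    # live migration without waiting for 'network-vif-plugged'.
    deadline = time.time() + timeout
    while time.time() < deadline:
        resp = requests.get(f"{NEUTRON}/ports/{port_id}/bindings",
                            headers=HEADERS)
        resp.raise_for_status()
        for binding in resp.json().get("bindings", []):
            if binding["host"] == dest_host and binding["status"] == "INACTIVE":
                return binding
        time.sleep(2)
    raise RuntimeError(f"no inactive binding on {dest_host} after {timeout}s")
------------------------------------------------------------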
Thanks,
Maciej
[1] https://review.opendev.org/#/q/topic:bp/live-migration-portbinding+(status:open+OR+status:merged)
[2] https://review.opendev.org/#/c/558001/
[3] https://blueprints.launchpad.net/nova/+spec/neutron-new-port-binding-api
[4] https://review.opendev.org/#/c/635360/
[5] https://github.com/openstack/neutron/blob/0e2508c8b1a3706a2ade0517f5c5359af2f8bc78/neutron/db/db_base_plugin_v2.py#L173
[6] https://github.com/openstack/neutron/blob/0e2508c8b1a3706a2ade0517f5c5359af2f8bc78/neutron/notifiers/nova.py#L182
[7] https://github.com/openstack/neutron/blob/0e2508c8b1a3706a2ade0517f5c5359af2f8bc78/neutron/plugins/ml2/plugin.py#L505
[8] https://github.com/openstack/neutron/blob/0e2508c8b1a3706a2ade0517f5c5359af2f8bc78/neutron/plugins/ml2/plugin.py#L713
[9] https://github.com/openstack/neutron/blob/0e2508c8b1a3706a2ade0517f5c5359af2f8bc78/neutron/plugins/ml2/ovo_rpc.py#L51
[10] https://github.com/openstack/neutron/blob/0e2508c8b1a3706a2ade0517f5c5359af2f8bc78/neutron/db/provisioning_blocks.py#L140
[11] https://github.com/openstack/neutron/blob/0e2508c8b1a3706a2ade0517f5c5359af2f8bc78/neutron/plugins/ml2/plugin.py#L285
[12] https://github.com/openstack/neutron/blob/0e2508c8b1a3706a2ade0517f5c5359af2f8bc78/neutron/plugins/ml2/plugin.py#L316
[13] https://specs.openstack.org/openstack/neutron-specs/specs/backlog/pike/portbinding_information_for_nova.html#list-bindings
1) I am very glad multiple port binding is being considered for use by networking-ovn.
2) It is true that during implementation the existence of agents was assumed. That wasn't an oversight. A careful read of the spec shows that the functionality was specified for an agent-based implementation. To confirm this, just look at the state diagram here: https://specs.openstack.org/openstack/neutron-specs/specs/backlog/pike/portbinding_information_for_nova.html#activate-rpc-port-update-delete. In that sense, instead of a bug, this should be seen as "Multiple Port Bindings Phase II".
3) From the Neutron implementation perspective, it was never assumed that Nova was going to use the 'network-vif-plugged' event to move to the actual migration stage. That is a decision that was made on the Nova side. In Neutron the only assumption that was made (according to the spec) was that Nova would request the creation of an inactive binding and that upon the completion of it, Nova would proceed with the migration. I agree that the chosen way seems odd.
4) It may be the case that the Neutron agent on the source host is sending a port UP message. That is more an oversight than anything else. During the implementation of multiple port binding, the strategy was to be as unintrusive as possible with what was already in place. This strategy was adopted given that port binding is such a fundamental Neutron functionality. Having said that, I don't think the agent sending port UP is the major issue in this bug report. While we may optimize it, it is irrelevant from the point of view of OVN, since OVN doesn't have agents. The core of the issue is how to communicate with Nova properly once the inactive binding has been created, so the migration can continue.