The race exists for iptables_hybrid based deployments when live_migration_wait_for_vif_plug=true (the default value) as well.
The source compute node performs the pre_live_migration step and waits for the network-vif-plugged event, which it receives; the live migration then starts and the VM is resumed on the destination, but the destination compute node only binds and wires up the port AFTER the instance has already been resumed.
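For reference, a minimal sketch (file path and section name are assumptions for a typical deployment) to check whether live_migration_wait_for_vif_plug is explicitly set in nova.conf on the compute node; when the option is absent, it falls back to its default of true:

# Hedged sketch: read nova.conf and report the effective value of
# live_migration_wait_for_vif_plug (assumed to live in the [compute] group).
import configparser

cfg = configparser.ConfigParser(strict=False)
cfg.read("/etc/nova/nova.conf")
print(cfg.get("compute", "live_migration_wait_for_vif_plug",
              fallback="true (option not set, default assumed)"))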
To illustrate the race:
* The pre live migration is complete when the source compute node gets the network-vif-plugged event:
2020-10-23 10:39:31.634 3460854 INFO nova.compute.manager [-] [instance: 9ef8fcee-c1cf-4d2e-8b14-2b43c31044f6] Took 2.83 seconds for pre_live_migration on destination host compute-02.
2020-10-23 10:39:32.200 3460854 DEBUG nova.compute.manager [req-7f2c3034-c0b4-4e6b-9209-280638dcd2e1 6283ca84a2ff4cc099fcfd8e50550910 3a28d0f6b65a44c2aa1bbffbfa8bb2ea - default default] [instance: 9ef8fcee-c1cf-4d2e-8b14-2b43c31044f6] Received event network-vif-plugged-f83f20ad-feff-4369-a752-a81964bcfd52 external_instance_event /usr/lib/python3.6/site-packages/nova/compute/manager.py:9273
* Then the instance is resumed on the destination compute node:
2020-10-23 10:39:35.467 2082170 INFO nova.compute.manager [req-5c20ab33-21eb-48b8-950f-85807ebc1559 - - - - -] [instance: 9ef8fcee-c1cf-4d2e-8b14-2b43c31044f6] VM Resumed (Lifecycle Event)
* But the port is not actually updated and wired up on the destination compute node until after that:
2020-10-23 10:39:37.504 2096718 DEBUG neutron.agent.resource_cache [req-3b8c2e3f-4e62-446b-b9db-f6bf12012ab0 f1fc63f1306549a0b1aba80875aac683 3a28d0f6b65a44c2aa1bbffbfa8bb2ea - - -] Resource Port f83f20ad-feff-4369-a752-a81964bcfd52 updated <a lot of port binding data here>
The OVS agent then goes through another round of processing the port at
2020-10-23 10:39:37.859 2096718 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-d0d6ab14-b56e-4d91-9030-7f422465f628 - - - - -] Port f83f20ad-feff-4369-a752-a81964bcfd52 updated
and the "completed" line does not appear until
2020-10-23 10:39:38.572 2096718 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-d0d6ab14-b56e-4d91-9030-7f422465f628 - - - - -] Configuration for devices up ['f83f20ad-feff-4369-a752-a81964bcfd52'] and devices down [] completed.
This means there is a race window of roughly 3 seconds between when the instance is resumed and when the port is actually bound and wired up on the destination.
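A quick sanity check of that window, computed from the timestamps quoted in the log lines above (a minimal sketch, nothing deployment-specific):

# Gap between the "VM Resumed" lifecycle event and the OVS agent's
# "Configuration for devices up ... completed" line, both quoted above.
from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S.%f"
resumed = datetime.strptime("2020-10-23 10:39:35.467", fmt)
devices_up_done = datetime.strptime("2020-10-23 10:39:38.572", fmt)
print((devices_up_done - resumed).total_seconds())  # -> 3.105 seconds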
Now the question is: nova is properly waiting for the network-vif-plugged event, but that is not really the point at which the port is ready. Is there any other event that we should/could wait for, or is this a neutron issue in the end?