Note that https://review.openstack.org/#/c/587498/ says it fixes this bug, but it doesn't really. It fixes a symptom of this bug: after the failed port binding during live migration, restarting the source compute fails (so it's more aligned with bug 1738373).
The port binding failure could have been due to the neutron agent being down on the destination host during live migration, or maybe that host was out of fixed IPs, something like that.
The root issue is that nova saves the binding_failed vif_type in the instance info_cache, which later causes nova-compute to fail on restart.
I suspect the binding_failed data gets into the instance when the source compute receives a network-changed event from neutron after the port binding failure. The failure changed the vif_type on the port, and nova then saves that change into the info_cache.
There are a couple of related fixes for that binding_failed info_cache value:
1. https://review.openstack.org/#/c/603844/ - that would be a manual recovery action: reboot/rebuild the instance to force a re-binding of the port on the original host and fix the binding failure.
2. https://review.openstack.org/#/c/591607/ - that would force the info_cache to be refreshed periodically from the actual current state of the port in neutron, rather than relying on what is in the info_cache, which could be wrong/out of date, for example if the port binding was later fixed once the neutron agent was brought back online.
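
To make the second idea concrete, here is a rough sketch of what "refresh the cache from neutron's current port state" means. This is not the code from that review. FakeNeutron and heal_info_cache are hypothetical names I made up for illustration, and the cache is modeled as a plain list of dicts. The only real name here is the binding:vif_type port attribute.

```python
from dataclasses import dataclass, field


@dataclass
class FakeNeutron:
    """Hypothetical stand-in for neutron: maps port ID -> current vif_type."""
    ports: dict = field(default_factory=dict)

    def show_port(self, port_id):
        # Return only the fields this sketch needs.
        return {"id": port_id, "binding:vif_type": self.ports[port_id]}


def heal_info_cache(info_cache, neutron):
    """Overwrite cached vif types with what neutron currently reports.

    info_cache is a list of dicts like {"id": ..., "type": ...} (an
    assumed, simplified shape). Returns the IDs of the stale entries
    that were corrected.
    """
    healed = []
    for vif in info_cache:
        current = neutron.show_port(vif["id"])["binding:vif_type"]
        if vif["type"] != current:
            vif["type"] = current
            healed.append(vif["id"])
    return healed
```

Running something like this periodically would replace a stale binding_failed entry once the port is bound again in neutron, without anyone having to reboot or rebuild the instance.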
--
Alternatively, nova could ignore binding_failed vif_type changes during network-changed events, but that might lead to weird side effects if nova's version of the port state (in the info_cache) is different from the actual state in neutron.
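
A minimal sketch of that alternative, just to show the shape of it and the trade-off. apply_network_changed and the dict-based cache are hypothetical, not nova's actual event handling code.

```python
VIF_TYPE_BINDING_FAILED = "binding_failed"


def apply_network_changed(cached_vifs, port_updates):
    """Merge vif_type updates into the cache, ignoring binding_failed.

    cached_vifs: list of dicts like {"id": ..., "type": ...} (assumed shape).
    port_updates: dict mapping port ID -> vif_type reported by neutron.
    Returns (new_cache, skipped_port_ids); the input list is not modified.
    """
    new_cache = []
    skipped = []
    for vif in cached_vifs:
        new_type = port_updates.get(vif["id"], vif["type"])
        if (new_type == VIF_TYPE_BINDING_FAILED
                and vif["type"] != VIF_TYPE_BINDING_FAILED):
            # Keep the last known good type instead of caching the failure.
            skipped.append(vif["id"])
            new_type = vif["type"]
        new_cache.append({**vif, "type": new_type})
    return new_cache, skipped
```

The skipped list is exactly where the side effects would come from: for those ports the cache now disagrees with neutron, so at minimum they would need to be logged or revisited later.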