After discussing this with mriedem on IRC, it's worth noting that the above patch doesn't fix the underlying issue so much as a side effect of it, namely the inability of nova-compute to restart after the error has occurred. What it does fix is bug #1738373, which is solely focused on that side effect. That bug has now been marked as a duplicate of this one.
Cleaned up logs from the IRC discussion on #nova-compute below.
[19-12 15:12:52] <stephenfin> mriedem: Not to distract you now, but did you make a mistake on https://github.com/openstack/nova/commit/cdf8ba5acb ? You've said it fixes https://bugs.launchpad.net/nova/+bug/1784579 but that bug is for live migration, not compute service restart which is what your commit addresses
[19-12 15:13:18] <stephenfin> mriedem: I ask because I found a similar bug which does deal with the compute service restart https://bugs.launchpad.net/nova/+bug/1738373
[19-12 15:18:06] <mriedem> stephenfin: yes bug 1784579 is about os-vif port binding failed errors right?
[19-12 15:19:14] <stephenfin> mriedem: Yup, but it's to do with live migration and the fix is only for the service startup code path
[19-12 15:19:40] <stephenfin> At least, assuming I'm reading it right. I'll do some digging but just wanted to sanity check it before I dived down the rabbit hole :)
[19-12 15:19:56] <mriedem> stephenfin: the live migratoin fails because of the port binding failures
[19-12 15:21:15] <mriedem> stephenfin: comment
[19-12 15:21:16] <mriedem> 2
[19-12 15:21:17] <mriedem> "To summarize, it looks like the pre_live_migration method on the destination host fails to plug vifs and you end up with the "binding_failed" error, which is raised and makes the source live_migration method fail as expected. The failure is on the dest host. As a result, the info cache is updated with "binding_failed" which causes the source compute restart to fail here:"
[19-12 15:22:19] <mriedem> stephenfin: so no i didn't fix the original reason for the port binding failure in pre_live_migration, because that could have been for any number of reasons (neutron agent was down on the dest host?)
[19-12 15:22:38] <mriedem> i fixed a symptom of that failure, which was nova-compute failed to restart after that failure
[19-12 15:22:53] <mriedem> as the commit message says, "Admittedly this isn't the smartest thing and doesn't attempt
[19-12 15:22:54] <mriedem> to recover / fix the instance networking info"
[19-12 15:22:59] <stephenfin> mriedem: I'm missing something. Why make changes to 'ComputeManager.init_host' (via '_init_instance') in that commit? The exception was being seen in the live migration flow
[19-12 15:23:01] <stephenfin> ahhhhh
[19-12 15:23:21] <mriedem> 1. live migratoin fails, port binding failed - that gets saved in the info cache
[19-12 15:23:31] <mriedem> 2. restart source compute - that blows up because it wasn't handling binding_failed vif types in the os-vif conversion code
[19-12 15:23:38] <mriedem> i handle #2
[19-12 15:23:46] <mriedem> #1 is sort of out of my control
[19-12 15:23:50] <stephenfin> Your fix would inadvertently resolve https://bugs.launchpad.net/nova/+bug/1738373 so
[19-12 15:24:12] <mriedem> i mean, we probably shouldn't be saving off busted port binding information when pre_live_migration fails,
[19-12 15:24:27] <mriedem> since that overwrites the previously good port binding information from the source host
[19-12 15:25:03] <mriedem> i would have to dig into where we save off the bad port binding information
[19-12 15:25:07] <stephenfin> Yup, there's a related fix (also for live migration) that you worked on which looks more involved https://bugs.launchpad.net/nova/+bug/1783917
[19-12 15:26:02] <mriedem> ^ was a regression in rocky
[19-12 15:26:45] <mriedem> so i suppose my fix should have been related to bug 1784579
[19-12 15:26:49] <mriedem> not closes it
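
For anyone skimming the log, the two-step failure mriedem describes (a failed live migration leaves "binding_failed" in the info cache, then the source compute restart trips over it) can be modelled roughly as below. This is only a toy sketch: every name in it (VifPlugError, convert_vif, init_instance, init_host, the dict layout) is hypothetical and stands in for the real nova / os-vif code, which is not reproduced here. It assumes the fix amounts to catching the conversion failure per instance during startup instead of letting it abort the whole service.

```python
class VifPlugError(Exception):
    """Hypothetical stand-in for a VIF plugging/conversion failure."""


def convert_vif(vif):
    """Toy conversion step: a 'binding_failed' VIF cannot be converted."""
    if vif["type"] == "binding_failed":
        raise VifPlugError(f"cannot convert vif {vif['id']}: binding_failed")
    return {"id": vif["id"], "plugin": vif["type"]}


def init_instance(instance):
    """Plug VIFs for one instance at service startup (hypothetical).

    Without the try/except, a single instance whose cached network info
    contains 'binding_failed' would raise out of startup (step 2 in the log).
    """
    try:
        instance["plugged"] = [convert_vif(v) for v in instance["info_cache"]]
    except VifPlugError:
        # Handle the symptom only: flag the instance and carry on.
        # Nothing here tries to repair the cached networking info (step 1).
        instance["plugged"] = []
        instance["vm_state"] = "error"


def init_host(instances):
    """Toy service startup: initialise every instance on the host."""
    for instance in instances:
        init_instance(instance)
    return instances


if __name__ == "__main__":
    hosts = init_host([
        {"name": "good", "vm_state": "active",
         "info_cache": [{"id": "p1", "type": "ovs"}]},
        {"name": "stale", "vm_state": "active",
         "info_cache": [{"id": "p2", "type": "binding_failed"}]},
    ])
    for inst in hosts:
        print(inst["name"], inst["vm_state"])
```

As in the discussion, the bad binding stays in the cache in this sketch; only the restart is made to survive it.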