Hi Sean,
Thanks a lot for the thorough review and evaluation of the bug.
I appreciate it! It took me a while to find the time to parse everything and
give you a proper response.
> 1 if the sriov nic agent is used for standard sriov vnic types (direct,
> direct-physical, macvtap) nics __must not__ be in __switchdev__ mode, they
> __must__ be in __legacy__ mode
I'm not sure exactly what you mean here, but there is only one agent
(openvswitch-agent) on the compute and network nodes. That agent uses
the configuration in [2] and is not configured as SRIOV; the
switchdev/hw-offloading configuration is done in Open vSwitch.
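For reference, this is the standard Open vSwitch knob we mean (shown as an illustration of the usual OVS setup, not a claim about every deployment detail; the service name and devlink PCI address are examples):

```shell
# Enable hardware offload in Open vSwitch via its standard other_config knob.
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true

# The NIC eswitch is put into switchdev mode separately, e.g. with devlink
# (example PCI address):
#   devlink dev eswitch set pci/0000:03:00.0 mode switchdev

# Restart OVS so the setting takes effect (service name on Ubuntu):
systemctl restart openvswitch-switch
```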
> 2 vdpa support in nova currently does not support any move operations,
> vdpa support in nova requires the nic to be in switchdev mode.
I don't believe we are using this.
> 3 hardware offloaded ovs uses the ml2/ovs or ml2/ovn mechanism
> drivers and does not use the sriov nic agent.
Right, this is how we are doing it.
> 4 we do not support using the sriov nic agent and ovs hardware offload
> on the same physical nic. when using the sriov-nic-agent the nic must be
> in legacy mode and when using hardware offload it must be in switchdev
> mode. live migration from a host using the sriov nic agent to hardware
> offloaded ovs was not in scope.
The migration is between 2 switchdev hosts with ml2/ovs.
> this is caused by trying to move other vms that have a neutron sriov
> port with shelve and unshelve
> https://bugs.launchpad.net/nova/+bug/1851545
The bug above might be one of the possible problems related to this
message. If you follow the logs[3], you will see that here this is
happening because:
1 - During pre_live_migration, the neutron port is attached on the
destination host[4]
2 - pre_live_migration fails on the destination host and triggers an
exception on the source host[5] [6]->[7]
3 - rollback is triggered and tries to re-attach the port to the source
host, but the QEMU instance still holds the PCI address[8] and the PCI
address error is triggered
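To make the sequence concrete, here is a toy Python sketch of the rollback clash (all class and function names are mine, not Nova's; it only models the "address already in use" failure, not libvirt itself):

```python
class PciConflict(Exception):
    """Stand-in for libvirt's 'PCI address already in use' failure."""

class ToyGuest:
    """Minimal stand-in for a QEMU domain tracking attached PCI addresses."""
    def __init__(self):
        self.attached = set()

    def attach(self, pci_addr):
        if pci_addr in self.attached:
            raise PciConflict(f"PCI address {pci_addr} already in use")
        self.attached.add(pci_addr)

def migrate_with_naive_rollback(src):
    """Steps 1-3 above: pre_live_migration fails, rollback re-attaches."""
    try:
        # Steps 1-2: the destination-side failure surfaces at the source.
        raise RuntimeError("pre_live_migration failed on destination")
    except RuntimeError:
        # Step 3: rollback re-attaches at the source, but the VF was never
        # detached there, so the address clashes.
        src.attach("0000:3b:02.1")

guest = ToyGuest()
guest.attach("0000:3b:02.1")  # VF plugged before the migration starts
try:
    migrate_with_naive_rollback(guest)
except PciConflict as exc:
    print(exc)  # prints: PCI address 0000:3b:02.1 already in use
```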
> what i suspect has happened here is the live migration fails in the
> migration phase after pre_live_migrate
As I mentioned above, the failure was in the pre_live_migration function
(I caused it in my env, but it happened for some reason at the customer
site).
> unless you have correctly configured network manager in the guest to
> retrigger on the hotplug of the interface the guest won't have network
> connectivity restored until it reboots and the on-boot network
> configuration scripts run.
So, these are standard ubuntu images and are for sure configured to handle
hotplug, given they don't lose connection when the migration works.
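As an aside, what usually makes a standard Ubuntu cloud image survive the detach/attach cycle is a netplan stanza that matches the interface rather than naming it, roughly along these lines (illustrative only; the match value here is made up):

```
network:
  version: 2
  ethernets:
    vf-nic:
      match:
        driver: "mlx5_core"   # hypothetical: match the VF's kernel driver
      dhcp4: true
```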
So, it seems that the bug here is that rollback_live_migration_at_source()
is called both when the migration fails in pre_live_migration() and when
it fails in live_migration. But, for the case when something fails in
pre_live_migration, this shouldn't be done.
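A minimal sketch of what I mean, assuming a hypothetical flag telling rollback which phase failed (none of these names exist in Nova; this is just the shape of the fix):

```python
PRE_LIVE_MIGRATION = "pre_live_migration"
LIVE_MIGRATION = "live_migration"

def rollback_at_source(attached, pci_addr, failed_phase):
    """Return the PCI addresses attached at the source after rollback.

    Hypothetical helper: skip the re-attach when pre_live_migration failed,
    because nothing was ever detached from the source guest.
    """
    if failed_phase == PRE_LIVE_MIGRATION:
        # VF is still plugged at the source; re-attaching would clash.
        return set(attached)
    # Failure later in the migration: the source may have unplugged the VF,
    # so re-attach it.
    return set(attached) | {pci_addr}

# Failure in pre_live_migration: the VF stays attached once, no double use.
print(rollback_at_source({"0000:3b:02.1"}, "0000:3b:02.1", PRE_LIVE_MIGRATION))
# prints: {'0000:3b:02.1'}
```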
Now I'm curious and will test whether this attempt/error to re-attach the
device at the same address is what is making the instance lose
connectivity. I'll test that. Please let me know your thoughts on
my suspicion above.
Erlon
_______
[1] neutron.conf: https://gist.github.com/sombrafam/ca6ba9224629a69e48e571b5e45f2040
[2] openvswitch_agent.ini: https://gist.github.com/sombrafam/feab8c8f7a389d9c92e89f35a629abb0
[3] detailed migration error logs, of 29e7d319 from compute 0 -> compute 1: https://gist.githubusercontent.com/sombrafam/6edfc04fc45631621c73054909df510d/raw/838f9d6f4139fc4c52c8b22d5008a61d45dca0f6/migration%2520log
[4] https://gist.github.com/sombrafam/6edfc04fc45631621c73054909df510d#file-migration-log-L130
[5] https://gist.github.com/sombrafam/6edfc04fc45631621c73054909df510d#file-migration-log-L179
[6] https://github.com/openstack/nova/blob/stable/ussuri/nova/virt/libvirt/driver.py#L9585
[7] https://github.com/openstack/nova/blob/stable/ussuri/nova/compute/manager.py#L8157
[8] https://gist.github.com/sombrafam/6edfc04fc45631621c73054909df510d#file-migration-log-L266