live migration fails due to port binding duplicate key entry in post_live_migrate
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | In Progress | Medium | sean mooney |
Rocky | Triaged | Medium | Unassigned |
Stein | Triaged | Medium | Unassigned |
Bug Description
We are converting a site from RDO to OSA; at this stage all control nodes and net nodes are running OSA (Rocky), some compute nodes are running RDO (Queens), some are RDO (Rocky), and the remaining are OSA (Rocky).
We are attempting to Live Migrate VMs from the RDO (Rocky) nodes to OSA (Rocky) before reinstalling the RDO nodes as OSA (Rocky).
When live migrating between RDO nodes we see no issues, and similarly when migrating between OSA nodes; however, live migrating RDO -> OSA fails with the below error on the target.
2019-01-24 13:33:11.701 85926 INFO nova.network.
2019-01-24 13:33:59.357 85926 ERROR nova.network.
Digging further into the logs reveals an issue with duplicate keys:
2019-02-01 09:48:10.268 11854 ERROR oslo_db.api [req-152bce20-
pute24-kna1' for key 'PRIMARY'") [SQL: u'UPDATE ml2_port_bindings SET host=%(host)s, profile=
pe': 'unbound', 'ml2_port_
This is confirmed when reviewing the ml2_port_bindings table:
MariaDB [neutron]> select * from ml2_port_bindings where port_id = '5bedceef-
+------
| port_id | host | vif_type | vnic_type | vif_details | profile | status |
+------
| 5bedceef-
| 5bedceef-
+------
The exception is not caught and handled, and the VM is stuck in the migrating state. According to OpenStack the VM is still on the source compute node, whilst libvirt/virsh believes it to be on the target. Forcing the VM state to active keeps the VM available, but rebooting it will result in an ERROR state (this is resolved by destroying the VM in virsh on the target, forcing it back to the active state, and rebooting). It also cannot be migrated again due to the error state in the DB (this can be fixed by manually removing the inactive port binding and clearing the profile from the active one).
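For reference, the same manual cleanup can be done through the Neutron API instead of editing the database directly, assuming the binding-extended extension is available. This is only a sketch; the endpoint, token and hostnames are placeholders:

import requests

NEUTRON = "http://neutron.example.com:9696/v2.0"   # placeholder endpoint
HEADERS = {"X-Auth-Token": "<token>", "Content-Type": "application/json"}

def cleanup(port_id, stale_host):
    # Remove the leftover INACTIVE binding for the target host.
    requests.delete("{}/ports/{}/bindings/{}".format(NEUTRON, port_id, stale_host),
                    headers=HEADERS)
    # Clear the binding profile left behind on the port's ACTIVE binding.
    requests.put("{}/ports/{}".format(NEUTRON, port_id),
                 json={"port": {"binding:profile": {}}}, headers=HEADERS)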
In discussions with both mnasser and sean-k-mooney, it is understood that there are two distinct live migration flows with regard to port binding:
1) the "old" flow: the port is deactivated on the source before being activated on the target, meaning only one entry in the ml2_port_bindings table at all times, at the expense of an added network outage during live migration
2) the "new" flow: an inactive port binding is added for the target, before the old binding is removed and the new binding activated (sketched below)
We can see by monitoring the ml2_port_bindings table during live migrations that:
1. RDO -> RDO use the old flow (only one entry in the ml2_port_bindings table at all times)
2. OSA -> OSA uses the new flow (two entries which are cleaned up)
3. RDO -> OSA use the new flow, two entries, which are not cleaned up
This is unexpected, as even in the RDO to RDO case, both nodes are Rocky and so the new process should be in use.
Adding more debug statements to nova/network/
if port_migrating or teardown:
# Now get the port details to process the ports
# binding profile info.
Looking at the state here:
RDO - OSA (2019-02-13 13:41:33.809)
-------
{
device_
tenant_
BINDING_
}
OSA - OSA (2019-02-13 14:33:36.333)
-------
{
device_
tenant_
BINDING_
}
At this point in the flow, BINDING_HOST_ID is the source hostname
continuing:
data = self.list_
ports = data['ports']
LOG.info(
and examining again:
RDO - OSA (2019-02-13 13:41:33.887)
-------
ports =
[
....
u'binding:host_id': u'cc-compute10-
....
}
]
OSA - OSA: (2019-02-13 14:33:36.422)
-------
ports =
[
....
u'binding:host_id': u'cc-compute29-
....
}
]
Now we can see that in the RDO - OSA case the binding:host_id returned in the ports is the source hostname, whereas in the OSA - OSA case the binding:host_id is the target hostname.
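A quick way to see which binding Neutron considers ACTIVE for a given port, without querying MariaDB directly, is the bindings sub-resource of the binding-extended API. Again only a sketch, with placeholder endpoint and token:

import requests

NEUTRON = "http://neutron.example.com:9696/v2.0"   # placeholder endpoint
HEADERS = {"X-Auth-Token": "<token>"}

def show_bindings(port_id):
    # Lists every binding row for the port, including its host and status.
    resp = requests.get("{}/ports/{}/bindings".format(NEUTRON, port_id),
                        headers=HEADERS)
    for binding in resp.json().get("bindings", []):
        print(binding["host"], binding["status"], binding.get("vif_type"))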
The actual fault itself is in:
self.network_
migration = {'source_compute': instance.host,
If we trace those calls in the RDO - OSA case, then we will end up here:
# Avoid rolling back updates if we catch an error above.
# TODO(lbeliveau): Batch up the port updates in one neutron call.
for port_id, updates in port_updates:
if updates:
In the OSA - OSA case these calls appear to have no effect; in the RDO - OSA case they cause the internal error (as they are updating the ports rather than activating/deleting bindings as expected).
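For illustration, the kind of port update this path ends up issuing looks roughly like the following (sketch only, placeholders throughout). Because an INACTIVE binding already exists for the target host, ML2's resulting UPDATE of ml2_port_bindings collides with the primary key (which in Rocky appears to include the host column) and raises the duplicate-entry error shown above:

import requests

NEUTRON = "http://neutron.example.com:9696/v2.0"   # placeholder endpoint
HEADERS = {"X-Auth-Token": "<token>", "Content-Type": "application/json"}

# Updating binding:host_id on the port makes ML2 rewrite the existing ACTIVE
# row in ml2_port_bindings; if an INACTIVE row for the same host is already
# present, that UPDATE hits the duplicate primary key seen in the traceback.
requests.put("{}/ports/{}".format(NEUTRON, "<port-id>"),
             json={"port": {"binding:host_id": "<target-host>",
                            "binding:profile": {}}},
             headers=HEADERS)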
information type: Public → Private
Changed in nova:
importance: Undecided → High
importance: High → Medium
Changed in nova:
assignee: sean mooney (sean-k-mooney) → Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → sean mooney (sean-k-mooney)
Changed in nova:
assignee: sean mooney (sean-k-mooney) → Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → sean mooney (sean-k-mooney)
This scenario doesn't really make sense to me:
1. RDO -> RDO use the old flow (only one entry in the ml2_port_bindings table at all times)
Because if the RDO nodes are running Rocky code, they should hit this in the live migration task in the conductor service:
https://github.com/openstack/nova/blob/stable/rocky/nova/conductor/tasks/live_migrate.py#L282
Which enables the new flow if:
1. The neutron "Port Bindings Extended" API extension is available
2. Both the source and target nova-compute service versions in the "services" table in the cell database are >= 35, which should be the case if you're running Rocky everywhere, but you said you have some RDO computes that are still on Queens.
So I'd double check that first.
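Something like the following can be used to verify both preconditions. This is a sketch: the Neutron endpoint and token are placeholders, and the services query is run against the cell database with whatever DB client you prefer:

import requests

NEUTRON = "http://neutron.example.com:9696/v2.0"   # placeholder endpoint
HEADERS = {"X-Auth-Token": "<token>"}

# 1. Is the "Port Bindings Extended" extension (alias binding-extended)
#    exposed by Neutron?
exts = requests.get("{}/extensions".format(NEUTRON), headers=HEADERS).json()
print(any(ext.get("alias") == "binding-extended" for ext in exts.get("extensions", [])))

# 2. Are all nova-compute services at version >= 35? Run against the cell DB:
#      SELECT host, version FROM services
#      WHERE binary = 'nova-compute' AND deleted = 0;
#    Every row needs version >= 35 for the new flow to be selected.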