Hi Matt,
Here are some more detailed logs which may assist - it seemed easier to mail them than to try to copy and paste into Launchpad; plus I haven’t sanitised them.
Compute10 == RDO (Rocky) -> Source
Compute29 == OSA (Rocky) -> Destination
Compute28 == OSA (Rocky) -> Source
i.e.: OSA -> OSA is 28 -> 29
      RDO -> OSA is 10 -> 29
osa_osa_mysql contains the port bindings captured throughout an OSA -> OSA migration
mysql contains the port bindings captured throughout an RDO -> OSA migration
- Sorry, I haven’t got a capture like this for RDO -> RDO immediately available; I can generate one if it will help
- the lines [BRUCE] (XXX) are extra logging statements I’ve added; if it is A-X it is in api.py, if it’s M-X it’s in migration.py (I can send these modified files as well; see the sketch after this list)
- I have deconstructed the logs and added my own extra comments in some places
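
For reference, the added lines look roughly like this (a sketch only - it
uses stdlib logging so it runs standalone, whereas nova itself uses
oslo_log, and the exact statements are in the modified files):

    import logging

    LOG = logging.getLogger(__name__)
    logging.basicConfig(level=logging.INFO)

    # Placeholder values standing in for the real arguments at the call site
    port_migrating, teardown = True, False

    # Example of an "A-X" marker; this one sits in setup_networks_on_host()
    # in nova/network/neutronv2/api.py (the "M-X" markers are the same idea
    # in migration.py)
    LOG.info("[BRUCE](A-3): setup_networks_on_host: port_migrating(%s) or "
             "teardown(%s)", port_migrating, teardown)
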
--
Robert Colvin
Senior System Engineer
Mobile: +46 700 194847
www.citynetwork.eu | www.citycloud.com
INNOVATION THROUGH OPEN IT INFRASTRUCTURE
ISO 9001, 14001, 27001, 27015 & 27018 CERTIFIED
> On 4 Apr 2019, at 04:33, Matt Riedemann wrote:
>
> https://review.openstack.org/#/c/649464/ didn't fail the nova-live-
> migration job, with or without post-copy enabled, meaning the
> destination port binding gets activated as part of
> post_live_migration_at_destination, which goes through the code that
> updates the port's binding:host_id value, which is what you reported
> caused the failure, so I'm not sure yet how to recreate this.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1822884
>
> Title:
> live migration fails due to port binding duplicate key entry in
> post_live_migrate
>
> Status in OpenStack Compute (nova):
> New
>
> Bug description:
> We are converting a site from RDO to OSA; at this stage all control
> nodes and net nodes are running OSA (Rocky), some compute nodes are
> running RDO (Queens), some are RDO (Rocky) and the remaining are OSA
> (Rocky).
>
> We are attempting to Live Migrate VMs from the RDO (Rocky) nodes to
> OSA (Rocky) before reinstalling the RDO nodes as OSA (Rocky).
>
> When live migrating between RDO nodes we see no issues, and similarly
> when migrating between OSA nodes; however, live migrating RDO -> OSA
> fails with the below error on the target.
>
> 2019-01-24 13:33:11.701 85926 INFO nova.network.neutronv2.api [req-3c4ceac0-7c12-428d-9f63-7c1d6879be4e a3bee416cf67420995855d602d2bccd3 a564613210ee43708b8a7fc6274ebd63 - default default] [instance: 8ecbfc14-2699-4276-a577-18ed6a662232] Updating port 419c18e1-05c0-44c3-9a97-8334d0c15cc1 with attributes {'binding:profile': {}, 'binding:host_id': 'cc-compute24-kna1'}
> 2019-01-24 13:33:59.357 85926 ERROR nova.network.neutronv2.api [req-3c4ceac0-7c12-428d-9f63-7c1d6879be4e a3bee416cf67420995855d602d2bccd3 a564613210ee43708b8a7fc6274ebd63 - default default] [instance: 8ecbfc14-2699-4276-a577-18ed6a662232] Unable to update binding details for port 419c18e1-05c0-44c3-9a97-8334d0c15cc1: InternalServerError: Request Failed: internal server error while processing your request.
>
> Digging further into the logs reveals an issue with duplicate keys:
>
> 2019-02-01 09:48:10.268 11854 ERROR oslo_db.api [req-152bce20-8895-4238-910c-b26fde44913d e7bbce5e15994104bdf5e3af68a55b31 a894e8109af3430aa7ae03e0c49a0aa0 - default default] DB exceeded retry limit.: DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, u"Duplicate entry '5bedceef-1b65-4b19-a6d4-da3595adaf61-cc-compute24-kna1' for key 'PRIMARY'") [SQL: u'UPDATE ml2_port_bindings SET host=%(host)s, profile=%(profile)s, vif_type=%(vif_type)s, vif_details=%(vif_details)s WHERE ml2_port_bindings.port_id = %(ml2_port_bindings_port_id)s AND ml2_port_bindings.host = %(ml2_port_bindings_host)s'] [parameters: {'profile': '{}', 'vif_type': 'unbound', 'ml2_port_bindings_host': u'cc-compute06-kna1', 'vif_details': '', 'ml2_port_bindings_port_id': u'5bedceef-1b65-4b19-a6d4-da3595adaf61', 'host': u'cc-compute24-kna1'}] (Background on this error at: http://sqlalche.me/e/gkpj)
>
> This is confirmed when reviewing the ml2_port_bindings table:
>
> MariaDB [neutron]> select * from ml2_port_bindings where port_id = '5bedceef-1b65-4b19-a6d4-da3595adaf61';
> +--------------------------------------+-------------------+----------+-----------+---------------------------------------------------------------------------+---------------------------------------+----------+
> | port_id | host | vif_type | vnic_type | vif_details | profile | status |
> +--------------------------------------+-------------------+----------+-----------+---------------------------------------------------------------------------+---------------------------------------+----------+
> | 5bedceef-1b65-4b19-a6d4-da3595adaf61 | cc-compute06-kna1 | ovs | normal | {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true} | {"migrating_to": "cc-compute24-kna1"} | ACTIVE |
> | 5bedceef-1b65-4b19-a6d4-da3595adaf61 | cc-compute24-kna1 | ovs | normal | {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true} | | INACTIVE |
> +--------------------------------------+-------------------+----------+-----------+---------------------------------------------------------------------------+---------------------------------------+----------+
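
To illustrate why that UPDATE trips the primary key: ml2_port_bindings is
keyed on (port_id, host), so renaming the source row's host to the
destination host collides with the INACTIVE row that the new-style flow
has already created for the destination. A standalone sketch of just that
mechanic (sqlite3 purely so it runs anywhere; the real table lives in
MariaDB and has more columns, and sqlite reports the clash as a UNIQUE
constraint failure rather than MySQL's 1062):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE ml2_port_bindings (
            port_id TEXT NOT NULL,
            host    TEXT NOT NULL,
            profile TEXT,
            status  TEXT,
            PRIMARY KEY (port_id, host)
        )
    """)

    port = "5bedceef-1b65-4b19-a6d4-da3595adaf61"
    # The state shown in the table dump above: an ACTIVE binding on the
    # source and an INACTIVE binding already created on the destination.
    conn.execute("INSERT INTO ml2_port_bindings VALUES (?, ?, ?, ?)",
                 (port, "cc-compute06-kna1",
                  '{"migrating_to": "cc-compute24-kna1"}', "ACTIVE"))
    conn.execute("INSERT INTO ml2_port_bindings VALUES (?, ?, ?, ?)",
                 (port, "cc-compute24-kna1", "", "INACTIVE"))

    # The old-style update then tries to rename the source binding to the
    # destination host, which collides with the INACTIVE row:
    try:
        conn.execute("UPDATE ml2_port_bindings SET host = ? "
                     "WHERE port_id = ? AND host = ?",
                     ("cc-compute24-kna1", port, "cc-compute06-kna1"))
    except sqlite3.IntegrityError as exc:
        print("Duplicate key, as in the neutron log:", exc)
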
>
> The exception is not caught and handled, and the VM is stuck in the
> migrating state. According to OpenStack, the VM is still on the source
> compute node, whilst libvirt/virsh believes it to be on the target.
> Forcing the VM state to active keeps the VM available; however,
> rebooting will result in an ERROR state (this is resolved by
> destroying the VM in virsh on the target, forcing it back to the
> active state, and rebooting). Nor can another migration be attempted,
> due to the error state in the DB (this can be fixed by manually
> removing the inactive port binding and clearing the profile from the
> active binding).
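
The DB part of that manual fix amounts to two statements - delete the
stale INACTIVE row and blank the leftover migrating_to profile on the
ACTIVE one. Here is the same toy sqlite setup as in the sketch above,
repeated so it stays standalone, with the cleanup applied (on the real
cloud the equivalent DELETE/UPDATE were run against neutron's MariaDB
ml2_port_bindings table, so treat this as an illustration of the idea,
not a recipe):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ml2_port_bindings ("
                 "port_id TEXT NOT NULL, host TEXT NOT NULL, "
                 "profile TEXT, status TEXT, PRIMARY KEY (port_id, host))")

    port = "5bedceef-1b65-4b19-a6d4-da3595adaf61"
    conn.execute("INSERT INTO ml2_port_bindings VALUES (?, ?, ?, ?)",
                 (port, "cc-compute06-kna1",
                  '{"migrating_to": "cc-compute24-kna1"}', "ACTIVE"))
    conn.execute("INSERT INTO ml2_port_bindings VALUES (?, ?, ?, ?)",
                 (port, "cc-compute24-kna1", "", "INACTIVE"))

    # 1) remove the stale INACTIVE binding left behind on the destination
    conn.execute("DELETE FROM ml2_port_bindings "
                 "WHERE port_id = ? AND status = 'INACTIVE'", (port,))
    # 2) clear the leftover migrating_to profile on the ACTIVE binding
    conn.execute("UPDATE ml2_port_bindings SET profile = '' "
                 "WHERE port_id = ? AND status = 'ACTIVE'", (port,))

    print(conn.execute("SELECT host, profile, status "
                       "FROM ml2_port_bindings WHERE port_id = ?",
                       (port,)).fetchall())
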
>
> In discussions with both mnasser and sean-k-mooney, it is understood
> that there are two distinct live migration flows with regard to port
> binding:
>
> 1) the "old" flow: the port binding is deactivated on the source before being activated on the target - meaning only one entry in the ml2_port_bindings table at all times, at the expense of added network outage during live migration
> 2) the "new" flow: an inactive binding is added on the target before the old binding is removed and the new binding activated (sketched just below)
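
For clarity, my reading of the new flow, sketched as raw calls against the
port bindings API described in the spec linked below (the endpoint, token
and host values are placeholders, and nova drives this through its own
neutron client rather than requests, so this only shows the sequence):

    import requests

    NEUTRON = "http://neutron.example:9696/v2.0"   # placeholder endpoint
    HEADERS = {"X-Auth-Token": "ADMIN_TOKEN"}      # placeholder token
    port_id = "5bedceef-1b65-4b19-a6d4-da3595adaf61"
    source, dest = "cc-compute06-kna1", "cc-compute24-kna1"

    # 1) pre-live-migration: create an INACTIVE binding on the destination
    requests.post("{}/ports/{}/bindings".format(NEUTRON, port_id),
                  json={"binding": {"host": dest}}, headers=HEADERS)

    # 2) once the guest has moved: activate the destination binding (which
    #    deactivates the source one) and remove the old source binding; the
    #    exact split of this work between nova and neutron is per the spec
    requests.put("{}/ports/{}/bindings/{}/activate".format(
                     NEUTRON, port_id, dest), headers=HEADERS)
    requests.delete("{}/ports/{}/bindings/{}".format(
                        NEUTRON, port_id, source), headers=HEADERS)
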
>
> We can see by monitoring the ml2_port_bindings table during live
> migrations that:
>
> 1. RDO -> RDO uses the old flow (only one entry in the ml2_port_bindings table at all times)
> 2. OSA -> OSA uses the new flow (two entries, which are cleaned up)
> 3. RDO -> OSA uses the new flow (two entries, which are not cleaned up)
>
> This is unexpected, as even in the RDO -> RDO case both nodes are
> Rocky, and so the new flow should be in use.
>
> (https://git.openstack.org/cgit/openstack/nova-
> specs/tree/specs/queens/approved/neutron-new-port-binding-
> api.rst?h=refs/changes/80/375580/16)
>
> Adding more debug statements to nova/network/neutronv2/api.py, I can
> isolate where the code paths diverge in the RDO - OSA and OSA - OSA
> cases.
>
> if port_migrating or teardown:
>     LOG.info("[BRUCE](A-3): setup_networks_on_host: port_migrating(%s) or teardown(%s)", port_migrating, teardown)
>     search_opts = {'device_id': instance.uuid,
>                    'tenant_id': instance.project_id,
>                    BINDING_HOST_ID: instance.host}
>     # Now get the port details to process the ports
>     # binding profile info.
>     LOG.info("[BRUCE](A-3.1): setup_networks_on_host: list_ports(context, {device_id(%s), tenant_id(%s), BINDING_HOST_ID(%s)}", instance.uuid, instance.project_id, instance.host)
>
> Looking at the state here
>
> RDO - OSA (2019-02-13 13:41:33.809)
> ------------------------------------------------------------
> {
> device_id(8fe8b490-13aa-4fc7-87a0-39706200b904), <----- UUID
> tenant_id(a4cac02597bc4013b69a9cd9373f181c), <----- project id
> BINDING_HOST_ID(cc-compute10-kna1) <----- source
> }
>
> OSA - OSA (2019-02-13 14:33:36.333)
> -------------------------------------------------------------------
> {
> device_id(90ca9b4f-ab25-4301-b1a5-9326cb8fd038), <----- UUID
> tenant_id(a4cac02597bc4013b69a9cd9373f181c), <----- project id
> BINDING_HOST_ID(cc-compute28-kna1) <----- source
> }
>
> At this point in the flow, BINDING_HOST_ID is the source hostname
>
> continuing:
>
> data = self.list_ports(context, **search_opts)
> ports = data['ports']
> LOG.info("[BRUCE](A-3.2): setup_networks_on_host: ports = %s", ports)
>
> and examining again:
>
> RDO - OSA (2019-02-13 13:41:33.887)
> ------------------------------------------------------------
> ports =
> [
> ....
> u'binding:host_id': u'cc-compute10-kna1',
> ....
> }
> ]
>
> OSA - OSA: (2019-02-13 14:33:36.422)
> -----------------------------------------------------------------------
> ports =
> [
> ....
> u'binding:host_id': u'cc-compute29-kna1',
> ....
> }
> ]
>
> Now we can see that in the RDO - OSA case the binding:host_id
> returned in the ports is the source hostname, whereas in the OSA - OSA
> case the binding:host_id is the target hostname.
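
For what it's worth, the same check can be reproduced outside nova with
python-neutronclient (my own sketch; the auth details are placeholders and
the device/host values are the ones from the RDO - OSA example above):

    from keystoneauth1 import loading, session
    from neutronclient.v2_0 import client

    loader = loading.get_plugin_loader("password")
    auth = loader.load_from_options(
        auth_url="http://keystone.example:5000/v3",   # placeholder
        username="admin", password="PASSWORD",
        project_name="admin",
        user_domain_name="Default", project_domain_name="Default")
    neutron = client.Client(session=session.Session(auth=auth))

    # Same filter nova builds in setup_networks_on_host: the instance's
    # ports, scoped to the *source* host binding.
    search_opts = {"device_id": "8fe8b490-13aa-4fc7-87a0-39706200b904",
                   "tenant_id": "a4cac02597bc4013b69a9cd9373f181c",
                   "binding:host_id": "cc-compute10-kna1"}

    for port in neutron.list_ports(**search_opts)["ports"]:
        print(port["id"], port["binding:host_id"])
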
>
> The actual fault is in:
>
> self.network_api.setup_networks_on_host(context, instance,
>                                          self.host)
> migration = {'source_compute': instance.host,
>              'dest_compute': self.host, }
> self.network_api.migrate_instance_finish(context,
>                                           instance,
>                                           migration)
>
> If we trace those calls in the RDO - OSA case, then we will end up
> here:
>
> # Avoid rolling back updates if we catch an error above.
> # TODO(lbeliveau): Batch up the port updates in one neutron call.
> for port_id, updates in port_updates:
>     if updates:
>         LOG.info("Updating port %(port)s with "
>                  "attributes %(attributes)s",
>                  {"port": port_id, "attributes": updates},
>                  instance=instance)
>         try:
>             neutron.update_port(port_id, {'port': updates})
>         except Exception:
>             with excutils.save_and_reraise_exception():
>                 LOG.exception("Unable to update binding details "
>                               "for port %s",
>                               port_id, instance=instance)
>
> In the OSA - OSA case these calls appear to have no effect; in the
> RDO - OSA case they cause the internal error (as they are updating the
> ports rather than activating/deleting the bindings as expected).
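
In other words, in the RDO - OSA case the loop above ends up issuing the
plain binding update from the "Updating port ... with attributes" log line
at the top of this report, and with an INACTIVE binding already sitting on
the destination host that update is what lands on the duplicate key.
Roughly equivalent to the following (my own sketch, placeholder auth
details, and obviously not something to run against a live port):

    from keystoneauth1 import loading, session
    from neutronclient.v2_0 import client

    loader = loading.get_plugin_loader("password")
    auth = loader.load_from_options(
        auth_url="http://keystone.example:5000/v3",   # placeholder
        username="admin", password="PASSWORD",
        project_name="admin",
        user_domain_name="Default", project_domain_name="Default")
    neutron = client.Client(session=session.Session(auth=auth))

    # The same attributes as the INFO log line near the top of this report;
    # with an INACTIVE binding already present on cc-compute24-kna1, this is
    # what neutron turns into the failing UPDATE on ml2_port_bindings.
    neutron.update_port("419c18e1-05c0-44c3-9a97-8334d0c15cc1",
                        {"port": {"binding:host_id": "cc-compute24-kna1",
                                  "binding:profile": {}}})
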
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/nova/+bug/1822884/+subscriptions