live migration fails due to port binding duplicate key entry in post_live_migrate

Bug #1822884 reported by Robert Colvin
26
This bug affects 4 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
In Progress
Medium
sean mooney
Rocky
Triaged
Medium
Unassigned
Stein
Triaged
Medium
Unassigned

Bug Description

We are converting a site from RDO to OSA; At this stage all control nodes and net nodes are running OSA (Rocky), some compute are running RDO (Queens), some are RDO (Rocky) and the remaining are OSA (Rocky).

We are attempting to Live Migrate VMs from the RDO (Rocky) nodes to OSA (Rocky) before reinstalling the RDO nodes as OSA (Rocky).

When Live Migrating between RDO nodes we see no issues similarly when migrating between OSA nodes, we see no issue, however Live Migrating RDO -> OSA fails with the below error on the target.

2019-01-24 13:33:11.701 85926 INFO nova.network.neutronv2.api [req-3c4ceac0-7c12-428d-9f63-7c1d6879be4e a3bee416cf67420995855d602d2bccd3 a564613210ee43708b8a7fc6274ebd63 - default default] [instance: 8ecbfc14-2699-4276-a577-18ed6a662232] Updating port 419c18e1-05c0-44c3-9a97-8334d0c15cc1 with attributes {'binding:profile': {}, 'binding:host_id': 'cc-compute24-kna1'}
2019-01-24 13:33:59.357 85926 ERROR nova.network.neutronv2.api [req-3c4ceac0-7c12-428d-9f63-7c1d6879be4e a3bee416cf67420995855d602d2bccd3 a564613210ee43708b8a7fc6274ebd63 - default default] [instance: 8ecbfc14-2699-4276-a577-18ed6a662232] Unable to update binding details for port 419c18e1-05c0-44c3-9a97-8334d0c15cc1: InternalServerError: Request Failed: internal server error while processing your request.

Digging further into the logs, reveals an issue with duplicate keys:

2019-02-01 09:48:10.268 11854 ERROR oslo_db.api [req-152bce20-8895-4238-910c-b26fde44913d e7bbce5e15994104bdf5e3af68a55b31 a894e8109af3430aa7ae03e0c49a0aa0 - default default] DB exceeded retry limit.: DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, u"Duplicate entry '5bedceef-1b65-4b19-a6d4-da3595adaf61-cc-com
pute24-kna1' for key 'PRIMARY'") [SQL: u'UPDATE ml2_port_bindings SET host=%(host)s, profile=%(profile)s, vif_type=%(vif_type)s, vif_details=%(vif_details)s WHERE ml2_port_bindings.port_id = %(ml2_port_bindings_port_id)s AND ml2_port_bindings.host = %(ml2_port_bindings_host)s'] [parameters: {'profile': '{}', 'vif_ty
pe': 'unbound', 'ml2_port_bindings_host': u'cc-compute06-kna1', 'vif_details': '', 'ml2_port_bindings_port_id': u'5bedceef-1b65-4b19-a6d4-da3595adaf61', 'host': u'cc-compute24-kna1'}] (Background on this error at: http://sqlalche.me/e/gkpj)

this is confirmed when reviewing the ml2_port_bindings table:

MariaDB [neutron]> select * from ml2_port_bindings where port_id = '5bedceef-1b65-4b19-a6d4-da3595adaf61';
+--------------------------------------+-------------------+----------+-----------+---------------------------------------------------------------------------+---------------------------------------+----------+
| port_id | host | vif_type | vnic_type | vif_details | profile | status |
+--------------------------------------+-------------------+----------+-----------+---------------------------------------------------------------------------+---------------------------------------+----------+
| 5bedceef-1b65-4b19-a6d4-da3595adaf61 | cc-compute06-kna1 | ovs | normal | {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true} | {"migrating_to": "cc-compute24-kna1"} | ACTIVE |
| 5bedceef-1b65-4b19-a6d4-da3595adaf61 | cc-compute24-kna1 | ovs | normal | {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true} | | INACTIVE |
+--------------------------------------+-------------------+----------+-----------+---------------------------------------------------------------------------+---------------------------------------+----------+

The exception is not caught and handled, the VM is stuck in migrating. According to OpenStack, the VM is still on the source compute node, whilst libvirt/virsh believes it to be on the target. Forcing the VM state to active, keeps the VM available, however rebooting will result in an ERROR state (this is resolved by destroying the VM in virsh on the target, forcing back to active state, and rebooting) nor can it be attempted to be migrated due to the error state in the DB (this can be fixed by manually removing the inactive port, and clearing the profile from the active port).

In discussions with both mnasser and sean-k-mooney, it is understood that there are two distinct live migration flows with regards to port binding

1) the "old" flow: the port is deactivated on the source before being activated on the target - meaning only one entry in the ml2_port_bindings tables, at the expense of added network outage during live migration
2) the "new" flow: an inactive port is added to the target, before the old port is removed and the new port activated

We can see is monitoring the ml2_port_binding table during live migrations that

1. RDO -> RDO use the old flow (only one entry in the ml2_port_bindings table at all times)
2. OSA -> OSA uses the new flow (two entries which are cleaned up)
3. RDO -> OSA use the new flow, two entries, which are not cleaned up

This is unexpected, as even in the RDO to RDO case, both nodes are Rocky and so the new process should be in use.

(https://git.openstack.org/cgit/openstack/nova-specs/tree/specs/queens/approved/neutron-new-port-binding-api.rst?h=refs/changes/80/375580/16)

Adding more debug statements to nova/network/neutronv2/api.py I can isolate where the code paths diverge in the RDO - RDO and RDO - OSA cases.

if port_migrating or teardown:
            LOG.info("[BRUCE](A-3): setup_networks_on_host: port_migrating(%s) or teardown(%s)", port_migrating, teardown)
            search_opts = {'device_id': instance.uuid,
                           'tenant_id': instance.project_id,
                           BINDING_HOST_ID: instance.host}
            # Now get the port details to process the ports
            # binding profile info.
            LOG.info("[BRUCE](A-3.1): setup_networks_on_host: list_ports(context, {device_id(%s), tenant_id(%s), BINDING_HOST_ID(%s)}", instance.uuid, instance.project_id, instance.host)

Looking at the state here

RDO - OSA (2019-02-13 13:41:33.809)
------------------------------------------------------------
{
    device_id(8fe8b490-13aa-4fc7-87a0-39706200b904), <----- UUID
    tenant_id(a4cac02597bc4013b69a9cd9373f181c), <----- project id
    BINDING_HOST_ID(cc-compute10-kna1) <----- source
}

OSA - OSA (2019-02-13 14:33:36.333)
-------------------------------------------------------------------
{
    device_id(90ca9b4f-ab25-4301-b1a5-9326cb8fd038), <----- UUID
    tenant_id(a4cac02597bc4013b69a9cd9373f181c), <----- project id
    BINDING_HOST_ID(cc-compute28-kna1)} <----- source
}

At this point in the flow, BINDING_HOST_ID is the source hostname

continuing:

data = self.list_ports(context, **search_opts)
ports = data['ports']
LOG.info("[BRUCE](A-3.2): setup_networks_on_host: ports = %s", ports)

and examining again:

RDO - OSA (2019-02-13 13:41:33.887)
------------------------------------------------------------
ports =
[
....
u'binding:host_id': u'cc-compute10-kna1',
....
}
]

OSA - OSA: (2019-02-13 14:33:36.422)
-----------------------------------------------------------------------
ports =
[
....
u'binding:host_id': u'cc-compute29-kna1',
....
}
]

now we can see that in the RDO - OSA case, the binding:host_id as returned in the ports is the source hostname; whereas in the OSA - OSA case, the binding:host_id is the target hostname

The actual fault itself is in:

self.network_api.setup_networks_on_host(context, instance,
                                                         self.host)
        migration = {'source_compute': instance.host,
                     'dest_compute': self.host, }
        self.network_api.migrate_instance_finish(context,
                                                 instance,
                                                 migration)

if we trace those calls in the RDO - OSA case, then we will end up here:

# Avoid rolling back updates if we catch an error above.
        # TODO(lbeliveau): Batch up the port updates in one neutron call.
        for port_id, updates in port_updates:
            if updates:
                LOG.info("Updating port %(port)s with "
                         "attributes %(attributes)s",
                         {"port": port_id, "attributes": updates},
                         instance=instance)
                try:
                    neutron.update_port(port_id, {'port': updates})
                except Exception:
                    with excutils.save_and_reraise_exception():
                        LOG.exception("Unable to update binding details "
                                      "for port %s",
                                      port_id, instance=instance)

in the OSA - OSA case, these calls appear to have no effect, in the RDO - RDO case they cause the internal error (as they are updating the ports rather than activating/deleting as expected)

Revision history for this message
Matt Riedemann (mriedem) wrote :

This scenario doesn't really make sense to me:

1. RDO -> RDO use the old flow (only one entry in the ml2_port_bindings table at all times)

Because if the RDO nodes are running Rocky code, they should hit this in the live migration task in the conductor service:

https://github.com/openstack/nova/blob/stable/rocky/nova/conductor/tasks/live_migrate.py#L282

Which enables the new flow if:

1. The neutron "Port Bindings Extended" API extension is available
2. Both the source and target nova-compute service versions in the "services" table in the cell database are version >= 35, which if you're running Rocky everywhere should be the case, but you said you have some RDO computes that are still Queens.

So I'd double check that first.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Do you have [upgrade_levels]/compute pinned to queens or something in nova.conf?

Revision history for this message
Matt Riedemann (mriedem) wrote :

(4:29:22 PM) NewBruce: [upgrade_levels]
(4:29:22 PM) NewBruce: compute = auto
(4:29:44 PM) mriedem: so that will pin to the lowest nova-compute service version in the deployment while you're upgrading,
(4:29:57 PM) mriedem: so if you have queens computes, it will be using the queens rpc versions and backlevel the migrate_data object
(4:30:12 PM) mriedem: which likely is dropping the vifs information for the 2nd deactivated port binding
(4:30:28 PM) mriedem: that's my guess anyway
(4:30:28 PM) NewBruce: Aha….
(4:30:37 PM) NewBruce: ok, so can we override that?

Revision history for this message
Matt Riedemann (mriedem) wrote :

My suggestion would be to disable the neutron "Port Bindings Extended" API extension until you're fully upgraded to Rocky on all compute nodes because my guess is conductor is creating the deactivated target host port bindings because it's working with two computes that are both rocky, but because [upgrade_levels]/compute=auto and there are queens computes still, the versions of the MigrateData object is getting back-leveled which drops the 'vifs' field and that's what is used in the compute service code to determine if it's doing the new flow or the old flow.

tags: added: live-migration neutron rocky-backport-potential upgrade
Revision history for this message
Robert Colvin (rcolvin) wrote :
Download full text (18.8 KiB)

2019-02-13 13:41:33.890 56155 INFO nova.compute.manager [req-13e8d467-b162-42f4-8c0a-c0fae9e32df7 a3bee416cf67420995855d602d2bccd3 a564613210ee43708b8a7fc6274ebd63 - default default] [BRUCE](M-4): post_live_migration_at_destination: migrate_instance_finish(context, instance(Instance(access_ip_v4=None,access_ip_v6=None,architecture=None,auto_disk_config=False,availability_zone='nova',cell_name=None,cleaned=False,config_drive='',created_at=2019-02-12T11:40:46Z,default_ephemeral_device=None,default_swap_device=None,deleted=False,deleted_at=None,device_metadata=<?>,disable_terminate=False,display_description='osa_compute10_1',display_name='osa_compute10_1',ec2_ids=<?>,ephemeral_gb=0,ephemeral_key_uuid=None,fault=<?>,flavor=Flavor(7),host='cc-compute10-kna1',hostname='osa-compute10-1',id=282634,image_ref='f14f070e-8634-485b-b822-4b12a5bf87a0',info_cache=InstanceInfoCache,instance_type_id=7,kernel_id='',key_data='ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDJvlhaBlEg4cE2rzqr8XU5O/MWxgR2ERPeJEgMZJ0U8QOqKU1WvQUhdvY1aK+ZcIeiqX3HYcUwD3M7FgGCYD1eiKO7wuaBWC0fOxH3l7dx8WMOXW3kYswFgYen565arPu75ETaI4+tz3GFh9vnA+d6rkiCTUL1fvNs7Px37tN7J74Pxox17AuBGnyXtht2Vl0RGbbuHn1vbQ4Kk+Xw3AOgOonltOPW63IuprsiK1brAvwKn8XgLn5eT8/ovpLre2FQSiuzArHA15n+ppSY0du+MlZUQ2y63aKfZmdR765ORlXOiUToGfDxj0oowiTFvJO6VX0rHlkXU4CNlS7h1CsN /usr/local/lib/opensc-pkcs11.so',key_name='rcolvin_yubi',keypairs=<?>,launch_index=0,launched_at=2019-02-12T11:40:53Z,launched_on='cc-compute11-kna1',locked=False,locked_by=None,memory_mb=1024,metadata={},migration_context=<?>,new_flavor=None,node='cc-compute10-kna1',numa_topology=<?>,old_flavor=None,os_type=None,pci_devices=<?>,pci_requests=<?>,power_state=1,progress=80,project_id='a4cac02597bc4013b69a9cd9373f181c',ramdisk_id='',reservation_id='r-xqc9c0u0',root_device_name='/dev/vda',root_gb=20,security_groups=SecurityGroupList,services=<?>,shutdown_terminate=False,system_metadata={boot_roles='_member_,swiftoperator',image_base_image_ref='f14f070e-8634-485b-b822-4b12a5bf87a0',image_container_format='bare',image_disk_format='qcow2',image_min_disk='20',image_min_ram='0',owner_project_name='CN Eng',owner_user_name='CCP_User_35012_b8'},tags=<?>,task_state='migrating',terminated_at=None,trusted_certs=<?>,updated_at=2019-02-13T13:41:22Z,user_data='I2Nsb3VkLWNvbmZpZwpzeXN0ZW1faW5mbzoKIGRlZmF1bHRfdXNlcjoKICBuYW1lOiByY29sdmluCmNocGFzc3dkOiB7IGV4cGlyZTogRmFsc2UgfQpzc2hfcHdhdXRoOiBUcnVlCnBhY2thZ2VfdXBncmFkZTogdHJ1ZQphcHRfbWlycm9yOiBodHRwOi8vbWlycm9yLmNpdHluZXR3b3JrLnNlL3VidW50dS8K',user_id='d42829e49904441eb1834805fa1eb372',uuid=8fe8b490-13aa-4fc7-87a0-39706200b904,vcpu_model=<?>,vcpus=1,vm_mode=None,vm_state='active')), migration{cc-compute10-kna1, cc-compute29-kna1})
2019-02-13 13:41:33.953 56155 INFO nova.network.neutronv2.api [req-13e8d467-b162-42f4-8c0a-c0fae9e32df7 a3bee416cf67420995855d602d2bccd3 a564613210ee43708b8a7fc6274ebd63 - default default] [instance: 8fe8b490-13aa-4fc7-87a0-39706200b904] Updating port e314d12d-aff2-4e60-ae16-3eda2b6156b5 with attributes {'binding:profile': {}, 'binding:host_id': 'cc-compute29-kna1'}

2019-02-13 13:42:21.750 56155 ERROR nova.network.neutronv2.api [req-13e8d467-b162-42f4-8c0a-c0fae9e32df7 a3bee416cf67420995855d602d2bccd3 a5646132...

Revision history for this message
Robert Colvin (rcolvin) wrote :

port_id host vif_type vnic_type vif_details profile status
7cb143c7-2f20-413a-8f74-920ec538460f cc-compute10-kna1 ovs normal {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true} {} ACTIVE
7cb143c7-2f20-413a-8f74-920ec538460f cc-compute24-kna1 ovs normal {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true} INACTIVE
Thu Feb 7 12:37:39 UTC 2019
port_id host vif_type vnic_type vif_details profile status
7cb143c7-2f20-413a-8f74-920ec538460f cc-compute10-kna1 ovs normal {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true} {} ACTIVE
7cb143c7-2f20-413a-8f74-920ec538460f cc-compute24-kna1 ovs normal {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true} INACTIVE
Thu Feb 7 12:37:40 UTC 2019
port_id host vif_type vnic_type vif_details profile status
7cb143c7-2f20-413a-8f74-920ec538460f cc-compute10-kna1 ovs normal {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true} {} ACTIVE
7cb143c7-2f20-413a-8f74-920ec538460f cc-compute24-kna1 ovs normal {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true} INACTIVE
Thu Feb 7 12:37:41 UTC 2019
port_id host vif_type vnic_type vif_details profile status
7cb143c7-2f20-413a-8f74-920ec538460f cc-compute10-kna1 ovs normal {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true} {"migrating_to": "cc-compute24-kna1"} ACTIVE
7cb143c7-2f20-413a-8f74-920ec538460f cc-compute24-kna1 ovs normal {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true} INACTIVE
Thu Feb 7 12:37:42 UTC 2019
port_id host vif_type vnic_type vif_details profile status
7cb143c7-2f20-413a-8f74-920ec538460f cc-compute10-kna1 ovs normal {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true} {"migrating_to": "cc-compute24-kna1"} ACTIVE
7cb143c7-2f20-413a-8f74-920ec538460f cc-compute24-kna1 ovs normal {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true} INACTIVE
Thu Feb 7 12:37:43 UTC 2019
port_id host vif_type vnic_type vif_details profile status
7cb143c7-2f20-413a-8f74-920ec538460f cc-compute10-kna1 ovs normal {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true} {"migrating_to": "cc-compute24-kna1"} ACTIVE
7cb143c7-2f20-413a-8f74-920ec538460f cc-compute24-kna1 ovs normal {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true} INACTIVE
Thu Feb 7 12:37:44 UTC 2019

it never recovers from this point

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/649464

Revision history for this message
Robert Colvin (rcolvin) wrote :

RDO (Rocky)

[root@cc-compute10-kna1 nova]# virsh --version
setlocale: No such file or directory
3.9.0

Name : qemu-kvm
Arch : x86_64
Epoch : 10
Version : 1.5.3
Release : 160.el7_6.1

OSA (Rocky)

root@cc-compute29-kna1:~# virsh --version
4.0.0

qemu-kvm 1:2.11+dfsg-1ubuntu7.6

Revision history for this message
Matt Riedemann (mriedem) wrote :

https://review.openstack.org/#/c/649464/ is an attempt at recreating this in our CI system, if it does recreate the nova-live-migration job should fail in the same way.

Revision history for this message
Matt Riedemann (mriedem) wrote :

I've also posted devstack patch https://review.openstack.org/649470 to pin the [upgrade_levels]/compute=queens RPC API version to see what happens with live migration while all code is newer than queens.

Revision history for this message
Matt Riedemann (mriedem) wrote :

https://review.openstack.org/#/c/649464/ didn't fail the nova-live-migration job, with or without post-copy enabled, meaning the destination port binding gets activated as part of post_live_migration_at_destination, which goes through the code that updates the port's binding:host_id value, which is what you reported caused the failure, so I'm not sure yet how to recreate this.

Revision history for this message
Robert Colvin (rcolvin) wrote : Re: [Bug 1822884] Re: live migration fails due to port binding duplicate key entry in post_live_migrate
Download full text (11.7 KiB)

Hi Matt

Here are some more detailed logs, which may assist - it seems easier to mail than try and copy paste into launchpad; plus I haven’t sanitised them

Compute10 == RDO (Rocky) -> Source
Compute29 == OSA (Rocky) -> Destination
Compute28 == OSA (Rocky) -> Source

i.e.: OSA - OSA is 28 -> 29
RDO - OSA is 10 - 29

Osa_osa_mysql is the port bindings throughout an OSA to osa migration
Mysql is the port bindings through an RDO - OSA migration

- Sorry, I haven’t got a capture like this for RDO - ROD immediately available; I can generate one though if it will help
- the lines [BRUCE] (XXX) are extra Logging statements I’ve added; if it is A-X it is in api.py; if it’s M-X its in migration.py (I can send these modified files as well)
- I have deconstructed the logs and added my own extra comments in some places

--

Robert Colvin
Senior System Engineer
Mobile: +46 700 194847

www.citynetwork.eu <http://www.citynetwork.eu/> | www.citycloud.com <https://www.citycloud.com/>

INNOVATION THROUGH OPEN IT INFRASTRUCTURE
ISO 9001, 14001, 27001, 27015 & 27018 CERTIFIED

> On 4 Apr 2019, at 04:33, Matt Riedemann <email address hidden> wrote:
>
> https://review.openstack.org/#/c/649464/ didn't fail the nova-live-
> migration job, with or without post-copy enabled, meaning the
> destination port binding gets activated as part of
> post_live_migration_at_destination, which goes through the code that
> updates the port's binding:host_id value, which is what you reported
> caused the failure, so I'm not sure yet how to recreate this.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1822884
>
> Title:
> live migration fails due to port binding duplicate key entry in
> post_live_migrate
>
> Status in OpenStack Compute (nova):
> New
>
> Bug description:
> We are converting a site from RDO to OSA; At this stage all control
> nodes and net nodes are running OSA (Rocky), some compute are running
> RDO (Queens), some are RDO (Rocky) and the remaining are OSA (Rocky).
>
> We are attempting to Live Migrate VMs from the RDO (Rocky) nodes to
> OSA (Rocky) before reinstalling the RDO nodes as OSA (Rocky).
>
> When Live Migrating between RDO nodes we see no issues similarly when
> migrating between OSA nodes, we see no issue, however Live Migrating
> RDO -> OSA fails with the below error on the target.
>
> 2019-01-24 13:33:11.701 85926 INFO nova.network.neutronv2.api [req-3c4ceac0-7c12-428d-9f63-7c1d6879be4e a3bee416cf67420995855d602d2bccd3 a564613210ee43708b8a7fc6274ebd63 - default default] [instance: 8ecbfc14-2699-4276-a577-18ed6a662232] Updating port 419c18e1-05c0-44c3-9a97-8334d0c15cc1 with attributes {'binding:profile': {}, 'binding:host_id': 'cc-compute24-kna1'}
> 2019-01-24 13:33:59.357 85926 ERROR nova.network.neutronv2.api [req-3c4ceac0-7c12-428d-9f63-7c1d6879be4e a3bee416cf67420995855d602d2bccd3 a564613210ee43708b8a7fc6274ebd63 - default default] [instance: 8ecbfc14-2699-4276-a577-18ed6a662232] Unable to update binding details for port 419c18e1-05c0-44c3-9a97-8334d0c15cc1: InternalServerError: Request Failed: internal server error while processing your request.
>
...

Robert Colvin (rcolvin)
information type: Public → Private
Revision history for this message
Robert Colvin (rcolvin) wrote :

migrated the entire site to rocky

MariaDB [nova]> select * from services where version != 35 and services.binary = 'nova-compute';
+---------------------+---------------------+---------------------+----+----------------------+--------------+---------+--------------+----------+---------+-----------------+---------------------+-------------+---------+--------------------------------------+
| created_at | updated_at | deleted_at | id | host | binary | topic | report_count | disabled | deleted | disabled_reason | last_seen_up | forced_down | version | uuid |
+---------------------+---------------------+---------------------+----+----------------------+--------------+---------+--------------+----------+---------+-----------------+---------------------+-------------+---------+--------------------------------------+
| 2014-12-16 12:06:06 | 2018-02-08 00:43:36 | 2018-04-25 16:07:06 | 10 | cc-compute01-kna1 | nova-compute | compute | 9898304 | 1 | 10 | NULL | 2018-02-08 00:18:16 | 0 | 0 | NULL |
| 2018-06-18 07:46:53 | 2019-03-05 10:26:11 | 2019-03-05 10:26:21 | 60 | cc-vrtx01-lsd01-kna1 | nova-compute | compute | 2233546 | 1 | 60 | NULL | 2019-03-05 10:26:11 | 0 | 30 | db17d323-b59c-43c6-b976-f7c8990cc362 |
| 2018-06-18 11:14:28 | 2019-03-05 10:26:39 | 2019-03-05 10:26:47 | 61 | cc-vrtx01-lsd02-kna1 | nova-compute | compute | 2241252 | 1 | 61 | NULL | 2019-03-05 10:26:39 | 0 | 30 | 45da9cfb-cb38-435a-803f-2c911fddcdd5 |
+---------------------+---------------------+---------------------+----+----------------------+--------------+---------+--------------+----------+---------+-----------------+---------------------+-------------+---------+--------------------------------------+

bug exists still

2019-04-16 11:55:20.818 36394 ERROR oslo_db.api [req-e2470a21-391b-4268-ae94-a57eb173f19c e7bbce5e15994104bdf5e3af68a55b31 a894e8109af3430aa7ae03e0c49a0aa0 - default default] DB exceeded retry limit.: DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, u"Duplicate entry '45c0656c-c8eb-4019-b1f2-4c90c8d69a7e-cc-compute25-kna1' for key 'PRIMARY'") [SQL: u'UPDATE ml2_port_bindings SET host=%(host)s, profile=%(profile)s, vif_type=%(vif_type)s, vif_details=%(vif_details)s WHERE ml2_port_bindings.port_id = %(ml2_port_bindings_port_id)s AND ml2_port_bindings.host = %(ml2_port_bindings_host)s'] [parameters: {'profile': '{}', 'vif_type': 'unbound', 'ml2_port_bindings_host': u'cc-compute06-kna1', 'vif_details': '', 'ml2_port_bindings_port_id': u'45c0656c-c8eb-4019-b1f2-4c90c8d69a7e', 'host': u'cc-compute25-kna1'}] (Background on this error at: http://sqlalche.me/e/gkpj)

information type: Private → Public
Revision history for this message
Robert Colvin (rcolvin) wrote :

after upgrading all nodes to rocky, RDO - RDO migrations are now broken as well. Testing a migration from 18.1.0 RDO -> 18.1.0 RDO fails with the same error

Revision history for this message
Robert Colvin (rcolvin) wrote :

restarting conductor has not resolved the issue

RDO - RDO live migration still problematic
OSA - OSA works still
Cold Migration (RDO - OSA) works ok

is it possible to disable binding-extended?
is it worth setting [compute]/upgrade_levels = queens ?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/653506

Changed in nova:
assignee: nobody → sean mooney (sean-k-mooney)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/653609

Revision history for this message
Robert Colvin (rcolvin) wrote :

the patch above solves (at least temporarily) the issues; the backport patch has enabled several compute hosts to be live migrated from

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/649464

Revision history for this message
Matt Riedemann (mriedem) wrote :

A bit unrelated but I seem to have recreated the problem in the nova-multi-cell job in this patch for the cross-cell resize tests:

https://review.opendev.org/#/c/656656/

I'm migrating from a source host with a hostname that is a substring of the target host hostname, and mysql seems to not be handling that and raises the unique constraint error.

I'm not an expert on this, but reading http://www.mysqltutorial.org/mysql-unique/ and looking at https://review.opendev.org/#/c/404293/ it seems that primary key in the neutron table should have been a unique index rather than a primary key.

Revision history for this message
Matt Riedemann (mriedem) wrote :

I'm still not really sure how the port binding stuff during live migration is working with the OSA->OSA live migrations but not the RDO stuff, I'm unsure what is updating the port's binding:host_id to the dest host before we call migrate_instance_finish in the OSA case but not the RDO case. Anyway, I for sure have a recreate in the cross-cell resize stuff so likely need Sean's fix anyway, and got the confirmation/details from Miguel in Neutron today on the difference between updating the port's binding:host_id indirectly by activating the dest host binding vs updating the port's binding:host_id directly to point at the dest host (when there is already an inactive dest host port binding). I guess I assumed those worked the same way, i.e. if you activate the inactive dest host port binding it updates the port's binding:host_id (works) and if you update the port's binding:host_id it would activate the dest host port binding (doesn't, and raises the error).

http://eavesdrop.openstack.org/irclogs/%23openstack-neutron/%23openstack-neutron.2019-05-09.log.html#t2019-05-09T20:40:18

Revision history for this message
Matt Riedemann (mriedem) wrote :

I wonder if in your OSA case you had post-copy working during live migration such that the source host would activate the dest host bindings [1] before migrate_instance_finish ran on the dest [2]. Also note that _post_live_migration on the source will activate the dest host port binding as well [3], which is probably why we don't see this in the gate (upstream CI testing).

Anyway, maybe there was some kind of race in the RDO setup that the dest host port binding activation from _post_live_migration (on the source) didn't update the port's binding:host_id by the time migration_instance_finish ran in post_live_migration_at_destination on the dest, but that doesn't really make sense to me (they should be atomic).

Maybe in the RDO case the 'Port Bindings Extended' neutron extension wasn't showing up in compute so it didn't think it could do the port binding activation?

[1] https://opendev.org/openstack/nova/src/tag/18.0.0/nova/compute/manager.py#L1134
[2] https://opendev.org/openstack/nova/src/tag/18.0.0/nova/compute/manager.py#L6757
[3] https://opendev.org/openstack/nova/src/tag/18.0.0/nova/compute/manager.py#L6610

Matt Riedemann (mriedem)
Changed in nova:
importance: Undecided → High
importance: High → Medium
Changed in nova:
assignee: sean mooney (sean-k-mooney) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → sean mooney (sean-k-mooney)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/649464
Reason: This validates that https://review.opendev.org/#/c/653506/ resolves the issue reported in the bug.

Changed in nova:
assignee: sean mooney (sean-k-mooney) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → sean mooney (sean-k-mooney)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/rocky)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/rocky
Review: https://review.opendev.org/653609

Revision history for this message
sean mooney (sean-k-mooney) wrote :
Revision history for this message
sean mooney (sean-k-mooney) wrote :

actull never mind that wrong link

Revision history for this message
Henrik Holmboe (holmboe) wrote :

I read in https://bugs.launchpad.net/nova/+bug/1849802 that "this bug will only happen if you live migrate a vm and it failes for another reason and then you try to live migrate to the same host". I tried to migrate to another compute node and from what I can see this is a workaround to the problem.

Assuming that this bug and that is the same cause to begin with. YMMV.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.