live migration fails due to port binding duplicate key entry in post_live_migrate
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | In Progress | Medium | sean mooney |
Rocky | Triaged | Medium | Unassigned |
Stein | Triaged | Medium | Unassigned |
Bug Description
We are converting a site from RDO to OSA; at this stage all control nodes and net nodes are running OSA (Rocky), some compute nodes are running RDO (Queens), some are RDO (Rocky), and the remaining are OSA (Rocky).
We are attempting to Live Migrate VMs from the RDO (Rocky) nodes to OSA (Rocky) before reinstalling the RDO nodes as OSA (Rocky).
When live migrating between RDO nodes we see no issues, and similarly when migrating between OSA nodes; however, live migrating RDO -> OSA fails with the below error on the target.
2019-01-24 13:33:11.701 85926 INFO nova.network.
2019-01-24 13:33:59.357 85926 ERROR nova.network.
Digging further into the logs reveals an issue with duplicate keys:
2019-02-01 09:48:10.268 11854 ERROR oslo_db.api [req-152bce20-
pute24-kna1' for key 'PRIMARY'") [SQL: u'UPDATE ml2_port_bindings SET host=%(host)s, profile=
pe': 'unbound', 'ml2_port_
This is confirmed when reviewing the ml2_port_bindings table:
MariaDB [neutron]> select * from ml2_port_bindings where port_id = '5bedceef-
+------
| port_id | host | vif_type | vnic_type | vif_details | profile | status |
+------
| 5bedceef-
| 5bedceef-
+------
The exception is not caught and handled, and the VM is stuck in the migrating state. According to OpenStack the VM is still on the source compute node, whilst libvirt/virsh believes it to be on the target. Forcing the VM state to active keeps the VM available, but rebooting it will result in an ERROR state (this is resolved by destroying the VM in virsh on the target, forcing it back to the active state, and rebooting). It also cannot be migrated again due to the error state in the DB (this can be fixed by manually removing the inactive port binding and clearing the profile from the active one).
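For reference, the same manual cleanup can be done through the Neutron API instead of editing the database directly, assuming the binding-extended extension is available. This is only a sketch; the endpoint, token and hostnames are placeholders:

import requests

NEUTRON = "http://neutron.example.com:9696/v2.0"   # placeholder endpoint
HEADERS = {"X-Auth-Token": "<token>", "Content-Type": "application/json"}

def cleanup(port_id, stale_host):
    # Remove the leftover INACTIVE binding for the target host.
    requests.delete("{}/ports/{}/bindings/{}".format(NEUTRON, port_id, stale_host),
                    headers=HEADERS)
    # Clear the binding profile left behind on the port's ACTIVE binding.
    requests.put("{}/ports/{}".format(NEUTRON, port_id),
                 json={"port": {"binding:profile": {}}}, headers=HEADERS)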
In discussions with both mnasser and sean-k-mooney, it is understood that there are two distinct live migration flows with regard to port binding:
1) the "old" flow: the port is deactivated on the source before being activated on the target, meaning only one entry in the ml2_port_bindings table at all times, at the expense of an added network outage during live migration
2) the "new" flow: an inactive port binding is added for the target, before the old binding is removed and the new binding activated (sketched below)
We can see by monitoring the ml2_port_bindings table during live migrations that:
1. RDO -> RDO use the old flow (only one entry in the ml2_port_bindings table at all times)
2. OSA -> OSA uses the new flow (two entries which are cleaned up)
3. RDO -> OSA use the new flow, two entries, which are not cleaned up
This is unexpected, as even in the RDO to RDO case, both nodes are Rocky and so the new process should be in use.
Adding more debug statements to nova/network/
if port_migrating or teardown:
# Now get the port details to process the ports
# binding profile info.
Looking at the state here:
RDO - OSA (2019-02-13 13:41:33.809)
-------
{
device_
tenant_
BINDING_
}
OSA - OSA (2019-02-13 14:33:36.333)
-------
{
device_
tenant_
BINDING_
}
At this point in the flow, BINDING_HOST_ID is the source hostname
continuing:
data = self.list_
ports = data['ports']
LOG.info(
and examining again:
RDO - OSA (2019-02-13 13:41:33.887)
-------
ports =
[
....
u'binding:host_id': u'cc-compute10-
....
}
]
OSA - OSA: (2019-02-13 14:33:36.422)
-------
ports =
[
....
u'binding:host_id': u'cc-compute29-
....
}
]
Now we can see that in the RDO - OSA case the binding:host_id returned in the ports is the source hostname, whereas in the OSA - OSA case the binding:host_id is the target hostname.
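A quick way to see which binding Neutron considers ACTIVE for a given port, without querying MariaDB directly, is the bindings sub-resource of the binding-extended API. Again only a sketch, with placeholder endpoint and token:

import requests

NEUTRON = "http://neutron.example.com:9696/v2.0"   # placeholder endpoint
HEADERS = {"X-Auth-Token": "<token>"}

def show_bindings(port_id):
    # Lists every binding row for the port, including its host and status.
    resp = requests.get("{}/ports/{}/bindings".format(NEUTRON, port_id),
                        headers=HEADERS)
    for binding in resp.json().get("bindings", []):
        print(binding["host"], binding["status"], binding.get("vif_type"))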
The actual fault itself is in:
self.network_
migration = {'source_compute': instance.host,
If we trace those calls in the RDO - OSA case, then we will end up here:
# Avoid rolling back updates if we catch an error above.
# TODO(lbeliveau): Batch up the port updates in one neutron call.
for port_id, updates in port_updates:
if updates:
In the OSA - OSA case these calls appear to have no effect; in the RDO - OSA case they cause the internal error (as they are updating the ports rather than activating/deleting bindings as expected).
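For illustration, the kind of port update this path ends up issuing looks roughly like the following (sketch only, placeholders throughout). Because an INACTIVE binding already exists for the target host, ML2's resulting UPDATE of ml2_port_bindings collides with the primary key (which in Rocky appears to include the host column) and raises the duplicate-entry error shown above:

import requests

NEUTRON = "http://neutron.example.com:9696/v2.0"   # placeholder endpoint
HEADERS = {"X-Auth-Token": "<token>", "Content-Type": "application/json"}

# Updating binding:host_id on the port makes ML2 rewrite the existing ACTIVE
# row in ml2_port_bindings; if an INACTIVE row for the same host is already
# present, that UPDATE hits the duplicate primary key seen in the traceback.
requests.put("{}/ports/{}".format(NEUTRON, "<port-id>"),
             json={"port": {"binding:host_id": "<target-host>",
                            "binding:profile": {}}},
             headers=HEADERS)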
information type: Public → Private
Changed in nova:
importance: Undecided → High
importance: High → Medium
Changed in nova:
assignee: sean mooney (sean-k-mooney) → Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → sean mooney (sean-k-mooney)
Changed in nova:
assignee: sean mooney (sean-k-mooney) → Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → sean mooney (sean-k-mooney)
This scenario doesn't really make sense to me:
1. RDO -> RDO use the old flow (only one entry in the ml2_port_bindings table at all times)
Because if the RDO nodes are running Rocky code, they should hit this in the live migration task in the conductor service:
https://github.com/openstack/nova/blob/stable/rocky/nova/conductor/tasks/live_migrate.py#L282
Which enables the new flow if:
1. The neutron "Port Bindings Extended" API extension is available
2. Both the source and target nova-compute service versions in the "services" table in the cell database are >= 35, which should be the case if you're running Rocky everywhere, but you said you have some RDO computes that are still on Queens.
So I'd double check that first.
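Something like the following can be used to verify both preconditions. This is a sketch: the Neutron endpoint and token are placeholders, and the services query is run against the cell database with whatever DB client you prefer:

import requests

NEUTRON = "http://neutron.example.com:9696/v2.0"   # placeholder endpoint
HEADERS = {"X-Auth-Token": "<token>"}

# 1. Is the "Port Bindings Extended" extension (alias binding-extended)
#    exposed by Neutron?
exts = requests.get("{}/extensions".format(NEUTRON), headers=HEADERS).json()
print(any(ext.get("alias") == "binding-extended" for ext in exts.get("extensions", [])))

# 2. Are all nova-compute services at version >= 35? Run against the cell DB:
#      SELECT host, version FROM services
#      WHERE binary = 'nova-compute' AND deleted = 0;
#    Every row needs version >= 35 for the new flow to be selected.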