Live-migration fails without a logged root cause

Bug #1849802 reported by Barry Price
This bug affects 20 people
Affects: neutron
Status: New
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

Operating system distribution and version: Ubuntu 18.04

Neutron package version: 2:14.0.2-0ubuntu1~cloud0
Nova package version: 2:19.0.1-0ubuntu2.1~cloud0

The cloud was deployed (using Juju charms) as rocky, then upgraded to stein.

There are a number of instances that I need to migrate from one compute node to another, but:

$ openstack server migrate --block-migration 8703d9db-81b0-4e86-a2ef-c4ba5250556c --live shinx --disk-overcommit
Migration pre-check error: Binding failed for port 5a3c5d23-8727-47d2-af72-a53b495358d2, please check neutron logs for more information. (HTTP 400) (Request-ID: req-7c41ae70-6f5b-48a8-9d09-add2bbbe2b7e)
$

However, even with debug logging enabled, all that shows up in the neutron-api logs is:

2019-10-25 09:34:12.147 1569534 INFO neutron.wsgi [req-ac358ed5-cfec-4618-b765-f2defd5a3aac 92e98c5c687a46d29ec28aca3025f3da 7555fff7e7eb4a0eb28149905b266a2b - 207337407e3647798c0f68a0a28a0f8b 207337407e3647798c0f68a0a28a0f8b] 10.x.y.z,127.0.0.1 "POST /v2.0/ports/5a3c5d23-8727-47d2-af72-a53b495358d2/bindings HTTP/1.1" status: 409 len: 371 time: 0.1632745

This suggests that, for some reason, the API call to create the port binding is failing, but there's no further information in the logs for me to debug exactly why.
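For reference, the port's current binding can be inspected from the CLI (a standard OpenStack client command; admin credentials assumed):

openstack port show 5a3c5d23-8727-47d2-af72-a53b495358d2 -c binding_host_id -c status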

Revision history for this message
Matt Riedemann (mriedem) wrote :

Port binding failures usually show up in the neutron agent logs rather than the API logs, so check the neutron agent logs for that port around the time of the failure.
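For example, something along these lines on the source and destination hosts (the log path assumes the Open vSwitch agent on a typical Ubuntu deployment; adjust for your setup):

grep 5a3c5d23-8727-47d2-af72-a53b495358d2 /var/log/neutron/neutron-openvswitch-agent.log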

tags: added: live-migration neutron
Revision history for this message
Barry Price (barryprice) wrote :

I've repeated this today and searched through all logs on all nodes in the cloud, but the only further lines of relevance came from the nova-conductor service:

/var/log/nova/nova-conductor.log:2019-10-26 06:40:21.915 1042354 ERROR nova.network.neutronv2.api [req-2772ad1b-488b-41c7-9916-2808293415a4 349e52d708d542bab601d9698a3e4880 df9337688f7f4041bb2db4c4942de6a0 - 3e15c9d8c5254828a3fa73e76af71f75 3e15c9d8c5254828a3fa73e76af71f75] [instance: 8703d9db-81b0-4e86-a2ef-c4ba5250556c] Binding failed for port 5a3c5d23-8727-47d2-af72-a53b495358d2 and host shinx. Error: (409 {"NeutronError": {"type": "PortBindingAlreadyExists", "message": "Binding for port 5a3c5d23-8727-47d2-af72-a53b495358d2 on host shinx already exists.", "detail": ""}})
/var/log/nova/nova-conductor.log:2019-10-26 06:40:22.909 1042354 WARNING nova.scheduler.utils [req-2772ad1b-488b-41c7-9916-2808293415a4 349e52d708d542bab601d9698a3e4880 df9337688f7f4041bb2db4c4942de6a0 - 3e15c9d8c5254828a3fa73e76af71f75 3e15c9d8c5254828a3fa73e76af71f75] Failed to compute_task_migrate_server: Migration pre-check error: Binding failed for port 5a3c5d23-8727-47d2-af72-a53b495358d2, please check neutron logs for more information.: nova.exception.MigrationPreCheckError: Migration pre-check error: Binding failed for port 5a3c5d23-8727-47d2-af72-a53b495358d2, please check neutron logs for more information.
/var/log/nova/nova-conductor.log:2019-10-26 06:40:22.911 1042354 WARNING nova.scheduler.utils [req-2772ad1b-488b-41c7-9916-2808293415a4 349e52d708d542bab601d9698a3e4880 df9337688f7f4041bb2db4c4942de6a0 - 3e15c9d8c5254828a3fa73e76af71f75 3e15c9d8c5254828a3fa73e76af71f75] [instance: 8703d9db-81b0-4e86-a2ef-c4ba5250556c] Setting instance to ACTIVE state.: nova.exception.MigrationPreCheckError: Migration pre-check error: Binding failed for port 5a3c5d23-8727-47d2-af72-a53b495358d2, please check neutron logs for more information.

So I'm guessing an earlier migration attempt got as far as binding one of this instance's ports on the destination host (shinx) without removing the binding from the source host, and now the port has an entry for both hosts (I'm not particularly familiar with the schema, so I could be way off here).

Revision history for this message
Paul Collins (pjdc) wrote :

After some digging around, and reading LP:1822884 (which may be related, or even the same bug), I found that there were extra entries in the ml2_port_bindings and ml2_port_binding_levels tables for the port in question:

mysql> select * from ml2_port_bindings where port_id = '5a3c5d23-8727-47d2-af72-a53b495358d2';
+--------------------------------------+---------+----------+-----------+---------+----------------------------------------------------------------------------------------------------+----------+
| port_id                              | host    | vif_type | vnic_type | profile | vif_details                                                                                        | status   |
+--------------------------------------+---------+----------+-----------+---------+----------------------------------------------------------------------------------------------------+----------+
| 5a3c5d23-8727-47d2-af72-a53b495358d2 | happiny | ovs      | normal    |         | {"port_filter": true, "ovs_hybrid_plug": true, "datapath_type": "system", "bridge_name": "br-int"} | INACTIVE |
| 5a3c5d23-8727-47d2-af72-a53b495358d2 | jynx    | ovs      | normal    |         | {"port_filter": true, "ovs_hybrid_plug": true, "datapath_type": "system", "bridge_name": "br-int"} | INACTIVE |
| 5a3c5d23-8727-47d2-af72-a53b495358d2 | shinx   | ovs      | normal    |         | {"port_filter": true, "ovs_hybrid_plug": true, "datapath_type": "system", "bridge_name": "br-int"} | INACTIVE |
| 5a3c5d23-8727-47d2-af72-a53b495358d2 | sliggoo | ovs      | normal    |         | {"port_filter": true, "ovs_hybrid_plug": true, "datapath_type": "system", "bridge_name": "br-int"} | ACTIVE   |
+--------------------------------------+---------+----------+-----------+---------+----------------------------------------------------------------------------------------------------+----------+
4 rows in set (0.00 sec)

mysql> select * from ml2_port_binding_levels where port_id = '5a3c5d23-8727-47d2-af72-a53b495358d2';
+--------------------------------------+---------+-------+-------------+--------------------------------------+
| port_id                              | host    | level | driver      | segment_id                           |
+--------------------------------------+---------+-------+-------------+--------------------------------------+
| 5a3c5d23-8727-47d2-af72-a53b495358d2 | happiny |     0 | openvswitch | afaa634a-dff5-437e-a7a0-8630c21b2a73 |
| 5a3c5d23-8727-47d2-af72-a53b495358d2 | jynx    |     0 | openvswitch | afaa634a-dff5-437e-a7a0-8630c21b2a73 |
| 5a3c5d23-8727-47d2-af72-a53b495358d2 | shinx   |     0 | openvswitch | afaa634a-dff5-437e-a7a0-8630c21b2a73 |
| 5a3c5d23-8727-47d2-af72-a53b495358d2 | sliggoo |     0 | openvswitch | afaa634a-dff5-437e-a7a0-8630c21b2a73 |
+--------------------------------------+---------+-------+-------------+--------------------------------------+
4 rows in set (0.00 sec)

After removing the INACTIVE rows from the first table, and the entries for every host exc...


Revision history for this message
do3meli (d-info-e) wrote :

We are experiencing similar behaviour with live migrations: all of a sudden, migrations start to get stuck (with no obvious reason in the nova logs), and we eventually see entries with INACTIVE state in the two tables you mentioned. We are running a stein cloud with provider networks managed by Open vSwitch.

Haw Loeung (hloeung)
Changed in nova:
status: New → Confirmed
Revision history for this message
Tobias Urdin (tobias-urdin) wrote :

Got hit by this as well: when you force a live migration to a specific compute node and it doesn't have the resources, the port binding isn't cleaned up when the migration fails.

I worked around it as suggested above by editing the neutron database (once resources were available on the compute node again):

delete from ml2_port_bindings where port_id='f83f20ad-feff-4369-a752-a81964bcfd52' and host='new-host-that-failed';
delete from ml2_port_binding_levels where port_id='f83f20ad-feff-4369-a752-a81964bcfd52' and host='new-host-that-failed';
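Before deleting, it can help to confirm which rows are actually stale; a minimal check, assuming local mysql access and the default neutron database name:

mysql neutron -e "select host, status from ml2_port_bindings where port_id='f83f20ad-feff-4369-a752-a81964bcfd52';"

Only the INACTIVE row(s) for the host that failed should be removed.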

Revision history for this message
masterpe (michiel-y) wrote :

I see this in Train, thanks for the workaround.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

You don't need to do DB edits for this.
The correct way to work around this is via the neutron API:

NEUTRON_API_URL=$(openstack endpoint list --service neutron --interface public -c URL -f value)
TOKEN=$(openstack token issue -c id -f value)
# PORT_UUID and INACTIVE_HOST are placeholders for the port UUID and the hostname of the stale INACTIVE binding.
curl -X DELETE -H "X-Auth-Token: $TOKEN" "${NEUTRON_API_URL}/v2.0/ports/${PORT_UUID}/bindings/${INACTIVE_HOST}"
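You can also list the existing bindings first to confirm which one is INACTIVE (same variables as above; a GET on the same bindings sub-resource):

curl -s -H "X-Auth-Token: $TOKEN" "${NEUTRON_API_URL}/v2.0/ports/${PORT_UUID}/bindings" | python3 -m json.tool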

We can likely handle this in nova by catching the conflict and deleting and recreating the inactive port binding.

We have temporarily documented the workaround downstream in a KCS article, which I believe is public:
https://access.redhat.com/solutions/5827011
But yes, I think the issue is that nova is not always cleaning up the inactive port bindings if a migration fails.

If you later try to migrate to the same host, it will then fail as reported.

Triaging as medium, as this bug will only happen if you live migrate a VM, the migration fails for another reason, and you then try to live migrate to the same host.
Most operators should not hit this, but it can happen.

Changed in nova:
importance: Undecided → Medium
status: Confirmed → Triaged
Revision history for this message
sean mooney (sean-k-mooney) wrote :

By the way, --live is deprecated, since it does a forced migration, and it is removed in later microversions; you should use --live-migration and --host if you want to select a host.

  --live-migration Live migrate the server. Use the ``--host`` option to specify a target host for the migration which will be validated by the scheduler.
  --live <hostname> **Deprecated** This option is problematic in that it requires a host and prior to compute API version 2.30, specifying a host during live migration will bypass validation by the scheduler
                        which could result in failures to actually migrate the server to the specified host or over-subscribe the host. Use the ``--live-migration`` option instead. If both this option and
                        ``--live-migration`` are used, ``--live-migration`` takes priority.
  --host <hostname> Migrate the server to the specified host. Requires ``--os-compute-api-version`` 2.30 or greater when used with the ``--live-migration`` option, otherwise requires ``--os-compute-api-
                        version`` 2.56 or greater.

If you use --live-migration and --host, the scheduler will check that the instance can actually fit on the host and that the host is appropriate. If you use --live, you as the admin have to validate that manually, because you are forcing the migration, which can lead to broken states.
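For example, using the instance and destination host from this bug (per the help text above, --host with --live-migration requires compute API microversion 2.30 or greater):

openstack --os-compute-api-version 2.30 server migrate --live-migration --host shinx 8703d9db-81b0-4e86-a2ef-c4ba5250556c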

Our experience is that allowing the operator to force evacuations or live migrations has done more harm than good over the years, so we removed that in microversion 2.68:
https://docs.openstack.org/nova/latest/reference/api-microversion-history.html#id62

You can still do it with the old microversion, but if you can avoid it, you should.

Changed in nova:
assignee: nobody → Alexey Stupnikov (astupnikov)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/813659

Revision history for this message
Alexey Stupnikov (astupnikov) wrote :

I have used my lab to understand the original report and what is actually going on. In short, Nova is unable to live migrate the instance because Neutron fails to create the port binding on the destination host when Nova Conductor requests it: the binding already exists.

There could be multiple causes of such behavior, and this bug is not about solving a particular root cause. Instead, the problem is that it is hard to isolate the issue using the Neutron Server logs.

When debug is enabled, Neutron generates logs [1] for such requests. It looks like, for some reason, the following exception raised by Neutron is not logged properly on the Neutron side: we can see it in the Nova logs, but not in the Neutron Server logs. https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/plugin.py#L2463-L2466

As a result, it looks like this problem should be solved by improving logging for Neutron extensions in general, or for this particular function.

With that being said, I am changing the affected product to Neutron and unassigning this bug.

[1]
  2021-10-18 14:15:58.081 18 DEBUG neutron.api.v2.base [req-c237e714-0085-47c6-abab-e7e15cec7ea1 d79d0393fbb04564bc8fd3c62d290087 b8715d57125f4787a6701319d38f61e3 - default default] Request body: {'binding': {'host': 'compute-0.redhat.local', 'vnic_type': 'normal', 'profile': {}}} prepare_request_body /usr/lib/python3.6/site-packages/neutron/api/v2/base.py:719
  2021-10-18 14:15:58.081 18 DEBUG neutron.api.v2.base [req-c237e714-0085-47c6-abab-e7e15cec7ea1 d79d0393fbb04564bc8fd3c62d290087 b8715d57125f4787a6701319d38f61e3 - default default] Unknown quota resources ['binding']. _create /usr/lib/python3.6/site-packages/neutron/api/v2/base.py:490
  2021-10-18 14:15:58.135 18 INFO neutron.api.v2.resource [req-c237e714-0085-47c6-abab-e7e15cec7ea1 d79d0393fbb04564bc8fd3c62d290087 b8715d57125f4787a6701319d38f61e3 - default default] create failed (client error): There was a conflict when trying to complete your request.
  2021-10-18 14:15:58.137 18 INFO neutron.wsgi [req-c237e714-0085-47c6-abab-e7e15cec7ea1 d79d0393fbb04564bc8fd3c62d290087 b8715d57125f4787a6701319d38f61e3 - default default] 172.17.1.17 "POST /v2.0/ports/7542a977-0586-423a-ae35-86e3ff791060/bindings HTTP/1.1" status: 409 len: 364 time: 0.3800023
  2021-10-18 14:15:58.316 21 DEBUG neutron_lib.callbacks.manager [req-334ebddd-4e81-4c12-829c-64f3b0a278ff - - - - -] Notify callbacks ['neutron.services.segments.db._update_segment_host_mapping_for_agent-8793714910767', 'neutron.plugins.ml2.plugin.Ml2Plugin._retry_binding_revived_agents-16758855'] for agent, after_update _notify_loop /usr/lib/python3.6/site-packages/neutron_lib/callbacks/manager.py:193

Changed in nova:
assignee: Alexey Stupnikov (astupnikov) → nobody
affects: nova → neutron
Changed in neutron:
status: Triaged → New
