too many qbr or qvo entries on compute node even though I have 7-8 instances on that compute node

Bug #1556549 reported by Rahul Sharma
This bug affects 2 people
Affects                    Status      Importance   Assigned to   Milestone
OpenStack Compute (nova)   Confirmed   Medium       Unassigned
neutron                    Invalid     Medium       Unassigned

Bug Description

I am seeing weird behavior in our production environment. Launching an instance fails, and neither the compute node nor Neutron cleans up the qbr or qvo devices it created, even after we try to terminate the failed instance. Here are the logs from nova-conductor:-
2016-03-08 01:35:49.478 14041 ERROR nova.scheduler.utils [req-6ec7ee4b-9663-4f1b-910a-a87d99ac941c c665814ae07a4f71b666d04fcb99c2e9 a0288bedbb884e07bc0c602e7a343de8 - - -] [instance: fa9c27b4-06dd-4c04-9647-44e1fb8c1a81] Error from last host: compute-42 (node compute-42): [u'Traceback (most recent call last):\n', u' File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2254, in _do_build_and_run_instance\n filter_properties)\n', u' File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2400, in _build_and_run_instance\n instance_uuid=instance.uuid, reason=six.text_type(e))\n', u"RescheduledException: Build of instance fa9c27b4-06dd-4c04-9647-44e1fb8c1a81 was re-scheduled: Error during following call to agent: ['ovs-vsctl', '--timeout=120', '--', '--if-exists', 'del-port', u'qvo3e44fa11-05', '--', 'add-port', 'br-int', u'qvo3e44fa11-05', '--', 'set', 'Interface', u'qvo3e44fa11-05', u'external-ids:iface-id=3e44fa11-05b5-44dc-8c0c-6b937fe7abe0', 'external-ids:iface-status=active', u'external-ids:attached-mac=fa:16:3e:60:aa:5e', 'external-ids:vm-uuid=fa9c27b4-06dd-4c04-9647-44e1fb8c1a81']\n"]

This qvo still exists on the compute node:-
[root@compute-42 rahul]# ifconfig | grep qvo3e44fa11-05
qvo3e44fa11-05: flags=4419<UP,BROADCAST,RUNNING,PROMISC,MULTICAST> mtu 9000 <----- this still exists
[root@compute-42 rahul]# ifconfig | grep qvo | wc -l
392 <------------------------ there are about 350+ such entries
[root@compute-42 rahul]# ifconfig | grep tap | wc -l
8 <----------------------- the compute node is running only 8 instances, yet there are 350+ entries for qvo-XX alone
[root@compute-42 rahul]#
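
For reference, a rough way to enumerate the stale devices is to look for qvo interfaces with no matching tap device. This is only a sketch, assuming the usual Nova hybrid-plug naming where tapXXX, qbrXXX, qvbXXX and qvoXXX all share the same port-id suffix:

for qvo in $(ip -o link show | awk -F': ' '{print $2}' | cut -d'@' -f1 | grep '^qvo'); do
    suffix=${qvo#qvo}
    # a qvo device without a matching tap device belongs to no running instance
    ip link show "tap${suffix}" >/dev/null 2>&1 || echo "stale: ${qvo}"
done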

I am running the Kilo release with RHEL 7 OpenStack RPMs.

Expected:-
Shouldn't the qvo and qvb devices be deleted if creation of the instance has failed?

summary: - too many qbr or qvo entries on compute node even though I have 2-3
+ too many qbr or qvo entries on compute node even though I have 7-8
instances on that compute node
description: updated
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

I am not sure why Neutron is involved here: the hybrid bridge, as well as the tap devices, is managed solely by Nova.

tags: added: ovs
Changed in neutron:
importance: Undecided → Medium
tags: added: needs-attention
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Please at least provide debug logs for neutron-server and the L2 agent used.

Changed in neutron:
status: New → Incomplete
Revision history for this message
Matt Riedemann (mriedem) wrote :

Is this RHEL OSP (the product)? If so, you probably need to open a bugzilla against Red Hat to start since I don't know what they have in their product that might not be in upstream stable/kilo.

Also, are you able to recreate this on stable/liberty or mitaka?

tags: added: libvirt
Changed in nova:
status: New → Incomplete
Revision history for this message
Matt Riedemann (mriedem) wrote :

The full stacktrace should be getting logged:

https://github.com/openstack/nova/blob/stable/kilo/nova/network/linux_net.py#L1363

Which should include the stdout/stderr from the command to explain why it failed. That should be in the nova-compute logs; please provide that stacktrace/error message.

Revision history for this message
Rahul Sharma (rahulsharmaait) wrote :

Hi Matt,

As I debugged further, I found that the issue occurred because the openvswitch service was failing on that particular node with the error:
ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (Protocol error)\n

However, since new instance creation requests kept coming in, Nova kept creating the qvb and qvo interfaces successfully but then, I guess, failed at the next step. For example, the logs state that it failed for this request:-

2016-03-08 01:34:45.083 4444 INFO nova.virt.libvirt.driver [req-c78b194d-6fa0-4cc2-8751-0c9d3fc43bea c665814ae07a4f71b666d04fcb99c2e9 a0288bedbb884e07bc0c602e7a343de8 - - -] [instance: ce125391-b07f-4100-8046-51b982c17553] Creating image
2016-03-08 01:35:03.595 4444 ERROR nova.network.linux_net [req-c78b194d-6fa0-4cc2-8751-0c9d3fc43bea c665814ae07a4f71b666d04fcb99c2e9 a0288bedbb884e07bc0c602e7a343de8 - - -] Unable to execute ['ovs-vsctl', '--timeout=120', '--', '--if-exists', 'del-port', u'qvo2188d93e-29', '--', 'add-port', 'br-int', u'qvo2188d93e-29', '--', 'set', 'Interface', u'qvo2188d93e-29', u'external-ids:iface-id=2188d93e-2945-4f11-80d8-525e8d81957b', 'external-ids:iface-status=active', u'external-ids:attached-mac=fa:16:3e:2d:51:19', 'external-ids:vm-uuid=ce125391-b07f-4100-8046-51b982c17553']. Exception: Unexpected error while running command.
Command: sudo nova-rootwrap /etc/nova/rootwrap.conf ovs-vsctl --timeout=120 -- --if-exists del-port qvo2188d93e-29 -- add-port br-int qvo2188d93e-29 -- set Interface qvo2188d93e-29 external-ids:iface-id=2188d93e-2945-4f11-80d8-525e8d81957b external-ids:iface-status=active external-ids:attached-mac=fa:16:3e:2d:51:19 external-ids:vm-uuid=ce125391-b07f-4100-8046-51b982c17553
Exit code: 1
Stdout: u''
Stderr: u'ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (Protocol error)\n'
2016-03-08 01:35:03.596 4444 ERROR nova.compute.manager [req-c78b194d-6fa0-4cc2-8751-0c9d3fc43bea c665814ae07a4f71b666d04fcb99c2e9 a0288bedbb884e07bc0c602e7a343de8 - - -] [instance: ce125391-b07f-4100-8046-51b982c17553] Instance failed to spawn

However, if I check for qvo2188d93e-29, it is still present:-
[root@compute-42 rahul]# ifconfig qvo2188d93e-29
qvo2188d93e-29: flags=4419<UP,BROADCAST,RUNNING,PROMISC,MULTICAST> mtu 9000
        inet6 fe80::dcf4:caff:fef0:8e5 prefixlen 64 scopeid 0x20<link>
        ether de:f4:ca:f0:08:e5 txqueuelen 1000 (Ethernet)
        RX packets 15 bytes 1206 (1.1 KiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 8 bytes 648 (648.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Due to this, the compute node ended up with more than 350 qvo/qvb pairs. I filed this bug since this behavior seems to mess up the compute node, even though the root cause is that openvswitch is unable to connect to its database. Also, the neutron agent is still reported as up in this case.
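
For what it's worth, the OVSDB failure is easy to confirm directly on the compute node; this is just a sketch, since any ovs-vsctl call exercises the same unix socket:

ovs-vsctl --timeout=5 show        # fails with the same "database connection failed (Protocol error)" while ovsdb-server is in this state
systemctl status openvswitch      # depending on packaging there may also be separate ovsdb-server and ovs-vswitchd units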

Please find logs for nova-compute attached.

Revision history for this message
Rahul Sharma (rahulsharmaait) wrote :

Logs for openvswitch-agent on compute node.

Revision history for this message
Rahul Sharma (rahulsharmaait) wrote :

Neutron server logs.

Revision history for this message
Rahul Sharma (rahulsharmaait) wrote :

I can reproduce this easily on devstack.

ubuntu@devstack:~/devstack$ git branch -a
* master
  remotes/origin/HEAD -> origin/master
  remotes/origin/master
  remotes/origin/stable/kilo
  remotes/origin/stable/liberty
ubuntu@devstack:~/devstack$

Here is the script to make openvswitch fail:-

#!/bin/bash

for i in `seq 1 350`;
do
    /usr/bin/ovsdb-client monitor Interface name,ofport,external_ids --format=json &
done

Once it spawns more than 330 ovsdb-client processes, openvswitch starts failing. Now, if you try to launch an instance, it errors out after some time with a "no valid host" error. Even if you terminate the instance, you can see a leftover veth pair whose two ends are the qvo and qvb devices. The more instances you launch, the more stale veth pairs you end up with on your compute node.

root@devstack:/home/ubuntu# ifconfig | grep qvo
qvo4f07f296-d2 Link encap:Ethernet HWaddr 1e:bc:17:e4:3c:b8
root@devstack:/home/ubuntu#

Rebooting the compute node is the only way to get rid of those pairs, short of removing them manually, but it would be good if those stale entries were not left behind after the instance has failed and been removed.
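
For anyone who needs to clean these up by hand, a rough per-port sketch (using the 3e44fa11-05 suffix from the original report as an example; adapt to the stale qvo names on your node) looks something like:

PORT=3e44fa11-05                                        # suffix taken from the stale qvoXXX device name
ovs-vsctl --if-exists del-port br-int qvo${PORT}        # drop the OVS side, if it was ever added
ip link set qbr${PORT} down && brctl delbr qbr${PORT}   # remove the hybrid bridge, if present
ip link delete qvb${PORT}                               # deleting one end removes the whole veth pair
pkill -f 'ovsdb-client monitor'                         # and stop the leftover monitors if you ran the reproducer above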

Can you please let me know your views on the same?

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

The bug reporter provided a reproducer script. Switching back to "new".

Changed in nova:
status: Incomplete → New
Revision history for this message
Matt Riedemann (mriedem) wrote :

What version of openvswitch are you running? We should probably re-open the neutron part of this since it seems odd that the ovs agent in neutron would still be up even if we can't connect to the database. It also seems like a bug in ovs if it's creating the interface but the command fails and it doesn't clean up after itself. People on the neutron side might know more about that.

Changed in neutron:
status: Incomplete → New
Revision history for this message
Matt Riedemann (mriedem) wrote :

I'm wondering why the ovs db connection failure is happening. I saw something related here; in that case it looks like the compute node was running out of memory. Are you having any issues like that on this node?

https://bugzilla.redhat.com/show_bug.cgi?id=1158701

Revision history for this message
Rahul Sharma (rahulsharmaait) wrote :

ubuntu@devstack:~$ ovs-vsctl --version
ovs-vsctl (Open vSwitch) 2.4.0
Compiled Oct 16 2015 09:22:33
DB Schema 7.12.1
ubuntu@devstack:~$

Revision history for this message
Sean Dague (sdague) wrote :

There is a reproducer script: Confirmed

Changed in nova:
status: New → Confirmed
Changed in neutron:
status: New → Confirmed
Revision history for this message
Ilya Chukhnakov (ichukhnakov) wrote :

This seems to be an OVS limitation. See http://openvswitch.org/pipermail/dev/2016-February/066030.html
More specifically, the patch discussed on the OVS mailing list removed the hard-coded OVSDB limit of 330 sessions:
@@ -130,7 +130,6 @@ ovsdb_jsonrpc_server_create(void)
 {
     struct ovsdb_jsonrpc_server *server = xzalloc(sizeof *server);
     ovsdb_server_init(&server->up);
- server->max_sessions = 330; /* Random limit. */
     shash_init(&server->remotes);
     return server;
 }
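
For reference, a rough way to check whether a node is anywhere near that old 330-session limit is to count the monitor processes and the connections on the OVSDB unix socket; just a sketch, and the numbers are approximate:

pgrep -c -f 'ovsdb-client monitor'               # leftover monitor processes spawned by the reproducer
ss -x | grep -c /var/run/openvswitch/db.sock     # rough count of sockets tied to the OVSDB socket path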

Changed in neutron:
status: Confirmed → Invalid
Changed in nova:
importance: Undecided → Medium
Changed in nova:
assignee: nobody → Rahul Sharma (rahulsharmaait)
Changed in nova:
status: Confirmed → In Progress
Revision history for this message
LIU Yulong (dragon889) wrote :

Hi all, any updates on this bug?

We hit this issue too; something like this:

[root@compute-108-18 nova]# ip a|grep qvo|wc -l
38
[root@compute-108-18 nova]# ip a|grep qvb|wc -l
38
[root@compute-108-18 nova]# ip a|grep tab|wc -l
0

Revision history for this message
Sean Dague (sdague) wrote :

There are no currently open reviews on this bug, changing the status back to the previous state and unassigning. If there are active reviews related to this bug, please include links in comments.

Changed in nova:
status: In Progress → Confirmed
assignee: Rahul Sharma (rahulsharmaait) → nobody