too many qbr or qvo entries on compute node even though I have 7-8 instances on that compute node

Bug #1556549 reported by Rahul Sharma
This bug affects 2 people
Affects                    Status      Importance   Assigned to   Milestone
OpenStack Compute (nova)   Confirmed   Medium       Unassigned
neutron                    Invalid     Medium       Unassigned

Bug Description

I am seeing weird behavior in our production environment. Launching an instance fails, and neither the compute node nor Neutron cleans up the qbr or qvo devices it created, even after we try to terminate the failed instance. Here are the logs from nova-conductor:-
2016-03-08 01:35:49.478 14041 ERROR nova.scheduler.utils [req-6ec7ee4b-9663-4f1b-910a-a87d99ac941c c665814ae07a4f71b666d04fcb99c2e9 a0288bedbb884e07bc0c602e7a343de8 - - -] [instance: fa9c27b4-06dd-4c04-9647-44e1fb8c1a81] Error from last host: compute-42 (node compute-42): [u'Traceback (most recent call last):\n', u' File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2254, in _do_build_and_run_instance\n filter_properties)\n', u' File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2400, in _build_and_run_instance\n instance_uuid=instance.uuid, reason=six.text_type(e))\n', u"RescheduledException: Build of instance fa9c27b4-06dd-4c04-9647-44e1fb8c1a81 was re-scheduled: Error during following call to agent: ['ovs-vsctl', '--timeout=120', '--', '--if-exists', 'del-port', u'qvo3e44fa11-05', '--', 'add-port', 'br-int', u'qvo3e44fa11-05', '--', 'set', 'Interface', u'qvo3e44fa11-05', u'external-ids:iface-id=3e44fa11-05b5-44dc-8c0c-6b937fe7abe0', 'external-ids:iface-status=active', u'external-ids:attached-mac=fa:16:3e:60:aa:5e', 'external-ids:vm-uuid=fa9c27b4-06dd-4c04-9647-44e1fb8c1a81']\n"]

This qvo still exists on the compute node:-
[root@compute-42 rahul]# ifconfig | grep qvo3e44fa11-05
qvo3e44fa11-05: flags=4419<UP,BROADCAST,RUNNING,PROMISC,MULTICAST> mtu 9000 <----- this still exists
[root@compute-42 rahul]# ifconfig | grep qvo | wc -l
392 <------------------------ there are about 350+ such entries
[root@compute-42 rahul]# ifconfig | grep tap | wc -l
8 <----------------------- the compute node is running only 8 instances, yet there are 350+ entries for qvo-XX alone
[root@compute-42 rahul]#
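
For reference, a rough way to enumerate the stale devices is to look for qvo interfaces with no matching tap device. This is only a sketch, assuming the usual Nova hybrid-plug naming where tapXXX, qbrXXX, qvbXXX and qvoXXX all share the same port-id suffix:

for qvo in $(ip -o link show | awk -F': ' '{print $2}' | cut -d'@' -f1 | grep '^qvo'); do
    suffix=${qvo#qvo}
    # a qvo device without a matching tap device belongs to no running instance
    ip link show "tap${suffix}" >/dev/null 2>&1 || echo "stale: ${qvo}"
done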

I am running the Kilo release with RHEL 7 OpenStack RPMs.

Expected:-
Shouldn't the qvo and qvb devices be deleted if creation of the instance has failed?

summary: - too many qbr or qvo entries on compute node even though I have 2-3
+ too many qbr or qvo entries on compute node even though I have 7-8
instances on that compute node
description: updated
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

I am not sure why Neutron is involved here: the hybrid bridge, as well as the tap devices, is managed solely by Nova.

tags: added: ovs
Changed in neutron:
importance: Undecided → Medium
tags: added: needs-attention
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Please at least provide debug logs for neutron-server and the L2 agent used.

Changed in neutron:
status: New → Incomplete
Revision history for this message
Matt Riedemann (mriedem) wrote :

Is this RHEL OSP (the product)? If so, you probably need to open a bugzilla against Red Hat to start since I don't know what they have in their product that might not be in upstream stable/kilo.

Also, are you able to recreate this on stable/liberty or mitaka?

tags: added: libvirt
Changed in nova:
status: New → Incomplete
Revision history for this message
Matt Riedemann (mriedem) wrote :

The full stacktrace should be getting logged:

https://github.com/openstack/nova/blob/stable/kilo/nova/network/linux_net.py#L1363

Which should include the stdout/stderr from the command to explain why it failed. That should be in the nova-compute logs; please provide that stacktrace/error message.

Revision history for this message
Rahul Sharma (rahulsharmaait) wrote :

Hi Matt,

As I debugged further, I found that the issue occurred because the openvswitch service was failing on that particular node with the error:
ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (Protocol error)\n

However, since new instance creation requests kept coming in, Nova kept creating the qvb and qvo interfaces successfully but then, I guess, failed at the next step. For example, the logs state that it failed for this request:-

2016-03-08 01:34:45.083 4444 INFO nova.virt.libvirt.driver [req-c78b194d-6fa0-4cc2-8751-0c9d3fc43bea c665814ae07a4f71b666d04fcb99c2e9 a0288bedbb884e07bc0c602e7a343de8 - - -] [instance: ce125391-b07f-4100-8046-51b982c17553] Creating image
2016-03-08 01:35:03.595 4444 ERROR nova.network.linux_net [req-c78b194d-6fa0-4cc2-8751-0c9d3fc43bea c665814ae07a4f71b666d04fcb99c2e9 a0288bedbb884e07bc0c602e7a343de8 - - -] Unable to execute ['ovs-vsctl', '--timeout=120', '--', '--if-exists', 'del-port', u'qvo2188d93e-29', '--', 'add-port', 'br-int', u'qvo2188d93e-29', '--', 'set', 'Interface', u'qvo2188d93e-29', u'external-ids:iface-id=2188d93e-2945-4f11-80d8-525e8d81957b', 'external-ids:iface-status=active', u'external-ids:attached-mac=fa:16:3e:2d:51:19', 'external-ids:vm-uuid=ce125391-b07f-4100-8046-51b982c17553']. Exception: Unexpected error while running command.
Command: sudo nova-rootwrap /etc/nova/rootwrap.conf ovs-vsctl --timeout=120 -- --if-exists del-port qvo2188d93e-29 -- add-port br-int qvo2188d93e-29 -- set Interface qvo2188d93e-29 external-ids:iface-id=2188d93e-2945-4f11-80d8-525e8d81957b external-ids:iface-status=active external-ids:attached-mac=fa:16:3e:2d:51:19 external-ids:vm-uuid=ce125391-b07f-4100-8046-51b982c17553
Exit code: 1
Stdout: u''
Stderr: u'ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (Protocol error)\n'
2016-03-08 01:35:03.596 4444 ERROR nova.compute.manager [req-c78b194d-6fa0-4cc2-8751-0c9d3fc43bea c665814ae07a4f71b666d04fcb99c2e9 a0288bedbb884e07bc0c602e7a343de8 - - -] [instance: ce125391-b07f-4100-8046-51b982c17553] Instance failed to spawn

However, if I check for qvo2188d93e-29, it is still present:-
[root@compute-42 rahul]# ifconfig qvo2188d93e-29
qvo2188d93e-29: flags=4419<UP,BROADCAST,RUNNING,PROMISC,MULTICAST> mtu 9000
        inet6 fe80::dcf4:caff:fef0:8e5 prefixlen 64 scopeid 0x20<link>
        ether de:f4:ca:f0:08:e5 txqueuelen 1000 (Ethernet)
        RX packets 15 bytes 1206 (1.1 KiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 8 bytes 648 (648.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Due to this, the compute node ended up with more than 350 qvo/qvb pairs. I filed this bug since this behavior seems to mess up the compute node, even though the root cause is that openvswitch is unable to connect to its database. Also, the neutron agent is still reported as up in this case.
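
For what it's worth, the OVSDB failure is easy to confirm directly on the compute node; this is just a sketch, since any ovs-vsctl call exercises the same unix socket:

ovs-vsctl --timeout=5 show        # fails with the same "database connection failed (Protocol error)" while ovsdb-server is in this state
systemctl status openvswitch      # depending on packaging there may also be separate ovsdb-server and ovs-vswitchd units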

Please find logs for nova-compute attached.

Revision history for this message
Rahul Sharma (rahulsharmaait) wrote :

Logs for openvswitch-agent on compute node.

Revision history for this message
Rahul Sharma (rahulsharmaait) wrote :

Neutron server logs.

Revision history for this message
Rahul Sharma (rahulsharmaait) wrote :

I can reproduce this easily on devstack.

ubuntu@devstack:~/devstack$ git branch -a
* master
  remotes/origin/HEAD -> origin/master
  remotes/origin/master
  remotes/origin/stable/kilo
  remotes/origin/stable/liberty
ubuntu@devstack:~/devstack$

Here is the script to make openvswitch fail:-

#!/bin/bash

for i in `seq 1 350`;
do
    /usr/bin/ovsdb-client monitor Interface name,ofport,external_ids --format=json &
done

Once it spawns more than 330 ovsdb-client processes, openvswitch starts failing. Now, if you try to launch an instance, it errors out after some time with a "no valid host" error. Even if you terminate the instance, you can see a leftover veth pair whose two ends are the qvo and qvb devices. The more instances you launch, the more stale veth pairs you end up with on your compute node.

root@devstack:/home/ubuntu# ifconfig | grep qvo
qvo4f07f296-d2 Link encap:Ethernet HWaddr 1e:bc:17:e4:3c:b8
root@devstack:/home/ubuntu#

Rebooting the compute node is the only way to get rid of those pairs, short of removing them manually, but it would be good if those stale entries were not left behind after the instance has failed and been removed.
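
For anyone who needs to clean these up by hand, a rough per-port sketch (using the 3e44fa11-05 suffix from the original report as an example; adapt to the stale qvo names on your node) looks something like:

PORT=3e44fa11-05                                        # suffix taken from the stale qvoXXX device name
ovs-vsctl --if-exists del-port br-int qvo${PORT}        # drop the OVS side, if it was ever added
ip link set qbr${PORT} down && brctl delbr qbr${PORT}   # remove the hybrid bridge, if present
ip link delete qvb${PORT}                               # deleting one end removes the whole veth pair
pkill -f 'ovsdb-client monitor'                         # and stop the leftover monitors if you ran the reproducer above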

Can you please let me know your views on the same?

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

The bug reporter provided a reproducer script. Switching back to "new".

Changed in nova:
status: Incomplete → New
Revision history for this message
Matt Riedemann (mriedem) wrote :

What version of openvswitch are you running? We should probably re-open the neutron part of this since it seems odd that the ovs agent in neutron would still be up even if we can't connect to the database. It also seems like a bug in ovs if it's creating the interface but the command fails and it doesn't clean up after itself. People on the neutron side might know more about that.

Changed in neutron:
status: Incomplete → New
Revision history for this message
Matt Riedemann (mriedem) wrote :

I'm wondering why the ovs db connection failure is happening. I saw something related here; in that case it looks like the compute node was running out of memory. Are you having any issues like that on this node?

https://bugzilla.redhat.com/show_bug.cgi?id=1158701

Revision history for this message
Rahul Sharma (rahulsharmaait) wrote :

ubuntu@devstack:~$ ovs-vsctl --version
ovs-vsctl (Open vSwitch) 2.4.0
Compiled Oct 16 2015 09:22:33
DB Schema 7.12.1
ubuntu@devstack:~$

Revision history for this message
Sean Dague (sdague) wrote :

There is a reproducer script: Confirmed

Changed in nova:
status: New → Confirmed
Changed in neutron:
status: New → Confirmed
Revision history for this message
Ilya Chukhnakov (ichukhnakov) wrote :

This seems to be an OVS limitation. See http://openvswitch.org/pipermail/dev/2016-February/066030.html
More specifically, the patch discussed on the OVS mailing list removed the hard-coded OVSDB limit of 330 sessions:
@@ -130,7 +130,6 @@ ovsdb_jsonrpc_server_create(void)
 {
     struct ovsdb_jsonrpc_server *server = xzalloc(sizeof *server);
     ovsdb_server_init(&server->up);
- server->max_sessions = 330; /* Random limit. */
     shash_init(&server->remotes);
     return server;
 }
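
For reference, a rough way to check whether a node is anywhere near that old 330-session limit is to count the monitor processes and the connections on the OVSDB unix socket; just a sketch, and the numbers are approximate:

pgrep -c -f 'ovsdb-client monitor'               # leftover monitor processes spawned by the reproducer
ss -x | grep -c /var/run/openvswitch/db.sock     # rough count of sockets tied to the OVSDB socket path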

Changed in neutron:
status: Confirmed → Invalid
Changed in nova:
importance: Undecided → Medium
Changed in nova:
assignee: nobody → Rahul Sharma (rahulsharmaait)
Changed in nova:
status: Confirmed → In Progress
Revision history for this message
LIU Yulong (dragon889) wrote :

Hi all, any updates on this bug?

We hit this issue too; something like this:

[root@compute-108-18 nova]# ip a|grep qvo|wc -l
38
[root@compute-108-18 nova]# ip a|grep qvb|wc -l
38
[root@compute-108-18 nova]# ip a|grep tab|wc -l
0

Revision history for this message
Sean Dague (sdague) wrote :

There are no currently open reviews on this bug, changing the status back to the previous state and unassigning. If there are active reviews related to this bug, please include links in comments.

Changed in nova:
status: In Progress → Confirmed
assignee: Rahul Sharma (rahulsharmaait) → nobody