neutron-keepalived-state-change file descriptor leak

Bug #1907411 reported by Junbo Jiang
Affects: neutron
Status: New
Importance: High
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

The fix for https://bugs.launchpad.net/neutron/+bug/1870313 changed the code to use threading to send GARPs. The GARPs work, but the change also introduced another very serious bug: a file descriptor leak!

I tested the train and ussuri branches, and both reproduce the bug. The reproduction steps are simple: create a floating IP with the --port option, which triggers the l3-agent to configure an IP address on the qg- interface. neutron-keepalived-state-change then sends a GARP, and a file named "anon_inode:[eventpoll]" is left in /proc/<pid of neutron-keepalived-state-change>/fd.

As you can imagine, if floating IPs are created and deleted frequently, these leaked entries pile up in /proc/<pid>/fd.

This can also cause the neutron-keepalived-state-change process to consume a huge amount of memory, such as 10G+.
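To check for the leftover entries described above, something like the following may help. It is only a minimal sketch, assuming a Linux host with /proc mounted and permission to read the target process's fd directory (typically the process owner or root). The helper name count_eventpoll_fds is hypothetical and is not part of Neutron.

```python
import os
import select


def count_eventpoll_fds(pid="self"):
    """Count the fds of a process whose /proc link target is an epoll instance."""
    fd_dir = "/proc/{}/fd".format(pid)
    count = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The fd was closed between listdir() and readlink(); skip it.
            continue
        if target == "anon_inode:[eventpoll]":
            count += 1
    return count


if __name__ == "__main__":
    # Self-check on the current process: opening an epoll object adds one
    # such entry, and closing it removes the entry again.
    before = count_eventpoll_fds()
    ep = select.epoll()
    during = count_eventpoll_fds()
    ep.close()
    after = count_eventpoll_fds()
    print(before, during, after)
```

Calling count_eventpoll_fds(<pid of neutron-keepalived-state-change>) before and after a floating IP is created should show whether the count grows.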

Junbo Jiang (junbo)
Changed in neutron:
assignee: nobody → junbo (junbo)
Junbo Jiang (junbo)
description: updated
Slawek Kaplonski (slaweq) wrote :

We recently removed the sending of GARPs from Neutron's code. See https://github.com/openstack/neutron/commit/8e06d1d1eb57e35e6a51ecfe5a76ffa18b5e8c62
So this bug can affect only stable branches up to Victoria.
I'm not sure whether we can really backport https://github.com/openstack/neutron/commit/8e06d1d1eb57e35e6a51ecfe5a76ffa18b5e8c62 to older branches, so if you have an easier way to fix this in stable branches, that may be the way to go here.

Changed in neutron:
importance: Undecided → High
tags: added: l3-ha
Junbo Jiang (junbo) wrote :

Hi slaweq, please see the patch below; it is the easier way to fix the bug.

https://review.opendev.org/c/openstack/neutron/+/766167

description: updated
Changed in neutron:
status: New → In Progress
LIU Yulong (dragon889) wrote :

I could not reproduce the issue on CentOS 7 with the fix for https://bugs.launchpad.net/neutron/+bug/1870313 [1] applied.
Yes, it is a Python 2.7 environment.

[1] https://review.opendev.org/c/openstack/neutron/+/716944

# ps -ef|grep b247f145-569a-4d5a-bdd8-31a5213641ea
yulong 21844 1 0 09:25 ? 00:00:00 haproxy -f /opt/stack/data/neutron/ns-metadata-proxy/b247f145-569a-4d5a-bdd8-31a5213641ea.conf
yulong 21858 1 0 09:25 ? 00:00:00 /usr/bin/python /usr/bin/neutron-keepalived-state-change --router_id=b247f145-569a-4d5a-bdd8-31a5213641ea --namespace=snat-b247f145-569a-4d5a-bdd8-31a5213641ea --conf_dir=/opt/stack/data/neutron/ha_confs/b247f145-569a-4d5a-bdd8-31a5213641ea --monitor_interface=ha-5d9bf0de-1a --monitor_cidr=169.254.0.72/24 --pid_file=/opt/stack/data/neutron/external/pids/b247f145-569a-4d5a-bdd8-31a5213641ea.monitor.pid --state_path=/opt/stack/data/neutron --user=1000 --group=1000
root 22309 1 0 09:25 ? 00:00:00 keepalived -P -f /opt/stack/data/neutron/ha_confs/b247f145-569a-4d5a-bdd8-31a5213641ea/keepalived.conf -p /opt/stack/data/neutron/ha_confs/b247f145-569a-4d5a-bdd8-31a5213641ea.pid -r /opt/stack/data/neutron/ha_confs/b247f145-569a-4d5a-bdd8-31a5213641ea.pid-vrrp -D
root 22310 22309 0 09:25 ? 00:00:00 keepalived -P -f /opt/stack/data/neutron/ha_confs/b247f145-569a-4d5a-bdd8-31a5213641ea/keepalived.conf -p /opt/stack/data/neutron/ha_confs/b247f145-569a-4d5a-bdd8-31a5213641ea.pid -r /opt/stack/data/neutron/ha_confs/b247f145-569a-4d5a-bdd8-31a5213641ea.pid-vrrp -D
yulong 22345 1 0 09:26 ? 00:00:00 radvd -C /opt/stack/data/neutron/ra/b247f145-569a-4d5a-bdd8-31a5213641ea.radvd.conf -p /opt/stack/data/neutron/external/pids/b247f145-569a-4d5a-bdd8-31a5213641ea.pid.radvd -m syslog -u yulong
yulong 22346 22345 0 09:26 ? 00:00:00 radvd -C /opt/stack/data/neutron/ra/b247f145-569a-4d5a-bdd8-31a5213641ea.radvd.conf -p /opt/stack/data/neutron/external/pids/b247f145-569a-4d5a-bdd8-31a5213641ea.pid.radvd -m syslog -u yulong
root 22522 22210 0 09:31 pts/0 00:00:00 grep --color=auto b247f145-569a-4d5a-bdd8-31a5213641ea
[root@network2 ~]# ls /proc/21858/fd
0 1 10 11 12 13 15 2 24 3 4 5 6 7 8 9
[root@network2 ~]#
[root@network2 ~]#
[root@network2 ~]# lsof -p 21858
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
neutron-k 21858 yulong cwd DIR 253,0 244 64 /
neutron-k 21858 yulong rtd DIR 253,0 244 64 /
neutron-k 21858 yulong txt REG 253,0 7144 175436 /usr/bin/python2.7
neutron-k 21858 yulong mem REG 253,0 39848 319610 /usr/lib64/python2.7/lib-dynload/bz2.so
neutron-k 21858 yulong mem REG 253,0 174576 33588996 /usr/lib64/libtinfo.so.5.9
neutron-k 21858 yulong mem REG 253,0 285136 33589164 /usr/lib64/libreadline.so.6.2
neutron-k 21858 yulong mem REG 253,0 28392 319628 /usr/lib64/python2.7/lib-dynload/readline.so
neutron-k 21858 yulong mem REG 253,0 455585 34912648 /usr/lib64/python2.7/site-packages/msgpack/_unpacker.so
neutron-k 21858 yulong mem REG 253,0 995840 33588969 /usr/lib64/libstdc++.so.6.0.19
n...

LIU Yulong (dragon889) wrote :

Sorry, please disregard the "not" in comment 3. The issue is reproducible. There is an fd:
neutron-k 21858 yulong 7u a_inode 0,10 0 6396 [eventpoll]

LIU Yulong (dragon889) wrote :

But after running the following test 100 times:
fip_id=`openstack floating ip create public -c id -f value`
openstack floating ip set --port 6de8c3eb-3a03-45f5-ae3e-933723dc7eff $fip_id
openstack floating ip delete $fip_id

The fd size of that [eventpoll] entry, the number of such [eventpoll] entries, and the total memory of neutron-keepalived-state-change all look fine.

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21858 yulong 20 0 383572 72756 1976 S 0.0 1.9 0:00.04 neutron-keepali

# lsof -p 21858|grep eventpoll
neutron-k 21858 yulong 7u a_inode 0,10 0 6396 [eventpoll]

So am I missing something?
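For longer runs, the create/set/delete cycle above could be scripted. The following is a rough sketch only, assuming the openstack CLI is on PATH and credentials are already loaded in the environment. The function name cycle_floating_ip and its parameters are hypothetical; the three commands are the ones quoted in this comment.

```python
import subprocess


def _run(cmd):
    """Run a command and return its stdout as stripped text."""
    return subprocess.check_output(cmd).decode().strip()


def cycle_floating_ip(port_id, network="public", iterations=100, run=_run):
    """Create, associate and delete a floating IP `iterations` times."""
    for _ in range(iterations):
        # Same three steps as the manual test: create, set --port, delete.
        fip_id = run(["openstack", "floating", "ip", "create", network,
                      "-c", "id", "-f", "value"])
        run(["openstack", "floating", "ip", "set", "--port", port_id, fip_id])
        run(["openstack", "floating", "ip", "delete", fip_id])
```

The run parameter is there only so the loop can be dry-run with a stub instead of the real CLI.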

Junbo Jiang (junbo) wrote :

Hi Yulong,

100 times is not enough. To reproduce the memory issue, you first need the number of open file descriptors to reach the maximum open files limit (ulimit -n). In my test lab, it is 1024.
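To see how close a process is to that limit, one option is to read both numbers from /proc. This is a minimal sketch, assuming Linux and read access to the target process's /proc entries; the helper name open_fd_usage is hypothetical.

```python
import os


def open_fd_usage(pid="self"):
    """Return (open_fd_count, soft_limit) for a process, read from /proc."""
    # For pid="self" the count includes the fd that listdir() itself opens.
    open_fds = len(os.listdir("/proc/{}/fd".format(pid)))
    soft_limit = None
    with open("/proc/{}/limits".format(pid)) as limits:
        for line in limits:
            # Line layout: "Max open files  <soft>  <hard>  files"
            if line.startswith("Max open files"):
                soft = line.split()[3]
                soft_limit = None if soft == "unlimited" else int(soft)
                break
    return open_fds, soft_limit


if __name__ == "__main__":
    used, limit = open_fd_usage()
    print("open fds: {}, soft limit: {}".format(used, limit))
```

Passing the pid of neutron-keepalived-state-change instead of the default shows how many iterations are still needed before the limit is hit.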

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

I've adapted the no-longer-existing "TestMonitorDaemon.test_new_fip_sends_garp" [1] to test this possible problem. In [2] you have the new test, which adds up to 1000 new IPs to the router port. Each new IP triggers a new GARP being sent [2].

During the test (it takes ~500 seconds), I checked the open files for this keepalived_state_change process:
$ watch -n1 -d "lsof -p $pid | ag inode; echo ''; echo 'Total lsof entries'; lsof -p $pid | wc"

I see the number of open files increase while adding new IPs, but it eventually goes back down to the initial number (88 in my test) once the GARP threads end. It takes time, but the number of open files does decrease.

I see the a_inode entry, but the number of such entries is not increasing:
/usr/bin/ 145787 root 6u a_inode 0,14 0 11424 [eventpoll]
/usr/bin/ 145787 root 7u a_inode 0,14 0 11424 [eventpoll]

Regarding the statement made in [3]: you are right that the GARP thread is spawned using threads and the cmd is executed using eventlet. But there is no problem in mixing both: you can have a pool of greenthreads per system thread, and that's totally fine. In this case, the cmd execution is done using a user thread, and that should not cause the number of open files to increase. I don't see a relation between the reported problem and the use of both types of threads.

Regards.

[1] https://github.com/openstack/neutron/blob/stable/train/neutron/tests/functional/agent/l3/test_keepalived_state_change.py#L109
[2] http://paste.openstack.org/show/801128/
[3] https://review.opendev.org/c/openstack/neutron/+/766167/3//COMMIT_MSG

Slawek Kaplonski (slaweq) wrote : auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

Changed in neutron:
assignee: junbo (junbo) → nobody
status: In Progress → New
tags: added: timeout-abandon
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/766167
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.
