[3.0.2.0-32~liberty] 100K DHCP Request: Agent stops responding and is in deadlock
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Juniper Openstack | Status tracked in Trunk | | |
R3.0 | Fix Committed | High | Hari Prasad Killi |
Trunk | Fix Committed | High | Hari Prasad Killi |
Bug Description
While sending 100K DHCP requests from a BMS to the TSN, the TSN stops responding to DHCP requests after some time.
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on p514p1, link-type EN10MB (Ethernet), capture size 65535 bytes
08:35:30.379967 IP 32.32.32.32.7893 > 172.17.90.6.4789: VXLAN, flags [I] (0x08), vni 1021
IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:03:01:01:02:f6, length 261
08:35:30.379973 IP 32.32.32.32.7893 > 172.17.90.6.4789: VXLAN, flags [I] (0x08), vni 1021
IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:03:01:01:02:f6, length 261
root@5b7s6:~# vxlan --get 1021
VXLAN Table
VNID NextHop
----------------
1021 30
root@5b7s6:~# nh --get 30
Id:30 Type:Vrf_Translate Fmly: AF_INET Rid:0 Ref_cnt:2 Vrf:63
root@5b7s6:~# rt --dump 63 --family bridge
Flags: L=Label Valid, Df=DHCP flood
vRouter bridge table 0/63
Index DestMac Flags Label/VNID Nexthop
192464 0:0:5e:0:1:0 Df - 3
212412 ff:ff:ff:ff:ff:ff LDf 1021 178
380336 90:e2:ba:a7:32:24 Df - 3
531528 0:3:1:1:2:f6 L 1021 13
988496 0:3:1:1:2:f7 - 1
root@5b7s6:~#
root@5b7s6:~# nh --get 13
Id:13 Type:Tunnel Fmly: AF_INET Rid:0 Ref_cnt:252 Vrf:0
Oif:0 Len:14 Flags Valid, Vxlan, Data:0c 86 10 3c 2b 00 90 e2 ba a7 32 24 08 00
Vrf:0 Sip:172.17.90.6 Dip:32.32.32.32
root@5b7s6:~# dropstats | grep -v '0$'
Discards 951025
Cloned Original 5249517
Invalid NH 31436
Invalid Mcast Source 1830216
Duplicated 14
Invalid VNID 452397
No L2 Route 36811
root@5b7s6:~# dropstats | grep -v '0$'
Discards 951025
Cloned Original 5249517
Invalid NH 31436
Invalid Mcast Source 1830216
Duplicated 14
Invalid VNID 452397
No L2 Route 36811
(gdb) info thr
Id Target Id Frame
23 Thread 0x7f87c30cd700 (LWP 28327) "contrail-vroute" syscall () at ../sysdeps/
22 Thread 0x7f87c2ccc700 (LWP 28328) "contrail-vroute" 0x00007f87ca8383bd in read () at ../sysdeps/
21 Thread 0x7f87c28cb700 (LWP 28329) "contrail-vroute" __lll_lock_
20 Thread 0x7f87c24ca700 (LWP 28330) "contrail-vroute" syscall () at ../sysdeps/
19 Thread 0x7f87c20c9700 (LWP 28332) "contrail-vroute" syscall () at ../sysdeps/
18 Thread 0x7f87c1cc8700 (LWP 28331) "contrail-vroute" syscall () at ../sysdeps/
17 Thread 0x7f87c18c7700 (LWP 28333) "contrail-vroute" syscall () at ../sysdeps/
16 Thread 0x7f87c14c6700 (LWP 28334) "contrail-vroute" syscall () at ../sysdeps/
15 Thread 0x7f87c0ab0700 (LWP 28693) "contrail-vroute" syscall () at ../sysdeps/
14 Thread 0x7f87c06af700 (LWP 28694) "contrail-vroute" _int_malloc (av=0x7f8780000020, bytes=24) at malloc.c:3472
13 Thread 0x7f872e3ff700 (LWP 31642) "contrail-vroute" syscall () at ../sysdeps/
12 Thread 0x7f872dffe700 (LWP 31643) "contrail-vroute" syscall () at ../sysdeps/
11 Thread 0x7f86c66ff700 (LWP 1463) "contrail-vroute" syscall () at ../sysdeps/
10 Thread 0x7f86c62fe700 (LWP 1464) "contrail-vroute" __lll_lock_
9 Thread 0x7f85b25ff700 (LWP 7979) "contrail-vroute" syscall () at ../sysdeps/
8 Thread 0x7f85b21fe700 (LWP 7980) "contrail-vroute" syscall () at ../sysdeps/
7 Thread 0x7f85e3fff700 (LWP 26173) "contrail-vroute" syscall () at ../sysdeps/
6 Thread 0x7f85e3bfe700 (LWP 26174) "contrail-vroute" syscall () at ../sysdeps/
5 Thread 0x7f85c3bfe700 (LWP 2733) "contrail-vroute" syscall () at ../sysdeps/
4 Thread 0x7f85c37fd700 (LWP 2734) "contrail-vroute" syscall () at ../sysdeps/
3 Thread 0x7f85c3fff700 (LWP 2735) "contrail-vroute" syscall () at ../sysdeps/
2 Thread 0x7f85c33fc700 (LWP 2736) "contrail-vroute" syscall () at ../sysdeps/
* 1 Thread 0x7f87cc7ad7c0 (LWP 28308) "contrail-vroute" _int_malloc (av=0x7f87c9dce760 <main_arena>, bytes=9060) at malloc.c:3775
(gdb) thr 10
[Switching to thread 10 (Thread 0x7f86c62fe700 (LWP 1464))]
#0 __lll_lock_
95 ../nptl/
(gdb) bt
#0 __lll_lock_
#1 0x00007f87c9a94bcb in _L_lock_4651 () at malloc.c:5206
#2 0x00007f87c9a8f3e3 in _int_free (av=0x7f87c9dce760 <main_arena>, p=0xaa7708f480, have_lock=0) at malloc.c:3943
#3 0x0000000000c3142a in PacketBuffer:
#4 0x000000000082bb8e in boost::
#5 0x0000000000c31efc in PktInfo::~PktInfo() ()
#6 0x000000000082bb8e in boost::
#7 0x0000000000c4755e in tbb::strict_
#8 0x0000000000c476b6 in tbb::strict_
#9 0x0000000000c47fab in QueueTaskRunner
#10 0x000000000118d89c in TaskImpl::execute() ()
#11 0x00007f87ca615b3a in ?? () from /usr/lib/
#12 0x00007f87ca611816 in ?? () from /usr/lib/
#13 0x00007f87ca610f4b in ?? () from /usr/lib/
#14 0x00007f87ca60d0ff in ?? () from /usr/lib/
#15 0x00007f87ca60d2f9 in ?? () from /usr/lib/
#16 0x00007f87ca831182 in start_thread (arg=0x7f86c62f
#17 0x00007f87c9b0a47d in clone () at ../sysdeps/
(gdb) thr 21
[Switching to thread 21 (Thread 0x7f87c28cb700 (LWP 28329))]
#0 __lll_lock_
95 in ../nptl/
(gdb) bt
#0 __lll_lock_
#1 0x00007f87c9a94bcb in _L_lock_4651 () at malloc.c:5206
#2 0x00007f87c9a8f3e3 in _int_free (av=0x7f87c9dce760 <main_arena>, p=0x1c4b2bdbb0, have_lock=0) at malloc.c:3943
#3 0x0000000000c3142a in PacketBuffer:
#4 0x000000000082bb8e in boost::
#5 0x0000000000c31efc in PktInfo::~PktInfo() ()
#6 0x000000000082bb8e in boost::
#7 0x0000000000c47fd2 in QueueTaskRunner
#8 0x000000000118d89c in TaskImpl::execute() ()
#9 0x00007f87ca615b3a in ?? () from /usr/lib/
#10 0x00007f87ca611816 in ?? () from /usr/lib/
#11 0x00007f87ca610f4b in ?? () from /usr/lib/
#12 0x00007f87ca60d0ff in ?? () from /usr/lib/
#13 0x00007f87ca60d2f9 in ?? () from /usr/lib/
#14 0x00007f87ca831182 in start_thread (arg=0x7f87c28c
#15 0x00007f87c9b0a47d in clone () at ../sysdeps/
Changed in juniperopenstack: | |
importance: | Undecided → High |
assignee: | nobody → Hari Prasad Killi (haripk) |
information type: | Proprietary → Public |
summary: |
- [3.0.2.0-32~liberty] 100K DHCP Request: Agent stope responding and is in dead lock
+ [3.0.2.0-32~liberty] 100K DHCP Request: Agent stops responding and is in deadlock
tags: | added: blocker |
There is no deadlock: DHCP requests were arriving at 100K per second while being processed at 30K per second. This built up a large backlog, which takes time to clear. We need to handle this case (rate control, or discarding requests beyond a threshold).
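To make the backlog arithmetic in the comment above concrete, here is a minimal Python sketch. It is not the agent's code or the committed fix. The helper names (`backlog_after`, `drain_time`, `BoundedQueue`), the 10-second burst duration, and the tail-drop policy are all assumptions chosen for illustration; only the 100K/s arrival rate and 30K/s processing rate come from the comment.

```python
from collections import deque


def backlog_after(arrival_rate, service_rate, seconds):
    """Requests still waiting after `seconds` of sustained load (never negative)."""
    return max(0, (arrival_rate - service_rate) * seconds)


def drain_time(backlog, service_rate):
    """Seconds needed to clear `backlog` once new arrivals stop."""
    return backlog / service_rate


class BoundedQueue:
    """Hypothetical FIFO that discards new items once `limit` entries are waiting."""

    def __init__(self, limit):
        self.limit = limit
        self.items = deque()
        self.dropped = 0  # count of discarded items, for monitoring

    def offer(self, item):
        """Enqueue `item`; return False and count a drop if the queue is full."""
        if len(self.items) >= self.limit:
            self.dropped += 1
            return False
        self.items.append(item)
        return True

    def poll(self):
        """Dequeue the oldest item, or return None if the queue is empty."""
        return self.items.popleft() if self.items else None


if __name__ == "__main__":
    # Rates from the comment; the 10 s burst length is an assumed example.
    backlog = backlog_after(100_000, 30_000, 10)
    print("backlog:", backlog)
    print("seconds to drain:", drain_time(backlog, 30_000))
```

With these rates the backlog grows by 70K requests per second, so an assumed 10-second burst leaves 700,000 requests queued and takes roughly 23 seconds to drain after traffic stops, during which the agent appears unresponsive. Capping the queue bounds that recovery time at `limit / service_rate`, at the cost of dropping requests that DHCP clients would then have to retry.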