recvfrom SYSCALL infinite loop/deadlock chewing 100% CPU (MSG_PEEK|MSG_WAITALL)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Linux | Unknown | Unknown | |
linux (Ubuntu) | Fix Released | High | Joseph Salisbury |
Trusty | Fix Released | High | Joseph Salisbury |
Vivid | Fix Released | High | Joseph Salisbury |
Wily | Fix Released | High | Joseph Salisbury |
linux-lts-utopic (Ubuntu) | | | |
Trusty | Fix Released | High | Unassigned |
Bug Description
In a multi-threaded pthreads process running on Ubuntu 14.04 AMD64 (with over 1000 threads) that uses real-time FIFO scheduling, we occasionally see calls to recv() with flags (MSG_PEEK | MSG_WAITALL) get stuck in an infinite loop or deadlock. The affected threads lock up inside recv() and, because of the FIFO scheduling, consume as much CPU as they can.
Here's an example gdb back trace:
[Switching to thread 4 (Thread 0x7f6040546700 (LWP 27251))]
#0 0x00007f6231d2f7eb in __libc_recv (fd=fd@entry=146, buf=buf@
33 ../sysdeps/
(gdb) bt
#0 0x00007f6231d2f7eb in __libc_recv (fd=fd@entry=146, buf=buf@
#1 0x0000000000421945 in recv (__flags=258, __n=5, __buf=0x7f60405
[snip]
The socket is a TCP socket in blocking mode. The recv() call is inside an outer loop with a counter; I've checked the counter with gdb and it is always at 1, so the outer loop isn't the problem. The thread is stuck inside the recv() internals.
Other notes:
* There always seem to be 2 or more threads stuck in the same place (same recv() call but with distinct FDs)
* The threads calling recv() have cancellation disabled by previously executing: thread_
I've even tried adding a poll() call for POLLRDNORM on the socket before calling recv() with MSG_PEEK | MSG_WAITALL flags to try to make sure there's data available on the socket before calling *recv()*, but it makes no difference.
I don't know what is wrong here. I've read all the recv() documentation and believe that recv() is being used correctly. The only conclusion I can come to is that there is a bug in libc recv() when using flags MSG_PEEK | MSG_WAITALL with thousands of pthreads running.
===
break-fix: - dfbafc995304ebb
Changed in linux (Ubuntu): | |
importance: | Undecided → High |
status: | New → Triaged |
no longer affects: | linux-lts-utopic (Ubuntu Wily) |
no longer affects: | linux-lts-utopic (Ubuntu Vivid) |
Changed in linux-lts-utopic (Ubuntu Trusty): | |
status: | New → Fix Committed |
Changed in linux (Ubuntu Wily): | |
status: | In Progress → Fix Committed |
Changed in linux (Ubuntu Vivid): | |
status: | In Progress → Fix Committed |
Changed in linux (Ubuntu Trusty): | |
status: | In Progress → Fix Committed |
tags: | added: kernel-bug-break-fix |
description: | updated |
Changed in linux-lts-utopic (Ubuntu): | |
status: | New → Fix Committed |
importance: | Undecided → High |
Changed in linux-lts-utopic (Ubuntu Trusty): | |
importance: | Undecided → High |
no longer affects: | linux-lts-utopic (Ubuntu) |
tags: | added: verification-done-vivid removed: verification-needed-vivid |
tags: | added: verification-done-trusty removed: verification-needed-trusty |
tags: | removed: kernel-bug-break-fix |
According to the upstream bug:
"This bug is now fixed in the net tree: /git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=dfbafc995304ebb9a9b03f65083e6e9cea143b20"
https:/
This commit is already applied to mainline:
$ git describe --contains dfbafc995304ebb9a9b03f65083e6e9cea143b20
v4.2-rc5~9^2~26
commit dfbafc995304ebb9a9b03f65083e6e9cea143b20
Author: Sabrina Dubroca <email address hidden>
Date: Fri Jul 24 18:19:25 2015 +0200
tcp: fix recv with flags MSG_WAITALL | MSG_PEEK