possible deadlock while using the cgroup freezer on a container with NFS-based workload

Bug #1598285 reported by Tycho Andersen
This bug affects 2 people
Affects: linux (Ubuntu)
Status: In Progress
Importance: High
Assigned to: Seth Forshee

Bug Description

Hi guys,

For background: I'm running a container with an NFS filesystem bind mounted into it. The workload I'm running is iozone, a filesystem benchmarking tool. While running this workload, I attempt to freeze the container, which gets stuck in the FREEZING state. After a while, I get:

Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.104156] INFO: task iozone:20035 blocked for more than 120 seconds.
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.111056] Tainted: P O 4.4.0-24-generic #43-Ubuntu
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.118053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.126110] iozone D ffff880015673e18 0 20035 20005 0x00000104
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.126116] ffff880015673e18 ffff880000000010 ffff880045a21b80 ffff880037776e00
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.126118] ffff880015674000 ffff8800179d6e54 ffff880037776e00 00000000ffffffff
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.126120] ffff8800179d6e58 ffff880015673e30 ffffffff81821b15 ffff8800179d6e50
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.126121] Call Trace:
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.126129] [<ffffffff81821b15>] schedule+0x35/0x80
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.126131] [<ffffffff81821dbe>] schedule_preempt_disabled+0xe/0x10
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.126134] [<ffffffff818239f9>] __mutex_lock_slowpath+0xb9/0x130
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.126136] [<ffffffff81823a8f>] mutex_lock+0x1f/0x30
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.126139] [<ffffffff8121d00b>] do_unlinkat+0x12b/0x2d0
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.126142] [<ffffffff8121dc16>] SyS_unlink+0x16/0x20
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.126146] [<ffffffff81825bf2>] entry_SYSCALL_64_fastpath+0x16/0x71

It looks like the task is actually stuck in generic fs code rather than anything NFS-specific, though the NFS mount may still be a relevant detail. Anyway:

ubuntu@juju-19f8e3-15:~$ sudo cat /proc/20035/stack
[<ffffffff8121d00b>] do_unlinkat+0x12b/0x2d0
[<ffffffff8121dc16>] SyS_unlink+0x16/0x20
[<ffffffff81825bf2>] entry_SYSCALL_64_fastpath+0x16/0x71
[<ffffffffffffffff>] 0xffffffffffffffff
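
When comparing several traces like the ones above, it can help to pull the symbol names out mechanically. The following is a minimal sketch, not part of the original report: the regular expression and the `parse_frames` name are assumptions, written against the frame format shown here (`[<address>] symbol+0xoffset/0xsize [module]`).

```python
import re

# Assumed frame layout, based only on the traces in this report:
#   [<ffffffff8121d00b>] do_unlinkat+0x12b/0x2d0
#   [<ffffffffc088a329>] nfs_unlink+0x149/0x350 [nfs]
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"                          # return address
    r"(?P<symbol>[^\s+]+)"                                   # symbol name
    r"(?:\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+))?"  # optional offset/size
    r"(?:\s+\[(?P<module>[^\]]+)\])?"                        # optional module name
)


def parse_frames(text):
    """Return one dict per stack frame found in `text`, in order."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append(match.groupdict())
    return frames


if __name__ == "__main__":
    sample = (
        "[<ffffffff8121d00b>] do_unlinkat+0x12b/0x2d0\n"
        "[<ffffffff8121dc16>] SyS_unlink+0x16/0x20\n"
    )
    for frame in parse_frames(sample):
        print(frame["symbol"], frame["module"])
```

Because the pattern is searched for anywhere in the line, the same function should also accept the syslog-prefixed lines from the hung task message.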

The container and host are both xenial:

ubuntu@juju-19f8e3-15:~$ uname -a
Linux juju-19f8e3-15 4.4.0-24-generic #43-Ubuntu SMP Wed Jun 8 19:27:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Finally, I don't have a good reproducer for this. It's pretty rare, as I'm running this benchmark in a loop, and over thousands of runs I've seen this exactly once.
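
Since the hang shows up so rarely, a benchmark loop could check for it automatically after each freeze request. The sketch below is an illustration, not the harness actually used here: it polls a cgroup v1 `freezer.state` file (which reports `THAWED`, `FREEZING`, or `FROZEN`) and returns the last state seen. The function name, timeout values, and the example path are all assumptions.

```python
import time
from pathlib import Path


def wait_for_frozen(state_file, timeout=10.0, interval=0.5):
    """Poll a cgroup v1 freezer.state file until it reads FROZEN.

    Returns the last state read. If the timeout expires first, the
    returned value is whatever the file said at that point, for example
    FREEZING when the freeze is stuck.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = Path(state_file).read_text().strip()
        if state == "FROZEN" or time.monotonic() >= deadline:
            return state
        time.sleep(interval)


if __name__ == "__main__":
    # Hypothetical path; the real location depends on how the container
    # manager lays out its freezer cgroups.
    path = "/sys/fs/cgroup/freezer/lxc/mycontainer/freezer.state"
    final = wait_for_frozen(path, timeout=120.0)
    if final != "FROZEN":
        print("freeze did not complete, state is", final)
```

A run that ends in `FREEZING` would be the point to stop the loop and collect `/proc/<pid>/stack` for the tasks in the cgroup.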

I'll leave these hosts up for a bit in case there are any other interesting bits of info to collect.

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1598285

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Seth Forshee (sforshee) wrote :

Two processes, pids 20035 and 20036, are hung, seemingly waiting on i_mutex. Pid 20032 is frozen with the following stack trace:

[<ffffffff810e9cfa>] __refrigerator+0x7a/0x140
[<ffffffffc08e80b8>] nfs4_handle_exception+0x118/0x130 [nfsv4]
[<ffffffffc08e9efd>] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
[<ffffffffc088a329>] nfs_unlink+0x149/0x350 [nfs]
[<ffffffff81219bd1>] vfs_unlink+0xf1/0x1a0
[<ffffffff8121d159>] do_unlinkat+0x279/0x2d0
[<ffffffff8121dc16>] SyS_unlink+0x16/0x20
[<ffffffff81825bf2>] entry_SYSCALL_64_fastpath+0x16/0x71
[<ffffffffffffffff>] 0xffffffffffffffff

which is suspicious. All three processes are from iozone.
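
The pattern that makes this trace stand out (a task parked in `__refrigerator` while still inside an unlink call, which would be consistent with it holding the lock the other two are waiting on) can be looked for mechanically across all tasks in a cgroup. This is only a rough sketch: the helper names and the choice of `vfs_unlink` and `do_unlinkat` as marker frames are assumptions drawn from the traces in this report, and a match does not prove that a lock is held.

```python
import re

# Extracts the symbol name from frames such as
#   [<ffffffff810e9cfa>] __refrigerator+0x7a/0x140
SYMBOL_RE = re.compile(r"\[<[0-9a-f]+>\]\s+([^\s+]+)")

# Assumption: seeing one of these below __refrigerator means the task was
# frozen partway through a VFS unlink. Taken from this report's traces only.
VFS_FRAMES = {"vfs_unlink", "do_unlinkat"}


def symbols(stack_text):
    """Return the symbol names in a /proc/<pid>/stack dump, top frame first."""
    return [m.group(1) for m in SYMBOL_RE.finditer(stack_text)]


def frozen_inside_vfs(stack_text):
    """True if the top frame is __refrigerator and a VFS frame sits below it."""
    syms = symbols(stack_text)
    if not syms or syms[0] != "__refrigerator":
        return False
    return any(sym in VFS_FRAMES for sym in syms[1:])
```

Applied to the three iozone tasks, a check like this would flag pid 20032 and leave the two tasks blocked in `do_unlinkat` unflagged, since their top frame is not `__refrigerator`.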

Changed in linux (Ubuntu):
importance: Undecided → High
Seth Forshee (sforshee) wrote :

I've sent an inquiry to the upstream maintainers for assistance. I've also taken a stab at a fix, which I think should prevent the hang, but I'm not sure whether or not it might cause other problems. The patch and test build are here:

http://people.canonical.com/~sforshee/lp1598285/

I currently have a setup running to try to reproduce the bug. Once I've confirmed that I can reproduce it, I can test my fix.

Changed in linux (Ubuntu):
assignee: nobody → Seth Forshee (sforshee)
status: Incomplete → In Progress