Ubuntu
linux package

possible deadlock while using the cgroup freezer on a container with NFS-based workload

Bug #1598285 reported by Tycho Andersen on 2016-07-01

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	linux (Ubuntu)	In Progress	High	Seth Forshee

Bug Description

Hi guys,

For background: I'm running a container with an NFS filesystem bind mounted into it. The workload I'm running is iozone, a filesystem benchmarking tool. While running this workload, I attempt to freeze the container, which gets stuck in the FREEZING state. After a while, I get:

Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.104156] INFO: task iozone:20035 blocked for more than 120 seconds.
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.111056] Tainted: P O 4.4.0-24-generic #43-Ubuntu
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.118053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.126110] iozone D ffff880015673e18 0 20035 20005 0x00000104
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.126116] ffff880015673e18 ffff880000000010 ffff880045a21b80 ffff880037776e00
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.126118] ffff880015674000 ffff8800179d6e54 ffff880037776e00 00000000ffffffff
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.126120] ffff8800179d6e58 ffff880015673e30 ffffffff81821b15 ffff8800179d6e50
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.126121] Call Trace:
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.126129] [<ffffffff81821b15>] schedule+0x35/0x80
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.126131] [<ffffffff81821dbe>] schedule_preempt_disabled+0xe/0x10
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.126134] [<ffffffff818239f9>] __mutex_lock_slowpath+0xb9/0x130
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.126136] [<ffffffff81823a8f>] mutex_lock+0x1f/0x30
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.126139] [<ffffffff8121d00b>] do_unlinkat+0x12b/0x2d0
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.126142] [<ffffffff8121dc16>] SyS_unlink+0x16/0x20
Jul 1 01:45:14 juju-19f8e3-15 kernel: [206520.126146] [<ffffffff81825bf2>] entry_SYSCALL_64_fastpath+0x16/0x71

It looks like the task is actually stuck in generic fs code, not anything NFS specific, but perhaps that's a relevant detail. Anyway:

ubuntu@juju-19f8e3-15:~$ sudo cat /proc/20035/stack
[<ffffffff8121d00b>] do_unlinkat+0x12b/0x2d0
[<ffffffff8121dc16>] SyS_unlink+0x16/0x20
[<ffffffff81825bf2>] entry_SYSCALL_64_fastpath+0x16/0x71
[<ffffffffffffffff>] 0xffffffffffffffff

The container and host are both xenial:

ubuntu@juju-19f8e3-15:~$ uname -a
Linux juju-19f8e3-15 4.4.0-24-generic #43-Ubuntu SMP Wed Jun 8 19:27:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Finally, I don't have a good reproducer for this. It's pretty rare, as I'm running this benchmark in a loop, and over thousands of runs I've seen this exactly once.

I'll leave these hosts up for a bit in case there are any other interesting bits of info to collect.
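For anyone trying to observe the same stuck freeze from the host, here is a minimal sketch, assuming the cgroup v1 freezer and its `freezer.state` file. The function names and the 30-second default timeout are made up for illustration, and the cgroup directory depends on how the container manager lays out its cgroups.

```sh
#!/bin/sh
# Sketch only: request a freeze on a cgroup v1 freezer directory and wait.
# Function names and the default timeout are illustrative assumptions.

# Print the current freezer state (THAWED, FREEZING or FROZEN).
freezer_state() {
    cat "$1/freezer.state"
}

# Poll until the cgroup reports FROZEN, or give up after $2 seconds.
# Returns 0 once frozen, 1 on timeout (the stuck-in-FREEZING case).
wait_frozen() {
    dir=$1
    remaining=$2
    while [ "$(freezer_state "$dir")" != "FROZEN" ]; do
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Ask the kernel to freeze every task in the cgroup, then wait for it.
# $1: freezer cgroup directory, $2: timeout in seconds (default 30).
freeze_cgroup() {
    echo FROZEN > "$1/freezer.state" && wait_frozen "$1" "${2:-30}"
}
```

Something like `freeze_cgroup /sys/fs/cgroup/freezer/lxc/mycontainer 60 || echo "still not FROZEN"` (the path is hypothetical) would then report a freeze that never completes, rather than blocking forever.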

Brad Figg (brad-figg) wrote on 2016-07-01: Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1598285

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete

Seth Forshee (sforshee) wrote on 2016-07-01:

We have two processes hung, seemingly waiting on i_mutex, pids 20035 and 20036. Pid 20032 is frozen with the following stack trace:

[<ffffffff810e9cfa>] __refrigerator+0x7a/0x140
[<ffffffffc08e80b8>] nfs4_handle_exception+0x118/0x130 [nfsv4]
[<ffffffffc08e9efd>] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
[<ffffffffc088a329>] nfs_unlink+0x149/0x350 [nfs]
[<ffffffff81219bd1>] vfs_unlink+0xf1/0x1a0
[<ffffffff8121d159>] do_unlinkat+0x279/0x2d0
[<ffffffff8121dc16>] SyS_unlink+0x16/0x20
[<ffffffff81825bf2>] entry_SYSCALL_64_fastpath+0x16/0x71
[<ffffffffffffffff>] 0xffffffffffffffff

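which is suspicious: it was frozen inside nfs_unlink, below do_unlinkat, which takes the parent directory's i_mutex before calling vfs_unlink. All three processes are from iozone.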
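To find tasks like this one, a rough sketch follows. It assumes the cgroup v1 `tasks` file and `/proc/<pid>/stack`, which normally needs root to read. `tasks_with_symbol` and its optional third argument are hypothetical names for illustration, not part of any existing tool.

```sh
#!/bin/sh
# Sketch only: print the IDs from a cgroup's tasks file whose kernel stack
# mentions a given symbol.
# $1: cgroup directory
# $2: symbol to look for, e.g. __refrigerator
# $3: proc root (defaults to /proc; overridable only to ease testing)
tasks_with_symbol() {
    proc=${3:-/proc}
    while read -r pid; do
        # Tasks that exited, or whose stack is unreadable, are skipped.
        if grep -qF -- "$2" "$proc/$pid/stack" 2>/dev/null; then
            echo "$pid"
        fi
    done < "$1/tasks"
}
```

Run as root against the container's freezer cgroup with `__refrigerator` as the symbol, this lists the tasks that are already frozen. Any of them whose stack also passes through filesystem code such as vfs_unlink is a candidate for having been frozen while holding a lock that other tasks are waiting on.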

Daniel Westervelt (danwest) on 2016-07-06

Changed in linux (Ubuntu):
importance:	Undecided → High

Seth Forshee (sforshee) wrote on 2016-07-06:

I've sent an inquiry to the upstream maintainers for assistance. I've also taken a stab at a fix, which I think should prevent the hang, but I'm not sure whether or not it might cause other problems. The patch and test build are here:

http://people.canonical.com/~sforshee/lp1598285/

I currently have a setup running to try to reproduce the bug. Once I've confirmed that I can reproduce it, I can test my fix.

Changed in linux (Ubuntu):
assignee:	nobody → Seth Forshee (sforshee)
status:	Incomplete → In Progress
