cifsd deadlocks / CIFS related Oopses
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux-aws (Ubuntu) | New | Undecided | Unassigned |
Bug Description
We're running a server at AWS which collects data from machines over CIFS. This involves a lot of mounting and unmounting of CIFS shares (about 100 targets with 2 shares each, with a 10 s delay in between). The targets sometimes become unavailable when they are turned off for the weekend or rebooted.
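For illustration, the collection loop looks roughly like the following dry-run sketch. The target IPs, share names, mount point layout, and credentials path are placeholders, not our real configuration; the sketch only prints the commands it would run.

```shell
#!/bin/sh
# Dry-run sketch of the collection loop described above. Targets, share
# names and the credentials path are illustrative placeholders.
TARGETS="172.22.0.10 172.30.113.108"   # ~100 machines in production
SHARES="data logs"                     # two shares per target (names assumed)

plan=""
for target in $TARGETS; do
  for share in $SHARES; do
    mnt="/mnt/cifs/$target/$share"
    plan="$plan
mount -t cifs //$target/$share $mnt -o ro,relatime,credentials=/etc/cifs.cred
umount -l $mnt"
  done
  plan="$plan
sleep 10"                              # delay between targets
done
printf '%s\n' "$plan"                  # pipe to sh (as root) to actually run it
```

When a target is powered off, the `mount` step is what ends up waiting on the CIFS reconnect logic shown in the traces below.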
The server doing this has to be rebooted every few hours because CIFS connections start to hang and don't recover. The usual symptom is:
Jul 24 10:12:59 connector kernel: [ 7765.705409] CIFS: Attempting to mount //172.22.
Jul 24 10:13:01 connector kernel: [ 7767.689258] CIFS: Attempting to mount //172.22.
Jul 24 10:13:06 connector kernel: [ 7772.758283] CIFS: Attempting to mount //172.30.
Jul 24 10:13:06 connector kernel: [ 7773.300475] CIFS: Attempting to mount //172.30.
Jul 24 10:13:09 connector kernel: [ 7776.364516] CIFS: Attempting to mount //172.30.
Jul 24 10:13:11 connector kernel: [ 7777.978731] CIFS: Attempting to mount //172.30.
[...]
Jul 24 10:16:13 connector kernel: [ 7960.390529] CIFS VFS: \\172.30.113.108 has not responded in 180 seconds. Reconnecting...
Jul 24 10:16:15 connector kernel: [ 7962.468649] CIFS VFS: \\172.30.93.171 has not responded in 180 seconds. Reconnecting...
Jul 24 10:16:18 connector kernel: [ 7964.999037] CIFS VFS: \\172.30.99.55 has not responded in 180 seconds. Reconnecting...
Jul 24 10:16:31 connector kernel: [ 7977.798821] INFO: task cifsd:26252 blocked for more than 120 seconds.
Jul 24 10:16:31 connector kernel: [ 7977.803730] Not tainted 5.4.0-1020-aws #20-Ubuntu
Jul 24 10:16:31 connector kernel: [ 7977.808526] "echo 0 > /proc/sys/
Jul 24 10:16:31 connector kernel: [ 7977.820291] cifsd D 0 26252 2 0x80004000
Jul 24 10:16:31 connector kernel: [ 7977.820298] Call Trace:
Jul 24 10:16:31 connector kernel: [ 7977.820307] __schedule+
Jul 24 10:16:31 connector kernel: [ 7977.820310] ? __switch_
Jul 24 10:16:31 connector kernel: [ 7977.820313] ? __switch_
Jul 24 10:16:31 connector kernel: [ 7977.820315] schedule+0x42/0xb0
Jul 24 10:16:31 connector kernel: [ 7977.820318] rwsem_down_
Jul 24 10:16:31 connector kernel: [ 7977.820321] down_read+0x85/0xa0
Jul 24 10:16:31 connector kernel: [ 7977.820324] iterate_
Jul 24 10:16:31 connector kernel: [ 7977.820411] ? cifs_set_
Jul 24 10:16:31 connector kernel: [ 7977.820429] cifs_reconnect+
Jul 24 10:16:31 connector kernel: [ 7977.820433] ? vprintk_
Jul 24 10:16:31 connector kernel: [ 7977.820449] cifs_readv_
Jul 24 10:16:31 connector kernel: [ 7977.820465] cifs_read_
Jul 24 10:16:31 connector kernel: [ 7977.820482] ? allocate_
Jul 24 10:16:31 connector kernel: [ 7977.820497] cifs_demultiple
Jul 24 10:16:31 connector kernel: [ 7977.820500] kthread+0x104/0x140
Jul 24 10:16:31 connector kernel: [ 7977.820516] ? cifs_handle_
Jul 24 10:16:31 connector kernel: [ 7977.820518] ? kthread_
Jul 24 10:16:31 connector kernel: [ 7977.820520] ret_from_
Jul 24 10:16:31 connector kernel: [ 7977.820524] INFO: task cifsd:26328 blocked for more than 120 seconds.
Jul 24 10:16:31 connector kernel: [ 7977.827503] Not tainted 5.4.0-1020-aws #20-Ubuntu
That is, cifsd gets stuck fetching credentials for the reconnect. I'm attaching the full syslog with stack traces from all hung cifsd tasks (I can't spot the deadlock in them).
The mounting/unmounting is done in a privileged Docker container. If we restart that, we usually run into an Oops:
Jul 25 07:43:29 connector kernel: [64677.164367] Oops: 0000 [#1] SMP NOPTI
Jul 25 07:43:29 connector kernel: [64677.164370] CPU: 0 PID: 265452 Comm: cifsd Not tainted 5.4.0-1020-aws #20-Ubuntu
Jul 25 07:43:29 connector kernel: [64677.164370] Hardware name: Amazon EC2 t3a.large/, BIOS 1.0 10/16/2017
Jul 25 07:43:29 connector kernel: [64677.164400] RIP: 0010:cifs_
Jul 25 07:43:29 connector kernel: [64677.164403] Code: e8 bb 43 0c d5 66 90 48 8b 45 c0 48 8d 55 c0 4c 8d 6d b8 48 39 c2 74 62 49 be 00 01 00 00 00 00 ad de 48 8b 45 c0 4c 8d 78 f8 <48> 8b 00 48 8d 58 f8 4d 39 ef 74 3d 49 8b 57 10 48 89 50 08 48 89
Jul 25 07:43:29 connector kernel: [64677.218175] RSP: 0018:ffffbf25c0
Jul 25 07:43:29 connector kernel: [64677.222539] RAX: 0000000000000000 RBX: ffff9cdef66f0800 RCX: ffffffff95cd8510
Jul 25 07:43:29 connector kernel: [64677.227607] RDX: ffffbf25c0b27d30 RSI: ffffbf25c0b27d18 RDI: ffffffffc0aeec18
Jul 25 07:43:29 connector kernel: [64677.232638] RBP: ffffbf25c0b27d70 R08: 0000000000000180 R09: 0000000000000000
Jul 25 07:43:29 connector kernel: [64677.237666] R10: ffff9cdf32a173c8 R11: 0000000000000000 R12: 00000000fffffffe
Jul 25 07:43:29 connector kernel: [64677.242789] R13: ffffbf25c0b27d28 R14: dead000000000100 R15: fffffffffffffff8
Jul 25 07:43:29 connector kernel: [64677.247874] FS: 000000000000000
Jul 25 07:43:29 connector kernel: [64677.254956] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 25 07:43:29 connector kernel: [64677.259348] CR2: 0000000000000000 CR3: 00000001cddce000 CR4: 00000000003406f0
Jul 25 07:43:29 connector kernel: [64677.264439] Call Trace:
Jul 25 07:43:29 connector kernel: [64677.267345] ? vprintk_
Jul 25 07:43:29 connector kernel: [64677.270720] cifs_readv_
Jul 25 07:43:29 connector kernel: [64677.274889] cifs_read_
Jul 25 07:43:29 connector kernel: [64677.278914] ? cifs_add_
Jul 25 07:43:29 connector kernel: [64677.282722] ? allocate_
Jul 25 07:43:29 connector kernel: [64677.286453] cifs_demultiple
Jul 25 07:43:29 connector kernel: [64677.290566] kthread+0x104/0x140
Jul 25 07:43:29 connector kernel: [64677.293969] ? cifs_handle_
Jul 25 07:43:29 connector kernel: [64677.298096] ? kthread_
Jul 25 07:43:29 connector kernel: [64677.301535] ret_from_
Jul 25 07:43:29 connector kernel: [64677.304799] Modules linked in: md4 nls_utf8 cifs libarc4 libdes rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache xt_nat veth vxlan ip6_udp_tunnel udp_tunnel xt_policy iptable_mangle xt_mark xt_u32 xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_ iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bpfilter br_netfilter bridge stp llc aufs overlay dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ppdev crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper ena serio_raw parport_pc parport sch_fq_codel drm i2c_core sunrpc ip_tables x_tables autofs4
Jul 25 07:43:29 connector kernel: [64677.387761] CR2: 0000000000000000
Jul 25 07:43:29 connector kernel: [64677.391027] ---[ end trace b498d70d7111f607 ]---
The mount options used are:
ro,relatime,
The attached log files also contain some CIFS debug output, generated with:
echo 'module cifs +p' > /sys/kernel/
echo 'file fs/cifs/* +p' > /sys/kernel/
echo 1 > /proc/fs/
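The destination paths above got truncated; presumably they are the standard dynamic-debug control file and the CIFS `cifsFYI` switch, i.e. something along these lines (requires root and a mounted debugfs):

```shell
# Enable verbose CIFS logging. These are the standard kernel debug
# interfaces the truncated commands above presumably point at.
echo 'module cifs +p'   > /sys/kernel/debug/dynamic_debug/control
echo 'file fs/cifs/* +p' > /sys/kernel/debug/dynamic_debug/control
echo 1 > /proc/fs/cifs/cifsFYI
```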
Is there any way of trying a newer kernel? https:/
ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: linux-image-
ProcVersionSign
Uname: Linux 5.4.0-1020-aws x86_64
ApportVersion: 2.20.11-0ubuntu27.4
Architecture: amd64
CasperMD5CheckR
Date: Sat Jul 25 11:55:47 2020
Ec2AMI: ami-07d14b5d472
Ec2AMIManifest: (unknown)
Ec2Availability
Ec2InstanceType: t3a.large
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
ProcEnviron:
TERM=xterm-
PATH=(custom, no user)
XDG_RUNTIME_
LANG=C.UTF-8
SHELL=/usr/bin/zsh
SourcePackage: linux-aws
UpgradeStatus: No upgrade log present (probably fresh install)