ISST-LTE:pNV: system ben is hung during ST (nvme)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | High | Unassigned |
Xenial | Fix Released | Undecided | Tim Gardner |
Yakkety | Fix Released | High | Unassigned |
Bug Description
When running I/O-intensive tasks together with CPU addition/removal, the block layer may hang, stalling the entire machine.
The backtrace below is one of the symptoms:
[12747.111149] ---[ end trace b4d8d720952460b5 ]---
[12747.126885] Trying to free IRQ 357 from IRQ context!
[12747.146930] ------------[ cut here ]------------
[12747.166674] WARNING: at /build/
[12747.184069] Modules linked in: minix nls_iso8859_1 rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) ib_addr(OE) mlx4_en(OE) mlx4_core(OE) binfmt_misc xfs joydev input_leds mac_hid ofpart cmdlinepart powernv_flash ipmi_powernv mtd ipmi_msghandler at24 opal_prd powernv_rng ibmpowernv uio_pdrv_genirq uio sunrpc knem(OE) autofs4 btrfs xor raid6_pq hid_generic usbhid hid uas usb_storage nouveau ast bnx2x i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx5_core(OE) ahci drm mdio libcrc32c mlx_compat(OE) libahci vxlan nvme ip6_udp_tunnel udp_tunnel
[12747.349013] CPU: 80 PID: 0 Comm: swapper/80 Tainted: G W OEL 4.4.0-21-generic #37-Ubuntu
[12747.369046] task: c000000f1fab89b0 ti: c000000f1fb6c000 task.ti: c000000f1fb6c000
[12747.404848] NIP: c000000000131888 LR: c000000000131884 CTR: 00000000300303f0
[12747.808333] REGS: c000000f1fb6e550 TRAP: 0700 Tainted: G W OEL (4.4.0-21-generic)
[12747.867658] MSR: 9000000100029033 <SF,HV,
[12747.884783] CFAR: c000000000aea8f4 SOFTE: 1
GPR00: c000000000131884 c000000f1fb6e7d0 c0000000015b4200 0000000000000028
GPR04: c000000f2a409c50 c000000f2a41b4e0 0000000f29480000 00000000000033da
GPR08: 0000000000000007 c000000000f8b27c 0000000f29480000 9000000100001003
GPR12: 0000000000002200 c000000007b6f800 c000000f2a40a938 0000000000000100
GPR16: c000000f11480000 0000000000003a98 0000000000000000 0000000000000000
GPR20: 0000000000000000 d000000009521008 d0000000095146a0 fffffffffffff000
GPR24: c000000004a19ef0 0000000000000000 0000000000000003 000000000000007d
GPR28: 0000000000000165 c000000eefeb1800 c000000eef830600 0000000000000165
[12748.243270] NIP [c000000000131888] __free_
[12748.254089] LR [c000000000131884] __free_
[12748.269738] Call Trace:
[12748.286740] [c000000f1fb6e7d0] [c000000000131884] __free_
[12748.289687] [c000000f1fb6e860] [c000000000131af8] free_irq+0x88/0xb0
[12748.304594] [c000000f1fb6e890] [d000000009514528] nvme_suspend_
[12748.333825] [c000000f1fb6e8c0] [d00000000951681c] nvme_dev_
[12748.340913] [c000000f1fb6e9a0] [d000000009516ae4] nvme_timeout+
[12748.357136] [c000000f1fb6ea60] [c000000000548a34] blk_mq_
[12748.383939] [c000000f1fb6ead0] [c00000000054c540] bt_for_
[12748.399292] [c000000f1fb6eb40] [c00000000054d4e8] blk_mq_
[12748.402665] [c000000f1fb6eb90] [c000000000547358] blk_mq_
[12748.438649] [c000000f1fb6ebd0] [c00000000014a13c] call_timer_
[12748.468126] [c000000f1fb6ec60] [c00000000014a5fc] run_timer_
[12748.483367] [c000000f1fb6ed30] [c0000000000beb78] __do_softirq+
[12748.498378] [c000000f1fb6ee20] [c0000000000bf048] irq_exit+0xc8/0x100
[12748.501048] [c000000f1fb6ee40] [c00000000001f954] timer_interrupt
[12748.516377] [c000000f1fb6ee70] [c000000000002714] decrementer_
[12748.547282] --- interrupt: 901 at arch_local_
[12748.547282] LR = arch_local_
[12748.574141] [c000000f1fb6f160] [0000000000000001] 0x1 (unreliable)
[12748.592405] [c000000f1fb6f180] [c000000000aedc3c] dump_stack+
[12748.596461] [c000000f1fb6f1c0] [c0000000001006fc] dequeue_
[12748.611532] [c000000f1fb6f230] [c0000000000f6080] deactivate_
[12748.627685] [c000000f1fb6f270] [c000000000adcb10] __schedule+
[12748.654416] [c000000f1fb6f300] [c000000000add0a8] schedule+0x48/0xc0
[12748.670558] [c000000f1fb6f330] [c000000000ae1474] schedule_
[12748.673485] [c000000f1fb6f420] [c000000000ade23c] wait_for_
[12748.699192] [c000000f1fb6f4a0] [c0000000000e6908] kthread_
[12748.718385] [c000000f1fb6f4e0] [d000000009514240] nvme_dev_
[12748.748925] [c000000f1fb6f510] [d000000009516498] nvme_dev_
[12748.752112] [c000000f1fb6f5f0] [d000000009516ae4] nvme_timeout+
[12748.775395] [c000000f1fb6f6b0] [c000000000548a34] blk_mq_
[12748.821069] [c000000f1fb6f720] [c00000000054c540] bt_for_
[12748.851733] [c000000f1fb6f790] [c00000000054d4e8] blk_mq_
[12748.883093] [c000000f1fb6f7e0] [c000000000547358] blk_mq_
[12748.918348] [c000000f1fb6f820] [c00000000014a13c] call_timer_
[12748.934743] [c000000f1fb6f8b0] [c00000000014a5fc] run_timer_
[12748.938084] [c000000f1fb6f980] [c0000000000beb78] __do_softirq+
[12748.960815] [c000000f1fb6fa70] [c0000000000bf048] irq_exit+0xc8/0x100
[12748.992175] [c000000f1fb6fa90] [c00000000001f954] timer_interrupt
[12749.019299] [c000000f1fb6fac0] [c000000000002714] decrementer_
[12749.037168] --- interrupt: 901 at arch_local_
[12749.037168] LR = arch_local_
[12749.079044] [c000000f1fb6fdb0] [c000000f2a41d680] 0xc000000f2a41d680 (unreliable)
[12749.081736] [c000000f1fb6fdd0] [c000000000909a28] cpuidle_
[12749.127094] [c000000f1fb6fe30] [c000000000119a88] call_cpuidle+
[12749.144435] [c000000f1fb6fe70] [c000000000119e5c] cpu_startup_
[12749.166156] [c000000f1fb6ff30] [c00000000004563c] start_secondary
[12749.186929] [c000000f1fb6ff90] [c000000000008b6c] start_secondary
[12749.223828] Instruction dump:
[12749.223856] 4e800020 4bf83a5d 60000000 4bffff64 4bf83a51 60000000 4bffffa8 3c62ff7b
[12749.233245] 7f84e378 38630fe0 489b900d 60000000 <0fe00000> 4bfffe20 7d2903a6 387d0118
[12749.298371] ---[ end trace b4d8d720952460b6 ]---
== Comment: #184 - Gabriel Krisman Bertazi <email address hidden> - 2016-07-29 12:55:48 ==
I got it figured out. The nvme driver is not playing nicely with the block timeout infrastructure: the timeout code goes into a live lock, waiting for the queue to be released. CPU hotplug, which holds the queue freeze at that point, is in turn waiting for an outstanding request to time out (or complete). That request is stuck in the device and needs a reset triggered by a timeout, which never happens because of the live lock.
I don't know why the request gets stuck inside the device and requires a timeout; it could even be caused by the Leaf firmware itself. I also see some successful timeouts triggered under normal conditions. In the failure case we should be able to abort the request normally, but that happens via the timeout infrastructure, which is blocked during CPU hotplug events.
I have a quirk that fully recovers after the failure by forcing a reset of the stuck I/O, which lets CPU hotplug complete and the block layer recover. I have a machine hitting the failure every few minutes in a loop and recovering from it with my patch.
Patch submitted to linux-block
https:/
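For context, the freeze side of the cycle (the path CPU hotplug is blocked in) looks roughly like the sketch below, modeled on the upstream blk-mq queue-freeze helpers from around v4.7/v4.8. Names may differ between kernel versions, and this is an illustration rather than part of the submitted patch.

```c
/*
 * Illustrative sketch of the freeze side of the cycle described above,
 * modeled on the upstream blk-mq queue-freeze helpers (names from
 * around v4.7/v4.8; they differ slightly between kernel versions).
 */
#include <linux/blkdev.h>
#include <linux/blk-mq.h>

void blk_mq_freeze_queue_start(struct request_queue *q)
{
	int freeze_depth;

	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
	if (freeze_depth == 1) {
		/* New submitters can no longer take a "live" queue reference. */
		percpu_ref_kill(&q->q_usage_counter);
		blk_mq_run_hw_queues(q, false);
	}
}

static void blk_mq_freeze_queue_wait(struct request_queue *q)
{
	/*
	 * The hotplug path blocks here until every outstanding request
	 * drops its reference -- including the one stuck in the device,
	 * which can only be recovered by the (blocked) timeout handler.
	 */
	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
}
```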
== Comment: #207 - Gabriel Krisman Bertazi <email address hidden> - 2016-09-05 09:13:51 ==
Canonical,
This is fixed by:
e57690fe009b ("blk-mq: don't overwrite rq->mq_ctx")
0e87e58bf60e ("blk-mq: improve warning for running a queue on the wrong CPU")
71f79fb3179e (" blk-mq: Allow timeouts to run while queue is freezing")
Which will apply cleanly on top of your kernel.
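The third commit, 71f79fb3179e, is the one aimed directly at the live lock described in comment #184. Below is a minimal sketch of its core change, assuming the blk-mq timeout worker names from around v4.8; the tag scan that actually expires requests is elided because the fix does not touch it.

```c
/*
 * Sketch of the key change in 71f79fb3179e ("blk-mq: Allow timeouts to
 * run while queue is freezing"); blk-mq names assumed from around v4.8.
 */
#include <linux/blkdev.h>
#include <linux/blk-mq.h>

static void blk_mq_timeout_work(struct work_struct *work)
{
	struct request_queue *q =
		container_of(work, struct request_queue, timeout_work);

	/*
	 * blk_queue_enter() takes a "live" reference, which fails as soon
	 * as a freeze has started, so the pre-fix worker silently skipped
	 * timeout handling during a freeze: the live lock above.
	 * percpu_ref_tryget() keeps succeeding until q_usage_counter
	 * actually drains to zero, so the stuck request can still be timed
	 * out and the controller reset, letting the freeze (and the CPU
	 * hotplug operation) finish.
	 */
	if (!percpu_ref_tryget(&q->q_usage_counter))
		return;

	/* ... walk the busy tags and expire timed-out requests as before ... */

	blk_queue_exit(q);
}
```

The design point is to take a plain percpu reference instead of a live one, so the timeout worker keeps running in the short window between the start of a freeze and q_usage_counter draining to zero.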
Changed in linux (Ubuntu):
status: New → Incomplete
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → nobody
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: Confirmed → Triaged
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: removed: bugnameltc-140718 severity-critical
tags: added: bugnameltc-140718 severity-critical
tags: added: verification-needed-xenial removed: verification-failed-xenial
tags: added: verification-done-xenial removed: verification-needed-xenial