blk-mq: possible deadlock on CPU hot(un)plug
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ubuntu on IBM z Systems | Fix Released | High | Canonical Kernel Team |
linux (Ubuntu) | Fix Released | High | Juerg Haefliger |
Xenial | Fix Released | Undecided | Unassigned |
Bug Description
== Comment: #0 - Carsten Jacobi <email address hidden> - 2017-03-07 03:35:31 ==
I'm evaluating Ubuntu Xenial on z for development purposes. The test system is installed in an LPAR with one FCP LUN that is accessible via 4 paths (all paths are configured).
The system hangs regularly when I build packages with "pdebuild" from the pbuilder packaging suite.
The local Linux development team helped me out with a pre-analysis that I can post here (thanks a lot for that):
With the default settings and under a certain workload,
blk_mq seems to get into a presumed "deadlock".
Possibly this happens on CPU hot(un)plug.
After the I/O stalled, a dump was pulled manually.
The following information is from the crash dump pre-analysis.
$ zgetdump -i dump.0
General dump info:
Dump format........: elf
Version.
UTS node name......: mclint
UTS kernel release.: 4.4.0-65-generic
UTS kernel version.: #86-Ubuntu SMP Thu Feb 23 17:54:37 UTC 2017
System arch........: s390x (64 bit)
CPU count (online).: 2
Dump memory range..: 8192 MB
Memory map:
0000000000000000 - 00000001b831afff (7043 MB)
00000001b831b000 - 00000001ffffffff (1149 MB)
Things look similar with the HWE kernel on Ubuntu 16.04.
KERNEL: vmlinux.full
DUMPFILE: dump.0
CPUS: 2
DATE: Fri Mar 3 14:31:07 2017
UPTIME: 02:11:20
LOAD AVERAGE: 13.00, 12.92, 11.37
TASKS: 411
NODENAME: mclint
RELEASE: 4.4.0-65-generic
VERSION: #86-Ubuntu SMP Thu Feb 23 17:54:37 UTC 2017
MACHINE: s390x (unknown Mhz)
MEMORY: 7.8 GB
PANIC: ""
PID: 0
COMMAND: "swapper/0"
TASK: bad528 (1 of 2) [THREAD_INFO: b78000]
CPU: 0
STATE: TASK_RUNNING (ACTIVE)
INFO: no panic task found
crash> dev -d
MAJOR GENDISK NAME REQUEST_QUEUE TOTAL ASYNC SYNC DRV
...
8 1e1d6d800 sda 1e1d51210 0 23151 4294944145 N/A(MQ)
8 1e4e06800 sdc 2081b18 0 23148 4294944148 N/A(MQ)
8 1f07800 sdb 20c7568 0 23195 4294944101 N/A(MQ)
8 1e4e06000 sdd 1e4e31210 0 23099 4294944197 N/A(MQ)
252 1e1d6c800 dm-0 1e1d51b18 9 1 8 N/A(MQ)
...
So both dm-mpath and sd have requests pending in their block multiqueue.
The large numbers in the SYNC column of the sd rows look strange; they appear to be the unsigned (32-bit) representation of the ASYNC values multiplied by -1.
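That arithmetic can be checked directly; a quick sketch (values taken from the `dev -d` output above):

```python
# SYNC values from the sda and sdc rows, interpreted as 32-bit
# two's-complement numbers, come out to exactly -ASYNC:
for async_count, sync_shown in [(23151, 4294944145),   # sda
                                (23148, 4294944148)]:  # sdc
    assert sync_shown - 2**32 == -async_count
```

This supports reading the sd rows as "ASYNC requests pending, SYNC counter underflowed below zero", rather than as billions of genuinely pending sync requests.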
[ 0.798256] Linux version 4.4.0-65-generic (buildd@z13-011) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~
[ 0.798262] setup: Linux is running natively in 64-bit mode
[ 0.798290] setup: Max memory size: 8192MB
[ 0.798298] setup: Reserving 196MB of memory at 7996MB for crashkernel (System RAM: 7996MB)
[ 0.836923] Kernel command line: root=/dev/
[ 5281.179428] INFO: task xfsaild/dm-11:1604 blocked for more than 120 seconds.
[ 5281.179437] Not tainted 4.4.0-65-generic #86-Ubuntu
[ 5281.179438] "echo 0 > /proc/sys/
[ 5281.179440] xfsaild/dm-11 D 00000000007bcf52 0 1604 2 0x00000000
[ 5281.179444] 00000001e931c230 00000000001a6964 00000001e6f9b958 00000001e6f9b9d8
[ 5281.179454] Call Trace:
[ 5281.179461] ([<00000000007b
[ 5281.179462] [<00000000007bc
[ 5281.179465] [<00000000007c0
[ 5281.179466] [<00000000007bd
[ 5281.179472] [<000000000017b
[ 5281.179564] [<000003ff805e1
[ 5281.179589] [<000003ff805de
[ 5281.179615] [<000003ff805df
[ 5281.179640] [<000003ff805ec
[ 5281.179643] [<0000000000183
[ 5281.179645] [<00000000007c0
[ 5281.179646] [<00000000007c0
see below
[ 5281.179664] INFO: task cpuplugd:2260 blocked for more than 120 seconds.
[ 5281.179665] Not tainted 4.4.0-65-generic #86-Ubuntu
[ 5281.179666] "echo 0 > /proc/sys/
[ 5281.179668] cpuplugd D 00000000007bcf52 0 2260 1 0x00000000
[ 5281.179670] 00000001e782e318 00000000001a6964 000000007bc4ba58 000000007bc4bad8
[ 5281.179678] Call Trace:
[ 5281.179680] ([<00000000007b
[ 5281.179681] [<00000000007bc
[ 5281.179685] [<0000000000516
[ 5281.179687] [<0000000000519
[ 5281.179690] [<0000000000184
[ 5281.179692] [<0000000000184
[ 5281.179696] [<0000000000160
[ 5281.179698] [<0000000000160
[ 5281.179700] [<00000000005d0
[ 5281.179703] [<00000000005ca
[ 5281.179704] [<00000000005ca
[ 5281.179710] [<00000000003a4
[ 5281.179712] [<0000000000312
[ 5281.179714] [<0000000000312
[ 5281.179715] [<00000000007c0
[ 5281.179718] [<000003ff803df
Cpuplugd performs CPU hot(un)plug based on configurable rules.
https:/
https:/
https:/
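For reference, cpuplugd is driven by rules in /etc/cpuplugd.conf. A minimal illustrative fragment follows; the keyword names come from s390-tools, but the threshold values here are made-up examples, not the configuration of the affected system:

```ini
# /etc/cpuplugd.conf (illustrative fragment; values are examples only)
UPDATE="1"      # rule evaluation interval in seconds
CPU_MIN="1"     # never unplug below this many CPUs
CPU_MAX="0"     # 0 = no upper limit
# Plug a CPU when load is high, unplug when the system is mostly idle:
HOTPLUG="(loadavg > onumcpus + 0.75) & (runnable_proc > onumcpus)"
HOTUNPLUG="(loadavg < onumcpus - 0.25) | (idle > 90)"
```

Under the build workload described above, rules like these make cpuplugd hot(un)plug CPUs continuously, which is what exercises the suspected blk-mq hotplug path.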
[ 5281.179769] INFO: task kworker/0:2:23669 blocked for more than 120 seconds.
[ 5281.179770] Not tainted 4.4.0-65-generic #86-Ubuntu
[ 5281.179771] "echo 0 > /proc/sys/
[ 5281.179773] kworker/0:2 D 00000000007bcf52 0 23669 2 0x00000000
[ 5281.179801] Workqueue: xfs-cil/dm-11 xlog_cil_push_work [xfs]
[ 5281.179802] 00000001c925e318 00000000001a6964 0000000068c5f9f8 0000000068c5fa78
[ 5281.179810] Call Trace:
[ 5281.179812] ([<00000000007b
[ 5281.179813] [<00000000007bc
[ 5281.179839] [<000003ff805de
[ 5281.179864] [<000003ff805de
[ 5281.179890] [<000003ff805e0
[ 5281.179891] [<000000000017c
[ 5281.179893] [<000000000017c
[ 5281.179894] [<0000000000183
[ 5281.179896] [<00000000007c0
[ 5281.179898] [<00000000007c0
While the above kworker executes a work item for a long duration, other processes in turn block in flush_work for a long duration.
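The blocking pattern here, flush waiting until a long-running work item completes, can be sketched with a userspace analogy. This is purely illustrative Python threading, not kernel code; the name `flush_workqueue` mirrors the kernel API but the implementation is my own:

```python
import queue
import threading
import time

work_q = queue.Queue()

def worker():
    """Dedicated worker thread, loosely analogous to a kworker."""
    while True:
        item = work_q.get()
        if item is None:
            break
        item()                      # run the work item (may take a long time)
        work_q.task_done()

def flush_workqueue():
    """Block until all queued work items have completed,
    analogous to the kernel's flush_work()/flush_workqueue()."""
    work_q.join()

threading.Thread(target=worker, daemon=True).start()

completed = []
# A "long-running" work item, standing in for xlog_cil_push_work above:
work_q.put(lambda: (time.sleep(0.2), completed.append("cil push")))

flush_workqueue()                   # blocks here until the item is done
assert completed == ["cil push"]
```

If the work item itself can never finish (here: because it waits on something held by a task that is in turn stuck in the CPU hotplug notifier chain), every caller of `flush_workqueue()` blocks forever, matching the hung tasks in the traces above.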
[ 5281.179728] INFO: task kworker/0:1:4454 blocked for more than 120 seconds.
[ 5281.179730] Not tainted 4.4.0-65-generic #86-Ubuntu
[ 5281.179731] "echo 0 > /proc/sys/
[ 5281.179732] kworker/0:1 D 00000000007bcf52 0 4454 2 0x00000000
[ 5281.179738] Workqueue: events vmstat_shepherd
[ 5281.179739] 00000001c925ac40 00000000001a6964 000000007eb8bb38 000000007eb8bbb8
[ 5281.179747] Call Trace:
[ 5281.179749] ([<00000000007b
[ 5281.179750] [<00000000007bc
[ 5281.179752] [<00000000007bd
[ 5281.179753] [<00000000007be
[ 5281.179755] [<00000000007be
[ 5281.179756] [<000000000015f
[ 5281.179757] [<00000000002a3
[ 5281.179759] [<000000000017c
[ 5281.179761] [<000000000017c
[ 5281.179762] [<0000000000183
[ 5281.179764] [<00000000007c0
[ 5281.179765] [<00000000007c0
This work item cannot make progress, possibly because cpuplugd:2260 above "hangs" in the CPU hotplug notifier chain.
The low-level device driver (here zfcp) is completely idle after the lockup occurred, without any pending I/O; all of its paths are in good state and could service I/O, but it simply does not receive any new I/O requests from the upper layers (SCSI / block). zfcp does not implement blk-mq itself, so dm or SCSI translate; this works in general but fails with the above workload. There were no other undesired events, i.e. no path interruptions nor any recovery actions in zfcp.
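One way to see whether blk-mq is in use for the SCSI and dm layers on this kernel is via module parameters; disabling it at boot is, as an assumption on my part and not something verified in this report, a possible way to sidestep the hang until the fix lands:

```
# Current state on a 4.4 kernel (Y means blk-mq is in use):
#   /sys/module/scsi_mod/parameters/use_blk_mq
#   /sys/module/dm_mod/parameters/use_blk_mq
# Booting with blk-mq disabled for scsi and dm, e.g. in /etc/default/grub:
GRUB_CMDLINE_LINUX="scsi_mod.use_blk_mq=0 dm_mod.use_blk_mq=0"
```

This only avoids the multiqueue code path; it does not address the underlying hotplug issue.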
tags: added: s390x; removed: bot-comment
affects: ubuntu → linux (Ubuntu)
Changed in ubuntu-z-systems:
  assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
  importance: Undecided → High
Changed in linux (Ubuntu):
  importance: Undecided → High
tags: added: kernel-key
Changed in linux (Ubuntu):
  status: New → Triaged
tags: added: kernel-da-key; removed: kernel-key
Changed in ubuntu-z-systems:
  status: New → Triaged
tags: added: kernel-key; removed: kernel-da-key
tags: added: kernel-da-key; removed: kernel-key
Changed in linux (Ubuntu Xenial):
  status: New → Fix Committed
Changed in linux (Ubuntu):
  status: Triaged → Fix Committed
Changed in ubuntu-z-systems:
  status: Triaged → Fix Committed
Changed in linux (Ubuntu):
  status: Fix Committed → Fix Released
Changed in ubuntu-z-systems:
  status: Fix Committed → Fix Released