2017-03-13 20:13:10 |
Alexandru Avadanii |
bug |
|
|
added bug |
2017-03-13 20:17:22 |
Alexandru Avadanii |
bug |
|
|
added subscriber Ciprian Barbu |
2017-03-13 20:17:32 |
Alexandru Avadanii |
bug |
|
|
added subscriber Paul Vaduva |
2017-03-13 20:21:52 |
Alexandru Avadanii |
tags |
|
apport-collected xenial |
|
2017-03-13 20:21:53 |
Alexandru Avadanii |
description |
For days, I have been trying to find an easy way to reproduce this.
We initially observed it in OPNFV Armband, when we tried to upgrade the kernel of our Ubuntu Xenial installation to linux-image-generic-hwe-16.04 (4.8).
In our environment, it was easily triggered on compute nodes when launching multiple VMs (we suspected OVS, QEMU, etc.).
However, to rule out anything specific to our setup, we looked for a simple way to reproduce it on all the ThunderX nodes we have access to, and we finally found one:
$ apt-get install stress-ng
$ stress-ng --hdd 1024
We tested different firmware versions, provided by both the chip and the board manufacturers, and with all of them the result is 100% reproducible, leading to a kernel Oops [1]:
[ 726.070531] INFO: task kworker/0:1:312 blocked for more than 120 seconds.
[ 726.077908] Tainted: G W I 4.8.0-41-generic #44~16.04.1-Ubuntu
[ 726.085850] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 726.094383] kworker/0:1 D ffff0000080861bc 0 312 2 0x00000000
[ 726.094401] Workqueue: events vmstat_shepherd
[ 726.094404] Call trace:
[ 726.094411] [<ffff0000080861bc>] __switch_to+0x94/0xa8
[ 726.094418] [<ffff0000089854f4>] __schedule+0x224/0x718
[ 726.094421] [<ffff000008985a20>] schedule+0x38/0x98
[ 726.094425] [<ffff000008985d84>] schedule_preempt_disabled+0x14/0x20
[ 726.094428] [<ffff000008987644>] __mutex_lock_slowpath+0xd4/0x168
[ 726.094431] [<ffff000008987730>] mutex_lock+0x58/0x70
[ 726.094437] [<ffff0000080c552c>] get_online_cpus+0x44/0x70
[ 726.094440] [<ffff00000820ca24>] vmstat_shepherd+0x3c/0xe8
[ 726.094446] [<ffff0000080e1c60>] process_one_work+0x150/0x478
[ 726.094449] [<ffff0000080e1fd8>] worker_thread+0x50/0x4b8
[ 726.094453] [<ffff0000080e8eac>] kthread+0xec/0x100
[ 726.094456] [<ffff000008083690>] ret_from_fork+0x10/0x40
Over the last few days, I tested all 4.8-* kernels and 4.10 (Zesty backport); the soft lockup happens with each and every one of them.
On the other hand, 4.4.0-45-generic seems to work fine under normal conditions, yet running stress-ng leads to the same oops. Newer 4.4.0-* kernels probably work fine under normal conditions too, but due to a regression in the Ethernet drivers after 4.4.0-45, we can't easily test those.
[1] http://paste.ubuntu.com/24172516/ |
For days, I have been trying to find an easy way to reproduce this.
We initially observed it in OPNFV Armband, when we tried to upgrade the kernel of our Ubuntu Xenial installation to linux-image-generic-hwe-16.04 (4.8).
In our environment, it was easily triggered on compute nodes when launching multiple VMs (we suspected OVS, QEMU, etc.).
However, to rule out anything specific to our setup, we looked for a simple way to reproduce it on all the ThunderX nodes we have access to, and we finally found one:
$ apt-get install stress-ng
$ stress-ng --hdd 1024
We tested different firmware versions, provided by both the chip and the board manufacturers, and with all of them the result is 100% reproducible, leading to a kernel Oops [1]:
[ 726.070531] INFO: task kworker/0:1:312 blocked for more than 120 seconds.
[ 726.077908] Tainted: G W I 4.8.0-41-generic #44~16.04.1-Ubuntu
[ 726.085850] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 726.094383] kworker/0:1 D ffff0000080861bc 0 312 2 0x00000000
[ 726.094401] Workqueue: events vmstat_shepherd
[ 726.094404] Call trace:
[ 726.094411] [<ffff0000080861bc>] __switch_to+0x94/0xa8
[ 726.094418] [<ffff0000089854f4>] __schedule+0x224/0x718
[ 726.094421] [<ffff000008985a20>] schedule+0x38/0x98
[ 726.094425] [<ffff000008985d84>] schedule_preempt_disabled+0x14/0x20
[ 726.094428] [<ffff000008987644>] __mutex_lock_slowpath+0xd4/0x168
[ 726.094431] [<ffff000008987730>] mutex_lock+0x58/0x70
[ 726.094437] [<ffff0000080c552c>] get_online_cpus+0x44/0x70
[ 726.094440] [<ffff00000820ca24>] vmstat_shepherd+0x3c/0xe8
[ 726.094446] [<ffff0000080e1c60>] process_one_work+0x150/0x478
[ 726.094449] [<ffff0000080e1fd8>] worker_thread+0x50/0x4b8
[ 726.094453] [<ffff0000080e8eac>] kthread+0xec/0x100
[ 726.094456] [<ffff000008083690>] ret_from_fork+0x10/0x40
Over the last few days, I tested all 4.8-* kernels and 4.10 (Zesty backport); the soft lockup happens with each and every one of them.
On the other hand, 4.4.0-45-generic seems to work fine under normal conditions, yet running stress-ng leads to the same oops. Newer 4.4.0-* kernels probably work fine under normal conditions too, but due to a regression in the Ethernet drivers after 4.4.0-45, we can't easily test those.
[1] http://paste.ubuntu.com/24172516/
---
AlsaDevices:
total 0
crw-rw---- 1 root audio 116, 1 Mar 13 19:27 seq
crw-rw---- 1 root audio 116, 33 Mar 13 19:27 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.20.1-0ubuntu2.5
Architecture: arm64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 16.04
IwConfig: Error: [Errno 2] No such file or directory
MachineType: GIGABYTE R120-T30
Package: linux (not installed)
PciMultimedia:
ProcEnviron:
TERM=vt220
PATH=(custom, no user)
XDG_RUNTIME_DIR=<set>
LANG=en_US.UTF-8
SHELL=/bin/bash
ProcFB: 0 astdrmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.8.0-41-generic root=/dev/mapper/os-root ro console=tty0 console=ttyS0,115200 console=ttyAMA0,115200 net.ifnames=1 biosdevname=0 rootdelay=90 nomodeset quiet splash vt.handoff=7
ProcVersionSignature: Ubuntu 4.8.0-41.44~16.04.1-generic 4.8.17
RelatedPackageVersions:
linux-restricted-modules-4.8.0-41-generic N/A
linux-backports-modules-4.8.0-41-generic N/A
linux-firmware 1.157.8
RfKill: Error: [Errno 2] No such file or directory
Tags: xenial
Uname: Linux 4.8.0-41-generic aarch64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:
_MarkForUpload: True
dmi.bios.date: 11/22/2016
dmi.bios.vendor: GIGABYTE
dmi.bios.version: T22
dmi.board.asset.tag: 01234567890123456789AB
dmi.board.name: MT30-GS0
dmi.board.vendor: GIGABYTE
dmi.board.version: 01234567
dmi.chassis.asset.tag: 01234567890123456789AB
dmi.chassis.type: 17
dmi.chassis.vendor: GIGABYTE
dmi.chassis.version: 01234567
dmi.modalias: dmi:bvnGIGABYTE:bvrT22:bd11/22/2016:svnGIGABYTE:pnR120-T30:pvr0100:rvnGIGABYTE:rnMT30-GS0:rvr01234567:cvnGIGABYTE:ct17:cvr01234567:
dmi.product.name: R120-T30
dmi.product.version: 0100
dmi.sys.vendor: GIGABYTE |
|
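When comparing console logs from several kernels, it can help to extract hung-task reports mechanically instead of reading them by eye. The following is a minimal sketch in Python. The helper `parse_hung_tasks` and the `HungTask` type are hypothetical and not part of any Ubuntu or kernel tooling. It assumes the `INFO: task <comm>:<pid> blocked for more than <N> seconds.` header and the `[<address>] function+0xoffset/0xsize` frame format shown in the call trace in the description above, and it simply groups each header with the frames that follow it.

```python
import re
from dataclasses import dataclass, field

# Hung-task detector header, e.g.
# "[  726.070531] INFO: task kworker/0:1:312 blocked for more than 120 seconds."
# The task name may itself contain ':' (kworker/0:1), so the greedy (.+)
# leaves the PID as the last ':'-separated number before " blocked".
HEADER_RE = re.compile(r"INFO: task (.+):(\d+) blocked for more than (\d+) seconds\.")

# Call-trace frame in the format shown above (assumed), e.g.
# "[  726.094411] [<ffff0000080861bc>] __switch_to+0x94/0xa8"
FRAME_RE = re.compile(r"\[<([0-9a-f]+)>\] ([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")


@dataclass
class HungTask:
    comm: str
    pid: int
    timeout_secs: int
    # Each frame is (function name, offset, function size), both ints.
    frames: list = field(default_factory=list)


def parse_hung_tasks(log: str) -> list:
    """Group every hung-task header with the call-trace frames that follow it."""
    reports = []
    current = None
    for line in log.splitlines():
        header = HEADER_RE.search(line)
        if header:
            current = HungTask(header.group(1), int(header.group(2)), int(header.group(3)))
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        # Frames seen before any header (e.g. from an unrelated splat) are ignored.
        if frame and current is not None:
            current.frames.append(
                (frame.group(2), int(frame.group(3), 16), int(frame.group(4), 16))
            )
    return reports
```

Fed the trace from the description, this yields a single report for `kworker/0:1` (PID 312, 120-second timeout) whose frames include `get_online_cpus` and `vmstat_shepherd`, which makes it straightforward to check whether other kernels block in the same place.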
2017-03-13 20:21:55 |
Alexandru Avadanii |
attachment added |
|
CRDA.txt https://bugs.launchpad.net/bugs/1672521/+attachment/4837212/+files/CRDA.txt |
|
2017-03-13 20:21:57 |
Alexandru Avadanii |
attachment added |
|
CurrentDmesg.txt https://bugs.launchpad.net/bugs/1672521/+attachment/4837213/+files/CurrentDmesg.txt |
|
2017-03-13 20:21:58 |
Alexandru Avadanii |
attachment added |
|
JournalErrors.txt https://bugs.launchpad.net/bugs/1672521/+attachment/4837214/+files/JournalErrors.txt |
|
2017-03-13 20:22:01 |
Alexandru Avadanii |
attachment added |
|
Lspci.txt https://bugs.launchpad.net/bugs/1672521/+attachment/4837215/+files/Lspci.txt |
|
2017-03-13 20:22:03 |
Alexandru Avadanii |
attachment added |
|
Lsusb.txt https://bugs.launchpad.net/bugs/1672521/+attachment/4837216/+files/Lsusb.txt |
|
2017-03-13 20:22:05 |
Alexandru Avadanii |
attachment added |
|
ProcCpuinfo.txt https://bugs.launchpad.net/bugs/1672521/+attachment/4837217/+files/ProcCpuinfo.txt |
|
2017-03-13 20:22:06 |
Alexandru Avadanii |
attachment added |
|
ProcInterrupts.txt https://bugs.launchpad.net/bugs/1672521/+attachment/4837218/+files/ProcInterrupts.txt |
|
2017-03-13 20:22:08 |
Alexandru Avadanii |
attachment added |
|
ProcModules.txt https://bugs.launchpad.net/bugs/1672521/+attachment/4837219/+files/ProcModules.txt |
|
2017-03-13 20:22:11 |
Alexandru Avadanii |
attachment added |
|
UdevDb.txt https://bugs.launchpad.net/bugs/1672521/+attachment/4837220/+files/UdevDb.txt |
|
2017-03-13 20:22:13 |
Alexandru Avadanii |
attachment added |
|
WifiSyslog.txt https://bugs.launchpad.net/bugs/1672521/+attachment/4837221/+files/WifiSyslog.txt |
|
2017-03-13 20:30:15 |
Brad Figg |
linux (Ubuntu): status |
New |
Confirmed |
|
2017-03-14 18:02:04 |
Joseph Salisbury |
linux (Ubuntu): importance |
Undecided |
High |
|
2017-03-14 18:03:37 |
Joseph Salisbury |
nominated for series |
|
Ubuntu Zesty |
|
2017-03-14 18:03:37 |
Joseph Salisbury |
bug task added |
|
linux (Ubuntu Zesty) |
|
2017-03-14 18:03:37 |
Joseph Salisbury |
nominated for series |
|
Ubuntu Yakkety |
|
2017-03-14 18:03:37 |
Joseph Salisbury |
bug task added |
|
linux (Ubuntu Yakkety) |
|
2017-03-14 18:03:49 |
Joseph Salisbury |
linux (Ubuntu Yakkety): status |
New |
Triaged |
|
2017-03-14 18:03:52 |
Joseph Salisbury |
linux (Ubuntu Zesty): status |
Confirmed |
Triaged |
|
2017-03-14 18:03:55 |
Joseph Salisbury |
linux (Ubuntu Yakkety): importance |
Undecided |
High |
|
2017-03-14 18:04:11 |
Joseph Salisbury |
tags |
apport-collected xenial |
apport-collected kernel-da-key needs-bisect xenial yakkety zesty |
|
2017-03-14 19:19:39 |
Ciprian Barbu |
attachment added |
|
dmesg.log https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1672521/+attachment/4837761/+files/dmesg.log |
|
2017-03-14 19:50:34 |
Alexandru Avadanii |
attachment added |
|
ThunderX 4.11-rc1 console log https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1672521/+attachment/4837770/+files/thunderx_4.11_rc1_console_log.txt |
|
2017-03-20 14:59:59 |
dann frazier |
bug |
|
|
added subscriber dann frazier |
2017-03-21 12:42:28 |
Raghuram Kota |
bug |
|
|
added subscriber Raghuram Kota |
2017-03-21 17:46:18 |
Andrew Cloke |
bug |
|
|
added subscriber Andrew Cloke |
2017-04-01 13:15:27 |
Richard |
bug |
|
|
added subscriber Richard |
2017-07-26 15:23:15 |
Andy Whitcroft |
linux (Ubuntu Yakkety): status |
Triaged |
Won't Fix |
|