ThunderX: soft lockup on 4.8+ kernels
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Triaged | High | Unassigned |
Yakkety | Won't Fix | High | Unassigned |
Zesty | Triaged | High | Unassigned |
Bug Description
I have been trying to find an easy way to reproduce this for days.
We initially observed it in OPNFV Armband, when we tried to upgrade our Ubuntu Xenial installation kernel to linux-image-
In our environment, this was easily triggered on compute nodes, when launching multiple VMs (we suspected OVS, QEMU etc.).
However, in order to rule out our specifics, we looked for a simple way to reproduce it on all ThunderX nodes we have access to, and we finally found it:
$ apt-get install stress-ng
$ stress-ng --hdd 1024
We tested different FW versions, provided by both chip/board manufacturers, and with all of them the result is 100% reproducible, leading to a kernel Oops [1]:
[ 726.070531] INFO: task kworker/0:1:312 blocked for more than 120 seconds.
[ 726.077908] Tainted: G W I 4.8.0-41-generic #44~16.04.1-Ubuntu
[ 726.085850] "echo 0 > /proc/sys/
[ 726.094383] kworker/0:1 D ffff0000080861bc 0 312 2 0x00000000
[ 726.094401] Workqueue: events vmstat_shepherd
[ 726.094404] Call trace:
[ 726.094411] [<ffff000008086
[ 726.094418] [<ffff000008985
[ 726.094421] [<ffff000008985
[ 726.094425] [<ffff000008985
[ 726.094428] [<ffff000008987
[ 726.094431] [<ffff000008987
[ 726.094437] [<ffff0000080c5
[ 726.094440] [<ffff00000820c
[ 726.094446] [<ffff0000080e1
[ 726.094449] [<ffff0000080e1
[ 726.094453] [<ffff0000080e8
[ 726.094456] [<ffff000008083
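To check other nodes for the same symptom, hung-task reports like the one above can be pulled out of the kernel log with a small filter. This is only a sketch: `hung_tasks` is a hypothetical helper name, and the pattern assumes the exact "INFO: task ... blocked for more than N seconds." format shown in the trace above.

```shell
#!/bin/sh
# hung_tasks: read kernel log lines on stdin and print "<task> <pid> <seconds>"
# for every hung-task report found.
# Assumed input format (taken from the log above):
#   [ 726.070531] INFO: task kworker/0:1:312 blocked for more than 120 seconds.
# The task name may itself contain ':' (e.g. kworker/0:1), so the PID is taken
# to be the last ':'-separated numeric field before " blocked".
hung_tasks() {
    sed -n 's/^.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than \([0-9][0-9]*\) seconds.*$/\1 \2 \3/p'
}

# Usage (reading the kernel log may require root on some systems):
#   dmesg | hung_tasks
```

Running `dmesg | hung_tasks` while `stress-ng --hdd 1024` is active in another shell should list each blocked task once the hung-task timeout expires. No output means no report matching this format was logged.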
Over the last few days, I tested all 4.8-* kernels and 4.10 (zesty backport); the soft lockup happens with each and every one of them.
On the other hand, 4.4.0-45-generic seems to work perfectly fine under normal conditions (newer 4.4.0-* kernels probably do too, but a regression in the Ethernet drivers after 4.4.0-45 keeps us from testing them with ease), yet running stress-ng on it leads to the same oops.
[1] http://
---
AlsaDevices:
total 0
crw-rw---- 1 root audio 116, 1 Mar 13 19:27 seq
crw-rw---- 1 root audio 116, 33 Mar 13 19:27 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.20.1-0ubuntu2.5
Architecture: arm64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 16.04
IwConfig: Error: [Errno 2] No such file or directory
MachineType: GIGABYTE R120-T30
Package: linux (not installed)
PciMultimedia:
ProcEnviron:
TERM=vt220
PATH=(custom, no user)
XDG_RUNTIME_
LANG=en_US.UTF-8
SHELL=/bin/bash
ProcFB: 0 astdrmfb
ProcKernelCmdLine: BOOT_IMAGE=
ProcVersionSign
RelatedPackageV
linux-
linux-
linux-firmware 1.157.8
RfKill: Error: [Errno 2] No such file or directory
Tags: xenial
Uname: Linux 4.8.0-41-generic aarch64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:
_MarkForUpload: True
dmi.bios.date: 11/22/2016
dmi.bios.vendor: GIGABYTE
dmi.bios.version: T22
dmi.board.
dmi.board.name: MT30-GS0
dmi.board.vendor: GIGABYTE
dmi.board.version: 01234567
dmi.chassis.
dmi.chassis.type: 17
dmi.chassis.vendor: GIGABYTE
dmi.chassis.
dmi.modalias: dmi:bvnGIGABYTE
dmi.product.name: R120-T30
dmi.product.
dmi.sys.vendor: GIGABYTE
apport information