MD RAID 6 Periodic Kernel Panic Stack Overflow Double-Fault

Bug #1929591 reported by Jake Staehle
This bug affects 3 people
Affects: mdadm (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Hello:
Every few days I get a kernel panic on my Ubuntu Server 20.10 box, which was recently upgraded to a Ryzen 3700X. I have 7 WD Red Pro HDDs in a RAID 6 array with Linux MD, all attached to an LSI 9211-8ik PCIe card. The motherboard is currently a Gigabyte B550M Aorus Pro. My Ubuntu install is running the latest 5.8.0-53 kernel.

This is the second hardware configuration to produce the exact same kernel panic text. Previously these HDDs were connected directly to the SATA controller of an ASRock X570 Pro4 ATX mobo with the same 3700X. I was also previously running Ubuntu Server 20.04 LTS; I upgraded to 20.10 in the hope that the newer kernel would fix it, but it did not.

I posted the whole story of this journey on Super User, if you're interested: https://superuser.com/questions/1615400/md-raid-6-periodic-kernel-panic-possible-kernel-bug

However, I am now convinced this is a Linux kernel bug in the MD driver.

Example 1 kernel panic:

[406005.583315] BUG: stack guard page was hit at 000000007cbff150 (stack is 000000003b7072a2..00000000dac5ed08)
[406005.583315] kernel stack overflow (double-fault): 0000 [#1] SMP NOPTI
[406005.583315] CPU: 15 PID: 514 Comm: md0_raid6 Tainted: P OE 5.8.0-36-generic #40-Ubuntu
[406005.583316] Hardware name: Gigabyte Technology Co., Ltd. B550M AORUS PRO/B550M AORUS PRO, BIOS F1 05/19/2020
[406005.583316] RIP: 0010:slab_free_freelist_hook+0x35/0x120
[406005.583316] Code: 89 d5 41 54 49 89 f4 53 48 89 fb 48 83 ec 08 48 8b 02 4c 8b 36 48 c7 06 00 00 00 00 48 c7 02 00 00 00 00 48 85 c0 49 0f 44 c6 <48> 89 45 d0 eb 06 4c 3b 7d d0 74 5d 8b 53 20 4d 89 f7 49 8d 34 16
[406005.583316] RSP: 0018:ffffa620c06e3ff8 EFLAGS: 00010246
[406005.583317] RAX: ffff9aaf36f54720 RBX: ffff9ab34b407800 RCX: 0000000000000001
[406005.583317] RDX: ffffa620c06e4040 RSI: ffffa620c06e4038 RDI: ffff9ab34b407800
[406005.583317] RBP: ffffa620c06e4028 R08: 0000000000000001 R09: ffffffffb9c54500
[406005.583318] R10: ffff9aaf36f54fe0 R11: 0000000000000001 R12: ffffa620c06e4038
[406005.583318] R13: ffffa620c06e4040 R14: ffff9aaf36f54720 R15: ffff9ab2925cbd10
[406005.583318] FS: 0000000000000000(0000) GS:ffff9ab34edc0000(0000) knlGS:0000000000000000
[406005.583318] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[406005.583318] CR2: ffffa620c06e3fe8 CR3: 00000005d52ac000 CR4: 0000000000340ee0
[406005.583319] Call Trace:
[406005.583319] ? mempool_kfree+0xe/0x10
[406005.583319] ? kfree+0xb8/0x220
[406005.583319] ? mempool_kfree+0xe/0x10
[406005.583319] ? mempool_free+0x2f/0x80
[406005.583319] ? md_end_io+0x4b/0x70
[406005.583319] ? bio_endio+0xe6/0x150
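
As a quick sanity check on the register dump above, the addresses can be compared against a page boundary. The sketch below assumes 4 KiB pages (the dump does not state the page size) and uses only the RSP, RBP, and CR2 values printed in Example 1; the helper name page_base is made up for illustration. It shows that RSP and the faulting address in CR2 both sit in the page immediately below the one holding RBP, which is consistent with the "stack guard page was hit" message, although it does not by itself identify the cause.

```python
# Register values copied verbatim from the "Example 1" panic above.
PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual x86_64 base page size

regs = {
    "RSP": 0xffffa620c06e3ff8,
    "RBP": 0xffffa620c06e4028,
    "CR2": 0xffffa620c06e3fe8,
}


def page_base(addr: int) -> int:
    """Round an address down to the start of its page (hypothetical helper)."""
    return addr & ~(PAGE_SIZE - 1)


# Use the page that holds RBP as the reference boundary.
boundary = page_base(regs["RBP"])

for name, value in regs.items():
    offset = value - boundary
    print(f"{name}: {value:#x}  page={page_base(value):#x}  offset={offset:+#x}")

# RSP and CR2 share a page, and that page is exactly one page below RBP's.
same_page = page_base(regs["RSP"]) == page_base(regs["CR2"])
one_below = boundary - page_base(regs["RSP"]) == PAGE_SIZE
print("RSP and CR2 in same page:", same_page)
print("that page is directly below RBP's page:", one_below)
```

In this dump RBP - 0x30 also equals RSP, which matches the store to -0x30(%rbp) that the marked `<48> 89 45 d0` bytes in the Code line appear to encode.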

Example 2 kernel panic with old mobo:

[161342.301305] BUG: stack guard page was hit at 00000000fc60f228 (stack is 00000000875efe77..000000003f38a379)
[161342.301306] kernel stack overflow (double-fault): 0000 [#1] SMP NOPTI
[161342.301306] CPU: 10 PID: 465 Comm: md0_raid6 Tainted: P OE 5.8.0-33-generic #36-Ubuntu
[161342.301307] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X570 Pro4, BIOS P3.60 12/01/2020
[161342.301307] RIP: 0010:slab_free_freelist_hook+0x35/0x120
[161342.301308] Code: 89 d5 41 54 49 89 f4 53 48 89 fb 48 83 ec 08 48 8b 02 4c 8b 36 48 c7 06 00 00 00 00 48 c7 02 00 00 00 00 48 85 c0 49 0f 44 c6 <48> 89 45 d0 eb 06 4c 3b 7d d0 74 5d 8b 53 20 4d 89 f7 49 8d 34 16
[161342.301308] RSP: 0018:ffffa86b00c6fff8 EFLAGS: 00010246
[161342.301309] RAX: ffff98edc21cac40 RBX: ffff98ef0b407800 RCX: 0000000000000001
[161342.301310] RDX: ffffa86b00c70040 RSI: ffffa86b00c70038 RDI: ffff98ef0b407800
[161342.301310] RBP: ffffa86b00c70028 R08: 0000000000000001 R09: ffffffff85854500
[161342.301311] R10: ffff98edc21ca100 R11: 0000000000000001 R12: ffffa86b00c70038
[161342.301311] R13: ffffa86b00c70040 R14: ffff98edc21cac40 R15: ffff98e9b53d74d8
[161342.301311] FS: 0000000000000000(0000) GS:ffff98ef0ec80000(0000) knlGS:0000000000000000
[161342.301312] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[161342.301312] CR2: ffffa86b00c6ffe8 CR3: 00000007fa766000 CR4: 0000000000340ee0
[161342.301312] Call Trace:
[161342.301313] ? mempool_kfree+0xe/0x10
[161342.301313] ? kfree+0xb8/0x220
[161342.301313] ? mempool_kfree+0xe/0x10
[161342.301313] ? mempool_free+0x2f/0x80
[161342.301314] ? md_end_io+0x4b/0x70
[161342.301314] ? bio_endio+0xe6/0x150
[161342.301314] ? bio_chain_endio+0x2d/0x40
[161342.301315] ? md_end_io+0x5d/0x70
[161342.301315] ? bio_endio+0xe6/0x150
[161342.301315] ? bio_chain_endio+0x2d/0x40
[161342.301315] ? md_end_io+0x5d/0x70
[161342.301316] ? bio_endio+0xe6/0x150
[161342.301316] ? bio_chain_endio+0x2d/0x40
[161342.301316] ? md_end_io+0x5d/0x70
[161342.301316] ? bio_endio+0xe6/0x150
[161342.301317] ? bio_chain_endio+0x2d/0x40
[161342.301317] ? md_end_io+0x5d/0x70
[161342.301317] ? bio_endio+0xe6/0x150
[161342.301317] ? bio_chain_endio+0x2d/0x40
...
[161342.301379] ? md_end_io+0x5d/0x70
[161342.301379] ? bio_endio+0xe6/0x150
[161342.301380] ? bio_chain_endio+0x2d/0x40
[161342.301380] ? md_end_io+0x5d/0x70
[161342.301380] ? bio_endio+0xe6/0x150
[161342.301380] ? bio_ch
[161342.301381] Lost 296 message(s)!
[ 0.000000] Linux version 5.8.0-33-generic (buildd@lgw01-amd64-036) (gcc (Ubuntu 10.2.0-13ubuntu1) 10.2.0, GNU ld (GNU Binutils for Ubuntu) 2.35.1) #36-Ubuntu SMP Wed Dec 9 09:14:40 UTC 2020 (Ubuntu 5.8.0-33.36-generic 5.8.17)

I can provide newer kernel panics or other info if needed. Thanks!
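
For anyone comparing captures, the repetition in the call traces above can be checked mechanically rather than by eye. This is a minimal sketch, not an official tool: the regular expression only assumes the "[timestamp] ? symbol+0xOFF/0xLEN" line shape seen in this report, the function names frames and repeating_cycle are made up for illustration, and the sample is a short excerpt copied from Example 2. It finds the shortest group of frames that the tail of the trace keeps repeating.

```python
import re

# A short excerpt copied from the "Example 2" trace; paste a full capture here.
SAMPLE = """\
[161342.301313] ? mempool_free+0x2f/0x80
[161342.301314] ? md_end_io+0x4b/0x70
[161342.301314] ? bio_endio+0xe6/0x150
[161342.301314] ? bio_chain_endio+0x2d/0x40
[161342.301315] ? md_end_io+0x5d/0x70
[161342.301315] ? bio_endio+0xe6/0x150
[161342.301315] ? bio_chain_endio+0x2d/0x40
[161342.301315] ? md_end_io+0x5d/0x70
[161342.301316] ? bio_endio+0xe6/0x150
"""

# Matches "[ts] ? symbol+0xOFF/0xLEN"; truncated or unrelated lines are skipped.
FRAME = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?:\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$"
)


def frames(text):
    """Extract the symbol name from every call-trace line in `text`."""
    names = []
    for line in text.splitlines():
        match = FRAME.match(line)
        if match:
            names.append(match.group(1))
    return names


def repeating_cycle(names, max_period=8):
    """Return (cycle, frames_covered) for the shortest cycle ending the trace.

    Returns None when the last frames do not repeat with any period up to
    `max_period`.
    """
    for period in range(1, max_period + 1):
        if len(names) < 2 * period:
            break
        if names[-period:] != names[-2 * period:-period]:
            continue
        # Walk backwards while each frame equals the one `period` frames later.
        start = len(names) - period
        while start > 0 and names[start - 1] == names[start - 1 + period]:
            start -= 1
        return names[start:start + period], len(names) - start
    return None


result = repeating_cycle(frames(SAMPLE))
if result is not None:
    cycle, covered = result
    print(" -> ".join(cycle), f"({covered} frames follow this pattern)")
```

On the excerpt, the cycle it reports is md_end_io, bio_endio, bio_chain_endio. Running it on a full serial capture would show how many logged frames follow the same pattern before the output is cut off.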

ProblemType: Bug
DistroRelease: Ubuntu 20.10
Package: mdadm 4.1-5ubuntu5
ProcVersionSignature: Ubuntu 5.8.0-53.60-generic 5.8.18
Uname: Linux 5.8.0-53-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia
ApportVersion: 2.20.11-0ubuntu50.5
Architecture: amd64
CasperMD5CheckResult: pass
Date: Tue May 25 12:11:44 2021
InstallationDate: Installed on 2020-11-23 (182 days ago)
InstallationMedia: Ubuntu-Server 20.10 "Groovy Gorilla" - Release amd64 (20201022)
MachineType: Gigabyte Technology Co., Ltd. B550M AORUS PRO
ProcEnviron:
 TERM=screen-256color
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-5.8.0-53-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro console=tty1 console=ttyS0,115200 processor.max_cstate=5 rcu_nocbs=0-15
SourcePackage: mdadm
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 05/19/2020
dmi.bios.release: 5.17
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: F1
dmi.board.asset.tag: Default string
dmi.board.name: B550M AORUS PRO
dmi.board.vendor: Gigabyte Technology Co., Ltd.
dmi.board.version: x.x
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 3
dmi.chassis.vendor: Default string
dmi.chassis.version: Default string
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvrF1:bd05/19/2020:br5.17:svnGigabyteTechnologyCo.,Ltd.:pnB550MAORUSPRO:pvrDefaultstring:rvnGigabyteTechnologyCo.,Ltd.:rnB550MAORUSPRO:rvrx.x:cvnDefaultstring:ct3:cvrDefaultstring:
dmi.product.family: Default string
dmi.product.name: B550M AORUS PRO
dmi.product.sku: Default string
dmi.product.version: Default string
dmi.sys.vendor: Gigabyte Technology Co., Ltd.
etc.blkid.tab: Error: [Errno 2] No such file or directory: '/etc/blkid.tab'
mtime.conffile..etc.apport.crashdb.conf: 2020-11-24T13:52:10.563946

Jake Staehle (staehle) wrote :

Hey so this is totally still happening on kernel 5.8.0-53. Just got this serial console capture:

babylon login: [1457468.880947] BUG: stack guard page was hit at 000000007aef1a4a (stack is 00000000af9c61cd..000000007ccda653)
[1457468.880948] kernel stack overflow (double-fault): 0000 [#1] SMP NOPTI
[1457468.880948] CPU: 3 PID: 512 Comm: md0_raid6 Tainted: P OE 5.8.0-53-generic #60-Ubuntu
[1457468.880949] Hardware name: Gigabyte Technology Co., Ltd. B550M AORUS PRO/B550M AORUS PRO, BIOS F13h 04/23/2021
[1457468.880949] RIP: 0010:slab_free_freelist_hook+0x35/0x120
[1457468.880950] Code: 89 d5 41 54 49 89 f4 53 48 89 fb 48 83 ec 08 48 8b 02 4c 8b 36 48 c7 06 00 00 00 00 48 c7 02 00 00 00 00 48 85 c0 49 0f 44 c6 <48> 89 45 d0 eb 06 4c 3b 7d d0 74 5d 8b 53 20 4d 89 f7 49 8d 34 16
[1457468.880951] RSP: 0018:ffffbcda805efff8 EFLAGS: 00010246
[1457468.880952] RAX: ffff9bfb8ccc42a0 RBX: ffff9bfcdb407800 RCX: 0000000000000001
[1457468.880952] RDX: ffffbcda805f0040 RSI: ffffbcda805f0038 RDI: ffff9bfcdb407800
[1457468.880953] RBP: ffffbcda805f0028 R08: 0000000000000001 R09: ffffffff90841600
[1457468.880953] R10: ffff9bfb8ccc4f40 R11: 0000000000000001 R12: ffffbcda805f0038
[1457468.880953] R13: ffffbcda805f0040 R14: ffff9bfb8ccc42a0 R15: ffff9bf766967940
[1457468.880954] FS: 0000000000000000(0000) GS:ffff9bfcdeac0000(0000) knlGS:0000000000000000
[1457468.880954] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1457468.880955] CR2: ffffbcda805effe8 CR3: 00000003a65ee000 CR4: 0000000000340ee0
[1457468.880955] Call Trace:
[1457468.880955] ? mempool_kfree+0xe/0x10
[1457468.880956] ? kfree+0xb8/0x220
[1457468.880956] ? mempool_kfree+0xe/0x10
[1457468.880956] ? mempool_free+0x2f/0x80
[1457468.880956] ? md_end_io+0x4b/0x70
[1457468.880957] ? bio_endio+0xe6/0x150
[1457468.880957] ? bio_chain_endio+0x2d/0x40
[1457468.880957] ? md_end_io+0x5d/0x70
[1457468.880958] ? bio_endio+0xe6/0x150
[1457468.880958] ? bio_chain_endio+0x2d/0x40
[1457468.880958] ? md_end_io+0x5d/0x70
[1457468.880959] ? bio_endio+0xe6/0x150
[1457468.880959] ? bio_chain_endio+0x2d/0x40
[1457468.880959] ? md_end_io+0x5d/0x70
[1457468.880959] ? bio_endio+0xe6/0x150
[1457468.880960] ? bio_chain_endio+0x2d/0x40
[1457468.880960] ? md_end_io+0x5d/0x70
[1457468.880960] ? bio_endio+0xe6/0x150
[1457468.880960] ? bio_chain_endio+0x2d/0x40
[1457468.880961] ? md_end_io+0x5d/0x70
[1457468.880961] ? bio_endio+0xe6/0x150
[1457468.880961] ? bio_chain_endio+0x2d/0x40
[1457468.880962] ? md_end_io+0x5d/0x70
[1457468.880962] ? bio_endio+0xe6/0x150
[1457468.880962] ? bio_chain_endio+0x2d/0x40
[1457468.880962] ? md_end_io+0x5d/0x70
[1457468.880963] ? bio_endio+0xe6/0x150
[1457468.880963] ? bio_chain_endio+0x2d/0x40
[1457468.880963] ? md_end_io+0x5d/0x70
[1457468.880963] ? bio_endio+0xe6/0x150
[1457468.880964] ? bio_chain_endio+0x2d/0x40
[1457468.880964] ? md_end_io+0x5d/0x70
[1457468.880964] ? bio_endio+0xe6/0x150
[1457468.880965] ? bio_chain_endio+0x2d/0x40
[1457468.880965] ? md_end_io+0x5d/0x70
[1457468.880965] ? bio_endio+0xe6/0x150
[1457468.880965] ? bio_chain_endio+0x2d/0x40
[1457468.880966] ? md_end_io+0x5d/0x70
[1457468.880966] ? bio_endio+0xe6/0x150
[1457...

Jake Staehle (staehle) wrote :

Another one today on 5.8.0-55:

[ OK ] Started Hostname Service.
[ OK ] Started User Login Management.
[ OK ] Started Docker Application Container Engine.

Ubuntu 20.10 babylon ttyS0

babylon login: [ 43.284962] cloud-init[6278]: Cloud-init v. 21.2-3-g899bfaa9-0ubuntu2~20.10.1 running 'modules:config' at Thu, 17 Jun 2021 04:50:23 +0000. Up 43.23 seconds.
[ 43.555449] cloud-init[6294]: Cloud-init v. 21.2-3-g899bfaa9-0ubuntu2~20.10.1 running 'modules:final' at Thu, 17 Jun 2021 04:50:23 +0000. Up 43.48 seconds.
[ 43.555559] cloud-init[6294]: Cloud-init v. 21.2-3-g899bfaa9-0ubuntu2~20.10.1 finished at Thu, 17 Jun 2021 04:50:23 +0000. Datasource DataSourceNone. Up 43.55 seconds
[ 43.555598] cloud-init[6294]: 2021-06-17 04:50:23,906 - cc_final_message.py[WARNING]: Used fallback datasource
[470667.791418] BUG: stack guard page was hit at 000000006cd7c52c (stack is 00000000b38fb7cf..00000000d2b542d2)
[470667.791418] kernel stack overflow (double-fault): 0000 [#1] SMP NOPTI
[470667.791418] CPU: 15 PID: 514 Comm: md0_raid6 Tainted: P OE 5.8.0-55-generic #62-Ubuntu
[470667.791419] Hardware name: Gigabyte Technology Co., Ltd. B550M AORUS PRO/B550M AORUS PRO, BIOS F13h 04/23/2021
[470667.791419] RIP: 0010:slab_free_freelist_hook+0x35/0x120
[470667.791419] Code: 89 d5 41 54 49 89 f4 53 48 89 fb 48 83 ec 08 48 8b 02 4c 8b 36 48 c7 06 00 00 00 00 48 c7 02 00 00 00 00 48 85 c0 49 0f 44 c6 <48> 89 45 d0 eb 06 4c 3b 7d d0 74 5d 8b 53 20 4d 89 f7 49 8d 34 16
[470667.791419] RSP: 0018:ffff9b13808b3ff8 EFLAGS: 00010246
[470667.791420] RAX: ffff8c43bd86d9c0 RBX: ffff8c459b407800 RCX: 0000000000000001
[470667.791420] RDX: ffff9b13808b4040 RSI: ffff9b13808b4038 RDI: ffff8c459b407800
[470667.791420] RBP: ffff9b13808b4028 R08: 0000000000000001 R09: ffffffffae641900
[470667.791421] R10: ffff8c43bd86d1e0 R11: 0000000000000001 R12: ffff9b13808b4038
[470667.791421] R13: ffff9b13808b4040 R14: ffff8c43bd86d9c0 R15: ffff8c4585ea1070
[470667.791421] FS: 0000000000000000(0000) GS:ffff8c459edc0000(0000) knlGS:0000000000000000
[470667.791421] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[470667.791422] CR2: ffff9b13808b3fe8 CR3: 00000007948a8000 CR4: 0000000000340ee0
[470667.791422] Call Trace:
[470667.791422] ? mempool_kfree+0xe/0x10
[470667.791422] ? kfree+0xb8/0x220
[470667.791422] ? mempool_kfree+0xe/0x10
[470667.791422] ? mempool_free+0x2f/0x80
[470667.791422] ? md_end_io+0x4b/0x70
[470667.791423] ? bio_endio+0xe6/0x150
[470667.791423] ? bio_chain_endio+0x2d/0x40
[470667.791423] ? md_end_io+0x5d/0x70
[470667.791423] ? bio_endio+0xe6/0x150
[470667.791423] ? bio_chain_endio+0x2d/0x40
[470667.791423] ? md_end_io+0x5d/0x70
[470667.791423] ? bio_endio+0xe6/0x150
[470667.791424] ? bio_chain_endio+0x2d/0x40
[470667.791424] ? md_end_io+0x5d/0x70
[470667.791424] ? bio_endio+0xe6/0x150
[470667.791424] ? bio_chain_endio+0x2d/0x40
[470667.791424] ? md_end_io+0x5d/0x70
[470667.791424] ? bio_endio+0xe6/0x150
[470667.791424] ? bio_chain_endio+0x2d/0x40
[470667.791424] ? md_end_io+0x5d/0x70
[470667.791425] ? bio_endio+0xe6/0x150
[470667.791425] ? bio_chain_endio+0x2d/0x40
[470667.791425] ? md_end_io+0x5d/0x70
[470667.791425] ? bio_e...

Matt Thompson (prevailion) wrote :

I am experiencing this crash on an AWS i3.metal instance using mdadm.

There appear to be upstream patches for this issue:

https://lore.kernel.org<email address hidden>/T/

http://lkml.iu.edu/hypermail/linux/kernel/2107.1/04478.html
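
The traces in this report suggest that completions are nesting: md_end_io, bio_endio, and bio_chain_endio appear over and over until the stack runs out. As a rough illustration of why that pattern is a problem for a fixed-size stack, here is a toy model in Python. It is not kernel code and makes no claim about what the upstream patches do; Node, build_chain, complete_nested, complete_flat, and the budget parameter are all hypothetical names, and the budget is just a stand-in for a stack of limited size.

```python
class Node:
    """Hypothetical stand-in for one I/O request chained to a parent."""

    def __init__(self, parent=None):
        self.parent = parent
        self.done = False


class StackBudgetExceeded(Exception):
    """Raised by the toy model when nesting exceeds the allowed depth."""


def build_chain(length):
    """Build `length` nodes, each pointing at the previous one as its parent."""
    node = None
    for _ in range(length):
        node = Node(parent=node)
    return node  # the innermost node, which completes first


def complete_nested(node, budget, depth=0):
    """Complete a node, then its parent from inside the same call.

    Every link in the chain costs one more level of nesting, so the depth
    grows with the chain length.
    """
    if depth >= budget:
        raise StackBudgetExceeded(f"nesting depth {depth} reached the budget")
    node.done = True
    if node.parent is not None:
        complete_nested(node.parent, budget, depth + 1)


def complete_flat(node):
    """Complete the same chain in a loop, using constant nesting depth."""
    completed = 0
    while node is not None:
        node.done = True
        completed += 1
        node = node.parent
    return completed


if __name__ == "__main__":
    try:
        complete_nested(build_chain(1000), budget=200)
    except StackBudgetExceeded as exc:
        print("nested:", exc)
    print("flat:", complete_flat(build_chain(1000)), "requests completed")
```

The nested version fails once the chain is longer than the budget, while the loop handles any length. Whether the real fix takes this shape is something only the linked patches can answer.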

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in mdadm (Ubuntu):
status: New → Confirmed