mdcheck/checkarray locks up system
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
mdadm (Ubuntu) |
New
|
Undecided
|
Unassigned |
Bug Description
We hit his bug in Ubuntu 22.04 with kernel 5.15.0-67-generic with a 3 drive RAID5.
We did not see this bug with 18.04.
The system becomes unresponsive and rebooting seems to be the only way to fix it.
There seems to be two bugs:
- a bug with pausing the raid check (check triggered by systemd)
- a bug with the raid check when it hasn't been paused (check triggered by systemd, or cron)
These are the logs on the first tine we say it, shortly after upgrading from 18.04 all the way to 22.04.
Jun 4 12:48:11 server1 kernel: [1622699.548591] INFO: task md0_raid5:406 blocked for more than 120 seconds.
Jun 4 12:48:11 server1 kernel: [1622699.556202] Tainted: G OE 5.15.0-67-generic #74-Ubuntu
Jun 4 12:48:11 server1 kernel: [1622699.564101] "echo 0 > /proc/sys/
Jun 4 12:48:11 server1 kernel: [1622699.573063] task:md0_raid5 state:D stack: 0 pid: 406 ppid: 2 flags:0x00004000
Jun 4 12:48:11 server1 kernel: [1622699.573077] Call Trace:
Jun 4 12:48:11 server1 kernel: [1622699.573081] <TASK>
Jun 4 12:48:11 server1 kernel: [1622699.573087] __schedule+
Jun 4 12:48:11 server1 kernel: [1622699.573103] schedule+0x69/0x110
Jun 4 12:48:11 server1 kernel: [1622699.573115] raid5d+0x3d9/0x5f0 [raid456]
Jun 4 12:48:11 server1 kernel: [1622699.573140] ? wait_woken+
Jun 4 12:48:11 server1 kernel: [1622699.573151] md_thread+
Jun 4 12:48:11 server1 kernel: [1622699.573162] ? wait_woken+
Jun 4 12:48:11 server1 kernel: [1622699.573169] ? md_write_
Jun 4 12:48:11 server1 kernel: [1622699.573176] kthread+0x12a/0x150
Jun 4 12:48:11 server1 kernel: [1622699.573187] ? set_kthread_
Jun 4 12:48:11 server1 kernel: [1622699.573197] ret_from_
Jun 4 12:48:11 server1 kernel: [1622699.573212] </TASK>
Jun 4 12:48:11 server1 kernel: [1622699.573231] INFO: task jbd2/dm-0-8:1375 blocked for more than 120 seconds.
Jun 4 12:48:11 server1 kernel: [1622699.581119] Tainted: G OE 5.15.0-67-generic #74-Ubuntu
Jun 4 12:48:11 server1 kernel: [1622699.589004] "echo 0 > /proc/sys/
Jun 4 12:48:11 server1 kernel: [1622699.597959] task:jbd2/dm-0-8 state:D stack: 0 pid: 1375 ppid: 2 flags:0x00004000
Jun 4 12:48:11 server1 kernel: [1622699.597968] Call Trace:
Jun 4 12:48:11 server1 kernel: [1622699.597970] <TASK>
Jun 4 12:48:11 server1 kernel: [1622699.597973] __schedule+
Jun 4 12:48:11 server1 kernel: [1622699.597984] schedule+0x69/0x110
Jun 4 12:48:11 server1 kernel: [1622699.597992] md_write_
Jun 4 12:48:11 server1 kernel: [1622699.598002] ? wait_woken+
Jun 4 12:48:11 server1 kernel: [1622699.598024] md_write_
Jun 4 12:48:11 server1 kernel: [1622699.598032] raid5_make_
Jun 4 12:48:11 server1 kernel: [1622699.598051] ? wait_woken+
Jun 4 12:48:11 server1 kernel: [1622699.598058] md_handle_
Jun 4 12:48:11 server1 kernel: [1622699.598065] ? __blk_queue_
Jun 4 12:48:11 server1 kernel: [1622699.598075] md_submit_
Jun 4 12:48:11 server1 kernel: [1622699.598082] __submit_
Jun 4 12:48:11 server1 kernel: [1622699.598091] ? mempool_
Jun 4 12:48:11 server1 kernel: [1622699.598102] __submit_
Jun 4 12:48:11 server1 kernel: [1622699.598110] ? kmem_cache_
Jun 4 12:48:11 server1 kernel: [1622699.598122] submit_
Jun 4 12:48:11 server1 kernel: [1622699.598131] submit_
Jun 4 12:48:11 server1 kernel: [1622699.598139] submit_
Jun 4 12:48:11 server1 kernel: [1622699.598151] submit_bh+0x13/0x20
Jun 4 12:48:11 server1 kernel: [1622699.598160] jbd2_journal_
Jun 4 12:48:11 server1 kernel: [1622699.598170] ? __update_
Jun 4 12:48:11 server1 kernel: [1622699.598184] kjournald2+
Jun 4 12:48:11 server1 kernel: [1622699.598190] ? wait_woken+
Jun 4 12:48:11 server1 kernel: [1622699.598197] ? load_superblock
Jun 4 12:48:11 server1 kernel: [1622699.598202] kthread+0x12a/0x150
Jun 4 12:48:11 server1 kernel: [1622699.598210] ? set_kthread_
Jun 4 12:48:11 server1 kernel: [1622699.598218] ret_from_
Jun 4 12:48:11 server1 kernel: [1622699.598229] </TASK>
Following some advice from
https:/
I changed over to cron kicking off the raid check, but we still hit this bug.
These are the logs for the problem happening with the raid check started by cron
Jul 31 21:00:01 server1 kernel: [168797.457900] md: data-check of RAID array md0
...
Jul 31 22:20:46 server1 kernel: [173642.734918] INFO: task kswapd0:190 blocked for more than 120 seconds.
Jul 31 22:20:46 server1 kernel: [173642.742246] Tainted: G O 5.15.0-75-generic #82-Ubuntu
Jul 31 22:20:46 server1 kernel: [173642.750039] "echo 0 > /proc/sys/
Jul 31 22:20:46 server1 kernel: [173642.758894] task:kswapd0 state:D stack: 0 pid: 190 ppid: 2 flags:0x00004000
Jul 31 22:20:46 server1 kernel: [173642.758906] Call Trace:
Jul 31 22:20:46 server1 kernel: [173642.758910] <TASK>
Jul 31 22:20:46 server1 kernel: [173642.758916] __schedule+
Jul 31 22:20:46 server1 kernel: [173642.758935] schedule+0x69/0x110
Jul 31 22:20:46 server1 kernel: [173642.758945] md_write_
Jul 31 22:20:46 server1 kernel: [173642.758957] ? wait_woken+
Jul 31 22:20:46 server1 kernel: [173642.758969] md_write_
Jul 31 22:20:46 server1 kernel: [173642.758980] raid5_make_
Jul 31 22:20:46 server1 kernel: [173642.759004] ? wait_woken+
Jul 31 22:20:46 server1 kernel: [173642.759012] md_handle_
Jul 31 22:20:46 server1 kernel: [173642.759020] ? page_index+
Jul 31 22:20:46 server1 kernel: [173642.759032] ? __blk_queue_
Jul 31 22:20:46 server1 kernel: [173642.759043] ? page_index+
Jul 31 22:20:46 server1 kernel: [173642.759052] md_submit_
Jul 31 22:20:46 server1 kernel: [173642.759060] __submit_
Jul 31 22:20:46 server1 kernel: [173642.759072] ? __mod_memcg_
Jul 31 22:20:46 server1 kernel: [173642.759082] __submit_
Jul 31 22:20:46 server1 kernel: [173642.759093] submit_
Jul 31 22:20:46 server1 kernel: [173642.759102] ? radix_tree_
Jul 31 22:20:46 server1 kernel: [173642.759116] ? bio_associate_
Jul 31 22:20:46 server1 kernel: [173642.759124] ? page_index+
Jul 31 22:20:46 server1 kernel: [173642.759133] ? bio_add_
Jul 31 22:20:46 server1 kernel: [173642.759139] submit_
Jul 31 22:20:46 server1 kernel: [173642.759149] __swap_
Jul 31 22:20:46 server1 kernel: [173642.759160] ? __frontswap_
Jul 31 22:20:46 server1 kernel: [173642.759167] swap_writepage+
Jul 31 22:20:46 server1 kernel: [173642.759177] pageout+0x100/0x330
Jul 31 22:20:46 server1 kernel: [173642.759189] shrink_
Jul 31 22:20:46 server1 kernel: [173642.759202] shrink_
Jul 31 22:20:46 server1 kernel: [173642.759211] ? super_cache_
Jul 31 22:20:46 server1 kernel: [173642.759222] shrink_
Jul 31 22:20:46 server1 kernel: [173642.759233] ? shrink_
Jul 31 22:20:46 server1 kernel: [173642.759242] ? shrink_
Jul 31 22:20:46 server1 kernel: [173642.759251] shrink_
Jul 31 22:20:46 server1 kernel: [173642.759262] shrink_
Jul 31 22:20:46 server1 kernel: [173642.759272] balance_
Jul 31 22:20:46 server1 kernel: [173642.759281] ? finish_
Jul 31 22:20:46 server1 kernel: [173642.759294] kswapd+0x10c/0x1c0
Jul 31 22:20:46 server1 kernel: [173642.759303] ? balance_
Jul 31 22:20:46 server1 kernel: [173642.759313] kthread+0x12a/0x150
Jul 31 22:20:46 server1 kernel: [173642.759322] ? set_kthread_
Jul 31 22:20:46 server1 kernel: [173642.759332] ret_from_
Jul 31 22:20:46 server1 kernel: [173642.759344] </TASK>
Jul 31 22:20:46 server1 kernel: [173642.759356] INFO: task md0_raid5:409 blocked for more than 120 seconds.
...
In the same second nine more of these kinds of delays then silence.
I have seen the problem happen before the systemd mdcheck pause, and also shortly after the pause was supposed to have happened, as with below:
Aug 6 18:19:42 server2 systemd[1]: Starting MD array scrubbing...
Aug 6 18:19:42 server2 root: mdcheck start checking /dev/md0
Aug 6 18:19:42 server2 kernel: [2619392.036055] md: data-check of RAID array md0
...
Aug 7 00:19:44 server2 kernel: [2640994.200542] md: md0: data-check interrupted.
... (minutes later the problem happens after the check is supposedly interrupted)
Aug 7 00:23:03 server2: [2641193.425737] INFO: task jbd2/dm-0-8:1904 blocked for more than 120 seconds.
Aug 7 00:23:03 server2 kernel: [2641193.433644] Tainted: G OE 5.15.0-75-generic #82-Ubuntu
Aug 7 00:23:03 server2 kernel: [2641193.441533] "echo 0 > /proc/sys/
Aug 7 00:23:03 server2 kernel: [2641193.450490] task:jbd2/dm-0-8 state:D stack: 0 pid: 1904 ppid: 2 flags:0x00004000
Aug 7 00:23:03 server2 kernel: [2641193.450503] Call Trace:
Aug 7 00:23:03 server2 kernel: [2641193.450507] <TASK>
Aug 7 00:23:03 server2 kernel: [2641193.450513] __schedule+
Aug 7 00:23:03 server2 kernel: [2641193.450531] schedule+0x69/0x110
Aug 7 00:23:03 server2 kernel: [2641193.450541] md_write_
Aug 7 00:23:03 server2 kernel: [2641193.450553] ? wait_woken+
Aug 7 00:23:03 server2 kernel: [2641193.450575] md_write_
Aug 7 00:23:03 server2 kernel: [2641193.450585] raid5_make_
Aug 7 00:23:03 server2 kernel: [2641193.450609] ? wait_woken+
Aug 7 00:23:03 server2 kernel: [2641193.450616] md_handle_
Aug 7 00:23:03 server2 kernel: [2641193.450625] ? __blk_queue_
Aug 7 00:23:03 server2 kernel: [2641193.450636] md_submit_
Aug 7 00:23:03 server2 kernel: [2641193.450643] __submit_
Aug 7 00:23:03 server2 kernel: [2641193.450654] ? mempool_
Aug 7 00:23:03 server2 kernel: [2641193.450665] __submit_
Aug 7 00:23:03 server2 kernel: [2641193.450674] ? kmem_cache_
Aug 7 00:23:03 server2 kernel: [2641193.450687] submit_
Aug 7 00:23:03 server2 kernel: [2641193.450698] submit_
Aug 7 00:23:03 server2 kernel: [2641193.450707] submit_
Aug 7 00:23:03 server2 kernel: [2641193.450713] submit_bh+0x13/0x20
Aug 7 00:23:03 server2 kernel: [2641193.450718] jbd2_journal_
Aug 7 00:23:03 server2 kernel: [2641193.450729] ? __update_
Aug 7 00:23:03 server2 kernel: [2641193.450744] kjournald2+
Aug 7 00:23:03 server2 kernel: [2641193.450750] ? wait_woken+
Aug 7 00:23:03 server2 kernel: [2641193.450757] ? load_superblock
Aug 7 00:23:03 server2 kernel: [2641193.450764] kthread+0x12a/0x150
Aug 7 00:23:03 server2 kernel: [2641193.450774] ? set_kthread_
Aug 7 00:23:03 server2 kernel: [2641193.450783] ret_from_
Aug 7 00:23:03 server2 kernel: [2641193.450796] </TASK>
summary: |
- mdcheck/checkarray lock up system + mdcheck/checkarray locks up system |
tags: | added: mdadm |
description: | updated |