Comment 46 for bug 1942935

Brian Corriveau (bcorrive) wrote:

Kernel 5.15.0-75 on Ubuntu 22.04, server1 and server2

Just hit this bug again on a different server that still uses systemd to start the RAID check.

The logs indicate there may be more than one RAID check bug out there:
 - a bug with pausing the RAID check
 - a bug with the RAID check when it hasn't been paused

In the case of my server2, the RAID check was started, then 6 hours later it was interrupted by systemd as expected, but about 20 minutes later the RAID locked up. Pause bug?
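For context, my understanding (an assumption on my part, not taken from the packaged mdcheck script itself) is that the systemd pause boils down to a few sysfs writes on the array, roughly:

echo check > /sys/block/md0/md/sync_action     # mdcheck_start kicks off the data-check
cat /sys/block/md0/md/sync_completed           # note how far the check got
echo idle > /sys/block/md0/md/sync_action      # after the duration expires, interrupt it ("data-check interrupted.")
# later, mdcheck_continue resumes from the saved offset (SAVED_OFFSET is a placeholder, not a real value)
echo "$SAVED_OFFSET" > /sys/block/md0/md/sync_min
echo check > /sys/block/md0/md/sync_action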

Aug 6 18:19:42 server2 systemd[1]: Starting MD array scrubbing...
Aug 6 18:19:42 server2 root: mdcheck start checking /dev/md0
Aug 6 18:19:42 server2 kernel: [2619392.036055] md: data-check of RAID array md0
...
Aug 7 00:19:44 server2 kernel: [2640994.200542] md: md0: data-check interrupted.
... (minutes later the problem happens after the check is supposedly interrupted)

Aug 7 00:23:03 server2: [2641193.425737] INFO: task jbd2/dm-0-8:1904 blocked for more than 120 seconds.
Aug 7 00:23:03 server2 kernel: [2641193.433644] Tainted: G OE 5.15.0-75-generic #82-Ubuntu
Aug 7 00:23:03 server2 kernel: [2641193.441533] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 7 00:23:03 server2 kernel: [2641193.450490] task:jbd2/dm-0-8 state:D stack: 0 pid: 1904 ppid: 2 flags:0x00004000
Aug 7 00:23:03 server2 kernel: [2641193.450503] Call Trace:
Aug 7 00:23:03 server2 kernel: [2641193.450507] <TASK>
Aug 7 00:23:03 server2 kernel: [2641193.450513] __schedule+0x24e/0x590
Aug 7 00:23:03 server2 kernel: [2641193.450531] schedule+0x69/0x110
Aug 7 00:23:03 server2 kernel: [2641193.450541] md_write_start.part.0+0x174/0x220
Aug 7 00:23:03 server2 kernel: [2641193.450553] ? wait_woken+0x70/0x70
Aug 7 00:23:03 server2 kernel: [2641193.450575] md_write_start+0x14/0x30
Aug 7 00:23:03 server2 kernel: [2641193.450585] raid5_make_request+0x77/0x540 [raid456]
Aug 7 00:23:03 server2 kernel: [2641193.450609] ? wait_woken+0x70/0x70
Aug 7 00:23:03 server2 kernel: [2641193.450616] md_handle_request+0x12d/0x1b0
Aug 7 00:23:03 server2 kernel: [2641193.450625] ? __blk_queue_split+0xfe/0x210
Aug 7 00:23:03 server2 kernel: [2641193.450636] md_submit_bio+0x76/0xc0
Aug 7 00:23:03 server2 kernel: [2641193.450643] __submit_bio+0x1a5/0x220
Aug 7 00:23:03 server2 kernel: [2641193.450654] ? mempool_alloc_slab+0x17/0x20
Aug 7 00:23:03 server2 kernel: [2641193.450665] __submit_bio_noacct+0x85/0x200
Aug 7 00:23:03 server2 kernel: [2641193.450674] ? kmem_cache_alloc+0x1ab/0x2f0
Aug 7 00:23:03 server2 kernel: [2641193.450687] submit_bio_noacct+0x4e/0x120
Aug 7 00:23:03 server2 kernel: [2641193.450698] submit_bio+0x4a/0x130
Aug 7 00:23:03 server2 kernel: [2641193.450707] submit_bh_wbc+0x18d/0x1c0
Aug 7 00:23:03 server2 kernel: [2641193.450713] submit_bh+0x13/0x20
Aug 7 00:23:03 server2 kernel: [2641193.450718] jbd2_journal_commit_transaction+0x861/0x17a0
Aug 7 00:23:03 server2 kernel: [2641193.450729] ? __update_idle_core+0x93/0x120
Aug 7 00:23:03 server2 kernel: [2641193.450744] kjournald2+0xa9/0x280
Aug 7 00:23:03 server2 kernel: [2641193.450750] ? wait_woken+0x70/0x70
Aug 7 00:23:03 server2 kernel: [2641193.450757] ? load_superblock.part.0+0xc0/0xc0
Aug 7 00:23:03 server2 kernel: [2641193.450764] kthread+0x12a/0x150
Aug 7 00:23:03 server2 kernel: [2641193.450774] ? set_kthread_struct+0x50/0x50
Aug 7 00:23:03 server2 kernel: [2641193.450783] ret_from_fork+0x22/0x30
Aug 7 00:23:03 server2 kernel: [2641193.450796] </TASK>
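If this happens again I will try to capture the array state while the box is wedged; a rough checklist (only the md0 device name comes from the logs above):

cat /proc/mdstat
cat /sys/block/md0/md/sync_action              # should read "idle" after the interrupt
cat /sys/block/md0/md/array_state              # e.g. active, write-pending, suspended
cat /proc/"$(pgrep -x md0_raid5)"/stack        # where the md0_raid5 thread is blocked
dmesg | tail -n 100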

In my comment #41 the RAID check started and ran into a problem BEFORE the pause:
(started by systemd)

Jun 4 09:04:25 server1 kernel: [1609273.391583] md: data-check of RAID array md0

and the problem started here, BEFORE the check had run for 6 hours, which is when it would have been paused:

Jun 4 12:48:11 server1 kernel: [1622699.548591] INFO: task md0_raid5:406 blocked for more than 120 seconds.
Jun 4 12:48:11 server1 kernel: [1622699.556202] Tainted: G OE 5.15.0-67-generic #74-Ubuntu
Jun 4 12:48:11 server1 kernel: [1622699.564101] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 4 12:48:11 server1 kernel: [1622699.573063] task:md0_raid5 state:D stack: 0 pid: 406 ppid: 2 flags:0x00004000
Jun 4 12:48:11 server1 kernel: [1622699.573077] Call Trace:
Jun 4 12:48:11 server1 kernel: [1622699.573081] <TASK>
Jun 4 12:48:11 server1 kernel: [1622699.573087] __schedule+0x24e/0x590
Jun 4 12:48:11 server1 kernel: [1622699.573103] schedule+0x69/0x110
Jun 4 12:48:11 server1 kernel: [1622699.573115] raid5d+0x3d9/0x5f0 [raid456]
Jun 4 12:48:11 server1 kernel: [1622699.573140] ? wait_woken+0x70/0x70
Jun 4 12:48:11 server1 kernel: [1622699.573151] md_thread+0xad/0x170

My comment #44 for server1 was for a RAID check started by cron, which of course involves no pause.

With server2 it looks like the pause left some lock held on the RAID, which caused the server to become unreachable.

With server1 the problem has now happened twice, once with the systemd-started check and once with the cron-started check. Both times no pause was involved.

Both servers have RAID5 arrays of about 30 TB, so they are relatively large.
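If it helps, this is roughly how I would collect the array details to attach (md0 as in the logs above; output not included here):

uname -r
cat /proc/mdstat
mdadm --detail /dev/md0
cat /sys/block/md0/md/sync_speed_min /sys/block/md0/md/sync_speed_max
cat /sys/block/md0/md/stripe_cache_size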