Kernel 5.15.0-75 on Ubuntu 22.04, server1 and server2
Just hit this bug on a different server that is still using systemd to start the RAID check.
The logs indicate there may be more than one RAID check bug out there:
- a bug with pausing the raid check
- a bug with the raid check when it hasn't been paused
In the case of my server2 the raid check was started, then 6 hours later it was interrupted as expected by systemd, but about 20 minutes later the RAID locked up. Pause bug?
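For reference, this is how the systemd side drives the start and the 6-hour interruption on my machines; a quick way to inspect it (unit names inferred from the "MD array scrubbing" message, so verify locally):

systemctl list-timers 'mdcheck*'       # when the start/continue timers fire
systemctl cat mdcheck_start.service    # shows the mdcheck invocation and the 6 hour duration it passes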
Aug 6 18:19:42 server2 systemd[1]: Starting MD array scrubbing...
Aug 6 18:19:42 server2 root: mdcheck start checking /dev/md0
Aug 6 18:19:42 server2 kernel: [2619392.036055] md: data-check of RAID array md0
...
Aug 7 00:19:44 server2 kernel: [2640994.200542] md: md0: data-check interrupted.
... (minutes later the problem happens after the check is supposedly interrupted)
Aug 7 00:23:03 server2 kernel: [2641193.425737] INFO: task jbd2/dm-0-8:1904 blocked for more than 120 seconds.
Aug 7 00:23:03 server2 kernel: [2641193.433644] Tainted: G OE 5.15.0-75-generic #82-Ubuntu
Aug 7 00:23:03 server2 kernel: [2641193.441533] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 7 00:23:03 server2 kernel: [2641193.450490] task:jbd2/dm-0-8 state:D stack: 0 pid: 1904 ppid: 2 flags:0x00004000
Aug 7 00:23:03 server2 kernel: [2641193.450503] Call Trace:
Aug 7 00:23:03 server2 kernel: [2641193.450507] <TASK>
Aug 7 00:23:03 server2 kernel: [2641193.450513] __schedule+0x24e/0x590
Aug 7 00:23:03 server2 kernel: [2641193.450531] schedule+0x69/0x110
Aug 7 00:23:03 server2 kernel: [2641193.450541] md_write_start.part.0+0x174/0x220
Aug 7 00:23:03 server2 kernel: [2641193.450553] ? wait_woken+0x70/0x70
Aug 7 00:23:03 server2 kernel: [2641193.450575] md_write_start+0x14/0x30
Aug 7 00:23:03 server2 kernel: [2641193.450585] raid5_make_request+0x77/0x540 [raid456]
Aug 7 00:23:03 server2 kernel: [2641193.450609] ? wait_woken+0x70/0x70
Aug 7 00:23:03 server2 kernel: [2641193.450616] md_handle_request+0x12d/0x1b0
Aug 7 00:23:03 server2 kernel: [2641193.450625] ? __blk_queue_split+0xfe/0x210
Aug 7 00:23:03 server2 kernel: [2641193.450636] md_submit_bio+0x76/0xc0
Aug 7 00:23:03 server2 kernel: [2641193.450643] __submit_bio+0x1a5/0x220
Aug 7 00:23:03 server2 kernel: [2641193.450654] ? mempool_alloc_slab+0x17/0x20
Aug 7 00:23:03 server2 kernel: [2641193.450665] __submit_bio_noacct+0x85/0x200
Aug 7 00:23:03 server2 kernel: [2641193.450674] ? kmem_cache_alloc+0x1ab/0x2f0
Aug 7 00:23:03 server2 kernel: [2641193.450687] submit_bio_noacct+0x4e/0x120
Aug 7 00:23:03 server2 kernel: [2641193.450698] submit_bio+0x4a/0x130
Aug 7 00:23:03 server2 kernel: [2641193.450707] submit_bh_wbc+0x18d/0x1c0
Aug 7 00:23:03 server2 kernel: [2641193.450713] submit_bh+0x13/0x20
Aug 7 00:23:03 server2 kernel: [2641193.450718] jbd2_journal_commit_transaction+0x861/0x17a0
Aug 7 00:23:03 server2 kernel: [2641193.450729] ? __update_idle_core+0x93/0x120
Aug 7 00:23:03 server2 kernel: [2641193.450744] kjournald2+0xa9/0x280
Aug 7 00:23:03 server2 kernel: [2641193.450750] ? wait_woken+0x70/0x70
Aug 7 00:23:03 server2 kernel: [2641193.450757] ? load_superblock.part.0+0xc0/0xc0
Aug 7 00:23:03 server2 kernel: [2641193.450764] kthread+0x12a/0x150
Aug 7 00:23:03 server2 kernel: [2641193.450774] ? set_kthread_struct+0x50/0x50
Aug 7 00:23:03 server2 kernel: [2641193.450783] ret_from_fork+0x22/0x30
Aug 7 00:23:03 server2 kernel: [2641193.450796] </TASK>
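In case anyone wants to poke at the pause path by hand: the check state goes through the md sysfs interface, so the start/interrupt sequence can be reproduced manually. A minimal sketch, assuming the array is md0:

cat /sys/block/md0/md/sync_action            # "check" while scrubbing, "idle" once interrupted
cat /proc/mdstat                             # progress of the running data-check
echo check > /sys/block/md0/md/sync_action   # start a data-check by hand
echo idle > /sys/block/md0/md/sync_action    # interrupt it, which as far as I can tell is what mdcheck does at the 6 hour mark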
In my comment #41 the raid check started and ran into a problem BEFORE the pause:
(started by systemd)
Jun 4 09:04:25 server1 kernel: [1609273.391583] md: data-check of RAID array md0
and the problem started here, BEFORE the check had run for 6 hours, which is when it would have been paused:
Jun 4 12:48:11 server1 kernel: [1622699.548591] INFO: task md0_raid5:406 blocked for more than 120 seconds.
Jun 4 12:48:11 server1 kernel: [1622699.556202] Tainted: G OE 5.15.0-67-generic #74-Ubuntu
Jun 4 12:48:11 server1 kernel: [1622699.564101] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 4 12:48:11 server1 kernel: [1622699.573063] task:md0_raid5 state:D stack: 0 pid: 406 ppid: 2 flags:0x00004000
Jun 4 12:48:11 server1 kernel: [1622699.573077] Call Trace:
Jun 4 12:48:11 server1 kernel: [1622699.573081] <TASK>
Jun 4 12:48:11 server1 kernel: [1622699.573087] __schedule+0x24e/0x590
Jun 4 12:48:11 server1 kernel: [1622699.573103] schedule+0x69/0x110
Jun 4 12:48:11 server1 kernel: [1622699.573115] raid5d+0x3d9/0x5f0 [raid456]
Jun 4 12:48:11 server1 kernel: [1622699.573140] ? wait_woken+0x70/0x70
Jun 4 12:48:11 server1 kernel: [1622699.573151] md_thread+0xad/0x170
My comment #44 for server1 was for the raid check started by cron, which of course has no pause.
With server2 it looks like the pause left some lock held on the RAID, which caused the server to become unreachable.
With server1 the problem has happened twice now, once with systemd, and once with cron. Both times no pause was involved.
Both servers have RAID5 arrays of about 30 TB, so they are relatively large.
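For completeness, the cron variant I mentioned is the stock Ubuntu/Debian one; roughly this (paths assumed from the mdadm package, double-check on your own install):

cat /etc/cron.d/mdadm    # the monthly entry; it calls checkarray, which starts one full check with no pause/resume
cat /proc/mdstat         # same progress view while the cron-started check runs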