I stepped back from the bisect/debugging and looked at the higher-level stats. The stress-ng test is started with one process per core, and there are 96 cores. I looked at top[3] during a hang, and many of the stress-ng processes are in the running state ('R'). However, sysrq-w[2] also shows many stress-ng processes in uninterruptible sleep ('D'). What also sticks out to me is that all the stress-ng processes are running as root with a priority of 20.

Looking back at one of the call traces[1], I see jbd2 stuck in an uninterruptible state:

...
[ 4461.908213] task:journal-offline state:D stack: 0 pid:17541 ppid: 1 flags:0x00000226
...

The jbd2 kernel thread is also running with a priority of 20[4]. When the hang happens, jbd2 is also stuck in an uninterruptible state (as is systemd-journal):

...
 1521 root  20   0      0      0      0 D  0.0  0.0  4:10.48 jbd2/sda2-8
 1593 root  19  -1  64692  15832  14512 D  0.0  0.1  0:01.54 systemd-journal
...

I am pinning all of the stress-ng threads to cores 1-95 and the kernel threads to a housekeeping CPU, 0. Output from cmdline:

"BOOT_IMAGE=/boot/vmlinuz-5.15.0-1033-realtime root=UUID=3583d8c4-d539-439f-9d50-4341675268cc ro console=tty0 console=ttyS0,115200 skew_tick=1 isolcpus=managed_irq,domain,1-95 intel_pstate=disable nosoftlockup tsc=nowatchdog crashkernel=0M-2G:128M,2G-6G:256M,6G-8G:512M,8G-:768M"

However, even with this pinning, stress-ng ends up running on CPU 0, per the ps output[4]. This appears to cause a deadlock between jbd2 and the stress-ng processes, since they share the same priority/niceness.

To confirm this idea, I started test-storage / stress-ng so that they had a lower priority than jbd2. I used the following:

    sudo nice -10 test-storage

With this, jbd2 continues to run with a priority of 20, but all the stress-ng threads run with a priority of 30 (niceness 10):

PSR    TID    PID COMMAND      %CPU PRI  NI
  0   1517   1517 jbd2/sda2-8   5.0  20   0
  0 125875 125875 stress-ng    15.5  30  10
  0 125882 125882 stress-ng     4.4  30  10
  0 125925 125925 stress-ng     4.4  30  10
...
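
For anyone wanting to repeat the check, a listing like the one above can be narrowed to a single CPU with a small filter. This is only a sketch: the exact ps invocation behind [4] is not recorded in this comment, so the field list in list_threads is an assumption chosen to match the column headers, and both function names are hypothetical.

```shell
#!/bin/sh

# Hypothetical helper: list every thread with the CPU it last ran on (PSR),
# its kernel priority and its niceness. The field list is an assumption.
list_threads() {
    ps -eLo psr,tid,pid,comm,pcpu,priority,ni
}

# Hypothetical helper: read ps-style output on stdin and keep the header
# line plus the rows whose first column (PSR) equals the given CPU number.
threads_on_cpu() {
    cpu="$1"
    awk -v cpu="$cpu" 'NR == 1 || $1 == cpu'
}
```

Running `list_threads | threads_on_cpu 0` should print the header plus every thread whose PSR is 0, which shows whether stress-ng and jbd2 are sharing the housekeeping CPU and at what priority.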

With 'nice -10' added, the test completes without hanging. It appears the hang was the system waiting for I/O to complete, which would never happen: the jbd2 threads cannot preempt stress-ng, and that causes a deadlock.

Michael, could you also try running with the following command to confirm the results:

    sudo nice -10 test-storage

If this resolves the bug, there are several options:

1. Run the cert suite with a nice value for real-time tests.
2. Change the tests so they do not run as root.
3. Tune the real-time system so stress-ng threads are pinned to isolated cores and kernel threads are on a housekeeping-only core.

I'm going to investigate option 3. I am assigning cores 1-95 as the isolated cores, so stress-ng should not run on core 0, but it does. I'm going to figure out why this is happening.

[0] https://launchpadlibrarian.net/653810449/locking_issue.txt
[1] https://launchpadlibrarian.net/653810490/call_trace.txt
[2] https://launchpadlibrarian.net/655372944/sysrq-w.txt
[3] https://launchpadlibrarian.net/655374168/top-during-hang.txt
[4] https://launchpadlibrarian.net/655380123/ps-test-running.txt