cpuhotplug04 in cpuhotplug from ubuntu_ltp cause test hang on J-5.15 P9
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
The Ubuntu-power-systems project |
In Progress
|
Medium
|
Unassigned | ||
ubuntu-kernel-tests |
In Progress
|
Undecided
|
Unassigned |
Bug Description
Issue found on Jammy 5.15.0-40.43 Power 9 node "baltar"
The cpuhotplug04 test will hang with:
Test output:
<<<test_start>>>
tag=cpuhotplug04 stime=1655706629
cmdline=
contacts=""
analysis=exit
<<<test_output>>>
incrementing stop
Name: cpuhotplug04
Date: Mon Jun 20 06:30:29 UTC 2022
Desc: Does it prevent us from offlining the last CPU?
dmesg output:
[ 342.099325] LTP: starting cpuhotplug04 (cpuhotplug04.sh -l 1)
[ 343.109144] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #20!!!
[ 343.336137] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #20!!!
[ 343.571276] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #20!!!
[ 343.787207] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #20!!!
[ 344.019402] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #20!!!
[ 344.255955] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #20!!!
[ 344.495940] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #20!!!
[ 344.659877] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #20!!!
[ 344.927993] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #20!!!
[ 345.124037] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #20!!!
[ 346.583794] EEH: Recovering PHB#3-PE#1fd
[ 346.583840] EEH: PE location: Onboard SAS, PHB location: N/A
[ 346.583885] EEH: Frozen PHB#3-PE#1fd detected
[ 346.583927] EEH: Call Trace:
[ 346.583950] EEH: [c0000000000554ec] __eeh_send_
[ 346.583989] EEH: [c00000000004e0b4] eeh_dev_
[ 346.584015] EEH: [c000000000aba988] __pci_write_
[ 346.584048] EEH: [c000000000245308] msi_domain_
[ 346.584081] EEH: [c0000000002354cc] irq_do_
[ 346.584104] EEH: [c0000000002446a4] irq_migrate_
[ 346.584138] EEH: [c0000000000d79a4] xive_smp_
[ 346.584160] EEH: [c0000000000eaaa8] pnv_smp_
[ 346.584192] EEH: [c0000000000644f4] __cpu_disable+
[ 346.584213] EEH: [c0000000001788cc] take_cpu_
[ 346.584236] EEH: [c0000000002db254] multi_cpu_
[ 346.584273] EEH: [c0000000002daa84] cpu_stopper_
[ 346.584298] EEH: [c0000000001bc2e0] smpboot_
[ 346.584340] EEH: [c0000000001b5774] kthread+0x184/0x190
[ 346.584378] EEH: [c00000000000cf64] ret_from_
[ 346.584417] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[ 346.584420] EEH: Notify device drivers to shutdown
[ 346.584423] EEH: Beginning: 'error_detected(IO frozen)'
[ 346.584427] PCI 0003:01:00.0#01fd: EEH: Invoking aacraid-
[ 346.584434] aacraid 0003:01:00.0: aacraid: PCI error detected 2
[ 346.622048] sd 0:2:0:0: [sda] tag#764 FAILED Result: hostbyte=
[ 346.622067] sd 0:2:0:0: [sda] tag#764 CDB: Write(10) 2a 00 95 69 92 80 00 00 10 00
[ 346.622075] blk_update_request: I/O error, dev sda, sector 2506723968 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[ 346.622123] EXT4-fs warning (device sda2): ext4_end_bio:342: I/O error 10 writing to inode 38276474 starting block 313340498)
[ 346.622134] Buffer I/O error on device sda2, logical block 313338192
[ 346.622167] Buffer I/O error on device sda2, logical block 313338193
[ 346.622213] sd 0:2:0:0: [sda] tag#765 FAILED Result: hostbyte=
[ 346.622220] sd 0:2:0:0: [sda] tag#765 CDB: Write(10) 2a 00 95 69 cc e8 00 00 08 00
[ 346.622226] blk_update_request: I/O error, dev sda, sector 2506738920 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[ 346.622278] EXT4-fs warning (device sda2): ext4_end_bio:342: I/O error 10 writing to inode 38276475 starting block 313342366)
[ 346.622289] Buffer I/O error on device sda2, logical block 313340061
[ 346.622329] sd 0:2:0:0: [sda] tag#766 FAILED Result: hostbyte=
[ 346.622336] sd 0:2:0:0: [sda] tag#766 CDB: Write(10) 2a 00 6e 80 49 00 00 00 08 00
[ 346.622341] blk_update_request: I/O error, dev sda, sector 1853901056 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
[ 346.622415] Buffer I/O error on dev sda2, logical block 231735328, lost async page write
[ 346.622461] EXT4-fs error (device sda2): ext4_check_
[ 346.622591] sd 0:2:0:0: [sda] tag#513 FAILED Result: hostbyte=
[ 346.622602] sd 0:2:0:0: [sda] tag#513 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[ 346.622613] blk_update_request: I/O error, dev sda, sector 1950670200 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
[ 346.622694] Aborting journal on device sda2-8.
[ 346.623901] PCI 0003:01:00.0#01fd: EEH: aacraid driver reports: 'need reset'
[ 346.623910] EEH: Finished:
[ 346.623919] EEH: Collect temporary log
[ 346.623947] EEH: of node=0003:01:00.0
[ 346.623954] EEH: PCI device/vendor: 028d9005
[ 346.623960] EEH: PCI cmd/status register: 00100146
[ 346.623964] EEH: PCI-E capabilities and status follow:
[ 346.623979] EEH: PCI-E 00: 00020010 000081a2 00002950 00437083
[ 346.623991] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
[ 346.623995] EEH: PCI-E 20: 00000000
[ 346.623998] EEH: PCI-E AER capability register set follows:
[ 346.624013] EEH: PCI-E AER 00: 30020001 00000000 00400000 00462030
[ 346.624025] EEH: PCI-E AER 10: 00000000 0000e000 000001e0 00000000
[ 346.624037] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 346.624045] EEH: PCI-E AER 30: 00000000 00000000
[ 346.624048] PHB4 PHB#3 Diag-data (Version: 1)
[ 346.624052] brdgCtl: 00000002
[ 346.624055] RootSts: 00060040 00402000 a0830008 00100107 00000800
[ 346.624061] PhbSts: 0000001c00000000 0000001c00000000
[ 346.624065] Lem: 0000000100000080 0000000000000000 0000000000000080
[ 346.624069] PhbErr: 0000028000000000 0000020000000000 2148000098000240 a008400000000000
[ 346.624075] RxeTceErr: 2000000000000000 2000000000000000 40000000000001fd 0000000000000000
[ 346.624081] PblErr: 0000000000020000 0000000000020000 0000000000000000 0000000000000000
[ 346.624085] RegbErr: 0000004000000000 0000004000000000 8800003c00000000 0000000007011000
[ 346.624094] PE[1fd] A/B: 8000b03800000000 8000000000000000
[ 346.624098] EEH: Reset without hotplug activity
[ 346.625998] EXT4-fs error (device sda2): ext4_journal_
[ 346.626020] EXT4-fs error (device sda2): ext4_journal_
[ 349.134491] EXT4-fs error (device sda2): ext4_journal_
[ 359.548470] EEH: Beginning: 'slot_reset'
[ 359.548480] PCI 0003:01:00.0#01fd: EEH: Invoking aacraid-
[ 359.548486] aacraid 0003:01:00.0: aacraid: PCI error - slot reset
[ 359.615158] PCI 0003:01:00.0#01fd: EEH: aacraid driver reports: 'recovered'
[ 359.615162] EEH: Finished:
[ 359.615163] EEH: Notify device driver to resume
[ 359.615164] EEH: Beginning: 'resume'
[ 359.615172] PCI 0003:01:00.0#01fd: EEH: Invoking aacraid->resume()
[ 474.477185] EXT4-fs error (device sda2): ext4_journal_
[ 484.561619] INFO: task kworker/17:1:1020 blocked for more than 120 seconds.
[ 484.561643] Not tainted 5.15.0-40-generic #43-Ubuntu
[ 484.561664] "echo 0 > /proc/sys/
[ 484.561679] task:kworker/17:1 state:D stack: 0 pid: 1020 ppid: 2 flags:0x00000800
[ 484.561686] Workqueue: events cpuset_
[ 484.561694] Call Trace:
[ 484.561695] [c0000000198c3350] [c0000000198c3420] 0xc0000000198c3420 (unreliable)
[ 484.561703] [c0000000198c3540] [c00000000001ddf4] __switch_
[ 484.561711] [c0000000198c35a0] [c0000000012a8c38] __schedule+
[ 484.561718] [c0000000198c3660] [c0000000012a916c] schedule+0x6c/0x130
[ 484.561724] [c0000000198c3690] [c000000000220250] rwsem_down_
[ 484.561731] [c0000000198c3770] [c000000000535084] mpol_rebind_
[ 484.561737] [c0000000198c37b0] [c0000000002cfb7c] update_
[ 484.561743] [c0000000198c3870] [c0000000002d4df8] cpuset_
[ 484.561749] [c0000000198c3900] [c0000000002d5114] cpuset_
[ 484.561754] [c0000000198c3c70] [c0000000001a8634] process_
[ 484.561760] [c0000000198c3d10] [c0000000001a8f08] worker_
[ 484.561765] [c0000000198c3da0] [c0000000001b5774] kthread+0x184/0x190
[ 484.561772] [c0000000198c3e10] [c00000000000cf64] ret_from_
[ 484.561792] INFO: task jbd2/sda2-8:1767 blocked for more than 120 seconds.
[ 484.561816] Not tainted 5.15.0-40-generic #43-Ubuntu
[ 484.561837] "echo 0 > /proc/sys/
[ 484.561860] task:jbd2/sda2-8 state:D stack: 0 pid: 1767 ppid: 2 flags:0x00000800
[ 484.561866] Call Trace:
[ 484.561867] [c00000005640b540] [c00000005630ad00] 0xc00000005630ad00 (unreliable)
[ 484.561873] [c00000005640b730] [c00000000001ddf4] __switch_
[ 484.561879] [c00000005640b790] [c0000000012a8c38] __schedule+
[ 484.561886] [c00000005640b850] [c0000000012a916c] schedule+0x6c/0x130
[ 484.561891] [c00000005640b880] [c0000000012a92b4] io_schedule+
[ 484.561897] [c00000005640b8b0] [c0000000012a9ec8] bit_wait_
[ 484.561903] [c00000005640b8e0] [c0000000012a9898] __wait_
[ 484.561909] [c00000005640b930] [c0000000012a9a00] out_of_
[ 484.561916] [c00000005640b9a0] [c000000000619c90] __wait_
[ 484.561923] [c00000005640b9d0] [c0000000007a2ac8] jbd2_write_
[ 484.561930] [c00000005640ba80] [c0000000007a2894] jbd2_journal_
[ 484.561936] [c00000005640bb00] [c000000000795d14] jbd2_journal_
[ 484.561943] [c00000005640bd00] [c00000000079f36c] kjournald2+
[ 484.561948] [c00000005640bda0] [c0000000001b5774] kthread+0x184/0x190
[ 484.561955] [c00000005640be10] [c00000000000cf64] ret_from_
Past test result:
* This test was not executed (or not successfully executed on this node with 5.15.0-28.29 ~ 5.15.0-36.37)
I can reproduce it manually with 5.15.0-39
Steps to reproduce:
sudo apt install -y automake bison build-essential byacc flex git keyutils libacl1-dev libaio-dev libcap-dev libmm-dev libnuma-dev libsctp-dev libselinux1-dev libssl-dev libtirpc-dev pkg-config quota xfslibs-dev xfsprogs gcc
git clone --depth=1 https:/
cd ltp
make autotools
./configure
sudo make install
echo "cpuhotplug04 cpuhotplug04.sh -l 1" > /tmp/target
sudo /opt/ltp/runltp -f /tmp/target
Note that this issue does not exist on Impish 5.13.0 with this P9 node.
tags: | added: 5.15 jammy ppc64el sru-20220530 ubuntu-ltp |
description: | updated |
description: | updated |
tags: | added: sru-20220620 |
tags: | added: architecture-ppc64le bugnameltc-198904 severity-medium targetmilestone-inin--- |
tags: | added: sru-20230612 |
Changed in ubuntu-kernel-tests: | |
status: | New → In Progress |
Changed in ubuntu-power-systems: | |
status: | New → In Progress |
importance: | Undecided → Medium |
tags: | added: 6.2 lunar |
I can see this issue with 5.19-rc3 as well:
[ 199.281238] LTP: starting cpuhotplug04 (cpuhotplug04.sh -l 1) failure_ event+0x6c/ 0x120 check_failure. part.0+ 0x254/0x650 msi_msg+ 0x2a8/0x350 domain_ write_msg+ 0x38/0x50 set_affinity+ 0xe8/0x100 set_affinity+ 0x1e4/0x2a0 all_off_ this_cpu+ 0x324/0x390 disable_ cpu+0x34/ 0xd0 cpu_disable+ 0x88/0x120 0x54/0xb0 down+0x4c/ 0x110 stop+0xe4/ 0x230 thread+ 0x164/0x2c0 thread_ fn+0x1f0/ 0x260 kernel_ thread+ 0x5c/0x64 >error_ detected( IO frozen) DID_NO_ CONNECT driverbyte= DRIVER_ OK cmd_age=0s DID_NO_ CONNECT driverbyte= DRIVER_ OK cmd_age=0s DID_NO_ CONNECT driverbyte= DRIVER_ OK cmd_age=0s DID_NO_ CONNECT driverbyte= DRIVER_ OK cmd_age=0s DID_NO_ CONNECT driverbyte= DRIVER_ OK cmd_age=0s
[ 203.753796] EEH: Recovering PHB#3-PE#1fd
[ 203.753813] EEH: PE location: Onboard SAS, PHB location: N/A
[ 203.753817] EEH: Frozen PHB#3-PE#1fd detected
[ 203.753821] EEH: Call Trace:
[ 203.753822] EEH: [c00000000005596c] __eeh_send_
[ 203.753840] EEH: [c00000000004e5b4] eeh_dev_
[ 203.753851] EEH: [c000000000ad9d58] __pci_write_
[ 203.753858] EEH: [c000000000ada2e8] pci_msi_
[ 203.753863] EEH: [c000000000241398] msi_domain_
[ 203.753869] EEH: [c000000000230c04] irq_do_
[ 203.753874] EEH: [c000000000240414] irq_migrate_
[ 203.753879] EEH: [c0000000000d3f24] xive_smp_
[ 203.753883] EEH: [c0000000000e5eb8] pnv_smp_
[ 203.753888] EEH: [c000000000063c84] __cpu_disable+
[ 203.753892] EEH: [c000000000170c4c] take_cpu_
[ 203.753896] EEH: [c0000000002da8d4] multi_cpu_
[ 203.753902] EEH: [c0000000002d9ee4] cpu_stopper_
[ 203.753907] EEH: [c0000000001b4e90] smpboot_
[ 203.753911] EEH: [c0000000001adaa4] kthread+0x124/0x130
[ 203.753916] EEH: [c00000000000ce54] ret_from_
[ 203.753920] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[ 203.753922] EEH: Notify device drivers to shutdown
[ 203.753925] EEH: Beginning: 'error_detected(IO frozen)'
[ 203.753928] PCI 0003:01:00.0#01fd: EEH: Invoking aacraid-
[ 203.753934] aacraid 0003:01:00.0: aacraid: PCI error detected 2
[ 203.774865] sd 0:2:0:0: [sda] tag#404 FAILED Result: hostbyte=
[ 203.774884] sd 0:2:0:0: [sda] tag#404 CDB: Write(10) 2a 00 23 00 ad 00 00 02 00 00
[ 203.774890] I/O error, dev sda, sector 587246848 op 0x1:(WRITE) flags 0x4800 phys_seg 4 prio class 0
[ 203.774903] sd 0:2:0:0: [sda] tag#405 FAILED Result: hostbyte=
[ 203.774909] sd 0:2:0:0: [sda] tag#405 CDB: Write(10) 2a 00 23 00 af 00 00 02 00 00
[ 203.774912] I/O error, dev sda, sector 587247360 op 0x1:(WRITE) flags 0x4800 phys_seg 4 prio class 0
[ 203.774933] sd 0:2:0:0: [sda] tag#406 FAILED Result: hostbyte=
[ 203.774938] sd 0:2:0:0: [sda] tag#406 CDB: Write(10) 2a 00 23 00 b1 00 00 02 00 00
[ 203.774940] I/O error, dev sda, sector 587247872 op 0x1:(WRITE) flags 0x4800 phys_seg 4 prio class 0
[ 203.774948] sd 0:2:0:0: [sda] tag#407 FAILED Result: hostbyte=
[ 203.774952] sd 0:2:0:0: [sda] tag#407 CDB: Write(10) 2a 00 23 00 b3 00 00 02 00 00
[ 203.774954] I/O error, dev sda, sector 587248384 op 0x1:(WRITE) flags 0x4800 phys_seg 4 prio class 0
[ 203.774962] sd 0:2:0:0: [sda] tag#408 FAILED Result: hostbyte=
[ 203.774965] sd 0:2:0:0: [sda] tag#408 CDB: Write(10) 2a 00 23...