cpuset_hotplug in ubuntu_ltp_controllers hang with X-aws VM instances

Bug #2026722 reported by Po-Hsu Lin
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
ubuntu-kernel-tests
New
Undecided
Unassigned

Bug Description

Issue found on X-aws 4.4.0-1158.173

The cpuset_hotplug test will timeout with the 45 minutes on specific VMs (passed on AWS bare-metals) because of task hang. Here is the list of failed instances:
* c4.large
  - xen / xen-hvm, 2 cpus
  - fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single kaiser fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt
* c5n.large
  - kvm, 2 cpus
  - fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
* m5a.large
  - kvm, 2 cpus
  - fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero arat npt nrip_save
* r5.large
  - kvm, 2 cpus
  - fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
* t3.medium
  - kvm, 2 cpus
  - fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku

Passed on these VMs:
* c3.xlarge
  - xen / xen-hvm, 4 cpus
  - fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm kaiser fsgsbase smep erms xsaveopt
* t3a.2xlarge
  - kvm, 8 cpus
  - fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero arat npt nrip_save
* x1e.xlarge
  - xen / xen-hvm, 4 cpus
  - fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single kaiser fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt

Test output:
<<<test_start>>>
tag=cpuset_hotplug stime=1688982088
cmdline="cpuset_hotplug_test.sh"
contacts=""
analysis=exit
<<<test_output>>>
cpuset_hotplug 1 TINFO: CPUs are numbered continuously starting at 0 (0-1)
cpuset_hotplug 1 TINFO: Nodes are numbered continuously starting at 0 (0)
cpuset_hotplug 1 TPASS: Cpuset vs CPU hotplug test succeeded.
cpuset_hotplug 3 TPASS: Cpuset vs CPU hotplug test succeeded.
cpuset_hotplug 5 TPASS: Cpuset vs CPU hotplug test succeeded.

This issue can be reproduced with X-aws-1128, dmesg output:
[ 1083.816153] LTP: starting cpuset_hotplug (cpuset_hotplug_test.sh)
[ 1083.948273] Broke affinity for irq 4
[ 1083.948396] Broke affinity for irq 61
[ 1083.948402] Broke affinity for irq 62
[ 1083.948408] Broke affinity for irq 63
[ 1083.948413] Broke affinity for irq 64
[ 1083.949824] smpboot: CPU 1 is now offline
[ 1085.000502] installing Xen timer for CPU 1
[ 1085.020261] x86: Booting SMP configuration:
[ 1085.020263] smpboot: Booting Node 0 Processor 1 APIC 0x1
[ 1085.021673] Skipped synchronization checks as TSC is reliable.
[ 1085.021693] cpu 1 spinlock event irq 59
[ 1085.069370] smpboot: CPU 1 is now offline
[ 1085.081932] installing Xen timer for CPU 1
[ 1085.100218] x86: Booting SMP configuration:
[ 1085.100220] smpboot: Booting Node 0 Processor 1 APIC 0x1
[ 1085.101588] Skipped synchronization checks as TSC is reliable.
[ 1085.101608] cpu 1 spinlock event irq 59
[ 1086.405418] smpboot: CPU 1 is now offline
[ 1087.467625] installing Xen timer for CPU 1
[ 1087.484259] x86: Booting SMP configuration:
[ 1087.484261] smpboot: Booting Node 0 Processor 1 APIC 0x1
[ 1087.485700] Skipped synchronization checks as TSC is reliable.
[ 1087.485721] cpu 1 spinlock event irq 59
[ 1320.308124] INFO: task systemd:1 blocked for more than 120 seconds.
[ 1320.311249] Not tainted 4.4.0-1128-aws #142-Ubuntu
[ 1320.313959] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1320.318032] systemd D ffff8800ea4f7d28 0 1 0 0x00000000
[ 1320.318037] ffff8800ea4f7d28 ffff8800372adb40 ffffffff81e13500 ffff8800ea4f8000
[ 1320.318042] ffff8800ea4f8000 ffffffff81e605e4 ffff8800ea4f8000 00000000ffffffff
[ 1320.318045] ffffffff81e605e8 ffff8800ea4f7d40 ffffffff81844555 ffffffff81e605e0
[ 1320.318048] Call Trace:
[ 1320.318057] [<ffffffff81844555>] schedule+0x35/0x80
[ 1320.318061] [<ffffffff818448ae>] schedule_preempt_disabled+0xe/0x10
[ 1320.318065] [<ffffffff81846767>] __mutex_lock_slowpath+0xb7/0x130
[ 1320.318068] [<ffffffff818467ff>] mutex_lock+0x1f/0x30
[ 1320.318074] [<ffffffff8111f1bc>] proc_cgroup_show+0x4c/0x2e0
[ 1320.318078] [<ffffffff81289ac5>] proc_single_show+0x55/0x90
[ 1320.318083] [<ffffffff8123e662>] seq_read+0xf2/0x3d0
[ 1320.318088] [<ffffffff81354d86>] ? security_file_permission+0xa6/0xc0
[ 1320.318093] [<ffffffff8121820b>] __vfs_read+0x1b/0x40
[ 1320.318096] [<ffffffff81218986>] vfs_read+0x86/0x130
[ 1320.318100] [<ffffffff812196dc>] SyS_read+0x5c/0xe0
[ 1320.318104] [<ffffffff8184905b>] entry_SYSCALL_64_fastpath+0x22/0xd0
[ 1320.318118] INFO: task kworker/0:3:473 blocked for more than 120 seconds.
[ 1320.321669] Not tainted 4.4.0-1128-aws #142-Ubuntu
[ 1320.324364] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1320.328026] kworker/0:3 D ffff8800364cfc68 0 473 2 0x00000000
[ 1320.328031] Workqueue: cgroup_destroy css_killed_work_fn
[ 1320.328033] ffff8800364cfc68 ffff8800d3ff4cc0 ffff880036d94600 ffff880036f9f000
[ 1320.328035] ffff8800364d0000 ffffffff81e42fa4 ffff880036f9f000 00000000ffffffff
[ 1320.328037] ffffffff81e42fa8 ffff8800364cfc80 ffffffff81844555 ffffffff81e42fa0
[ 1320.328039] Call Trace:
[ 1320.328043] [<ffffffff81844555>] schedule+0x35/0x80
[ 1320.328045] [<ffffffff818448ae>] schedule_preempt_disabled+0xe/0x10
[ 1320.328047] [<ffffffff81846767>] __mutex_lock_slowpath+0xb7/0x130
[ 1320.328052] [<ffffffff818467ff>] mutex_lock+0x1f/0x30
[ 1320.328072] [<ffffffff8108474c>] get_online_cpus+0x2c/0x50
[ 1320.328078] [<ffffffff81033ace>] arch_jump_label_transform+0x2e/0x120
[ 1320.328081] [<ffffffff81844011>] ? __schedule+0x301/0x810
[ 1320.328085] [<ffffffff81193a96>] __jump_label_update+0x76/0x90
[ 1320.328088] [<ffffffff81193b38>] jump_label_update+0x88/0x90
[ 1320.328092] [<ffffffff81193de1>] __static_key_slow_dec+0x41/0x90
[ 1320.328095] [<ffffffff81193e52>] static_key_slow_dec+0x22/0x60
[ 1320.328098] [<ffffffff8112263f>] cpuset_css_offline+0x3f/0x70
[ 1320.328102] [<ffffffff811195ac>] css_killed_work_fn+0x3c/0xd0
[ 1320.328106] [<ffffffff8109ec7b>] process_one_work+0x16b/0x4e0
[ 1320.328109] [<ffffffff8109f03e>] worker_thread+0x4e/0x580
[ 1320.328116] [<ffffffff8109eff0>] ? process_one_work+0x4e0/0x4e0
[ 1320.328119] [<ffffffff810a55e7>] kthread+0xe7/0x100
[ 1320.328122] [<ffffffff81844011>] ? __schedule+0x301/0x810
[ 1320.328128] [<ffffffff810a5500>] ? kthread_create_on_node+0x1e0/0x1e0
[ 1320.328131] [<ffffffff818494e5>] ret_from_fork+0x55/0x80
[ 1320.328134] [<ffffffff810a5500>] ? kthread_create_on_node+0x1e0/0x1e0
[ 1320.328142] INFO: task irqbalance:1189 blocked for more than 120 seconds.
[ 1320.331605] Not tainted 4.4.0-1128-aws #142-Ubuntu
[ 1320.334291] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1320.338224] irqbalance D ffff8800e8b07ce8 0 1189 1 0x00000000
[ 1320.338228] ffff8800e8b07ce8 ffff8800e8b07dd0 ffffffff81e13500 ffff8800369dc600
[ 1320.338232] ffff8800e8b08000 ffff8800eb64f7cc ffff8800369dc600 00000000ffffffff
[ 1320.338235] ffff8800eb64f7d0 ffff8800e8b07d00 ffffffff81844555 ffff8800eb64f7c8
[ 1320.338238] Call Trace:
[ 1320.338242] [<ffffffff81844555>] schedule+0x35/0x80
[ 1320.338249] [<ffffffff818448ae>] schedule_preempt_disabled+0xe/0x10
[ 1320.338256] [<ffffffff81846767>] __mutex_lock_slowpath+0xb7/0x130
[ 1320.338259] [<ffffffff818467ff>] mutex_lock+0x1f/0x30
[ 1320.338264] [<ffffffff81554500>] online_show+0x20/0x60
[ 1320.338269] [<ffffffff81554a63>] dev_attr_show+0x23/0x60
[ 1320.338272] [<ffffffff818467f2>] ? mutex_lock+0x12/0x30
[ 1320.338279] [<ffffffff8129aeaf>] sysfs_kf_seq_show+0xbf/0x140
[ 1320.338283] [<ffffffff812993a9>] kernfs_seq_show+0x29/0x30
[ 1320.338285] [<ffffffff8123e662>] seq_read+0xf2/0x3d0
[ 1320.338289] [<ffffffff81299b5d>] kernfs_fop_read+0x10d/0x160
[ 1320.338292] [<ffffffff81354d86>] ? security_file_permission+0xa6/0xc0
[ 1320.338296] [<ffffffff8121820b>] __vfs_read+0x1b/0x40
[ 1320.338299] [<ffffffff81218986>] vfs_read+0x86/0x130
[ 1320.338303] [<ffffffff812196dc>] SyS_read+0x5c/0xe0
[ 1320.338306] [<ffffffff8184905b>] entry_SYSCALL_64_fastpath+0x22/0xd0
[ 1320.338317] INFO: task kworker/0:2:1448 blocked for more than 120 seconds.
[ 1320.341877] Not tainted 4.4.0-1128-aws #142-Ubuntu
[ 1320.344538] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1320.348296] kworker/0:2 D ffff8800bf097d38 0 1448 2 0x00000000
[ 1320.348305] Workqueue: events vmstat_shepherd
[ 1320.348307] ffff8800bf097d38 ffffffff8141c925 ffffffff81e13500 ffff88008aa32a00
[ 1320.348310] ffff8800bf098000 ffffffff81e42fa4 ffff88008aa32a00 00000000ffffffff
[ 1320.348313] ffffffff81e42fa8 ffff8800bf097d50 ffffffff81844555 ffffffff81e42fa0
[ 1320.348316] Call Trace:
[ 1320.348321] [<ffffffff8141c925>] ? find_next_bit+0x15/0x20
[ 1320.348329] [<ffffffff81844555>] schedule+0x35/0x80
[ 1320.348332] [<ffffffff818448ae>] schedule_preempt_disabled+0xe/0x10
[ 1320.348336] [<ffffffff81846767>] __mutex_lock_slowpath+0xb7/0x130
[ 1320.348342] [<ffffffff818467ff>] mutex_lock+0x1f/0x30
[ 1320.348345] [<ffffffff8108474c>] get_online_cpus+0x2c/0x50
[ 1320.348350] [<ffffffff811b65c1>] vmstat_shepherd+0x21/0xd0
[ 1320.348354] [<ffffffff8109ec7b>] process_one_work+0x16b/0x4e0
[ 1320.348362] [<ffffffff8109f03e>] worker_thread+0x4e/0x580
[ 1320.348366] [<ffffffff8109eff0>] ? process_one_work+0x4e0/0x4e0
[ 1320.348369] [<ffffffff810a55e7>] kthread+0xe7/0x100
[ 1320.348372] [<ffffffff81844011>] ? __schedule+0x301/0x810
[ 1320.348375] [<ffffffff810a5500>] ? kthread_create_on_node+0x1e0/0x1e0
[ 1320.348378] [<ffffffff818494e5>] ret_from_fork+0x55/0x80
[ 1320.348381] [<ffffffff810a5500>] ? kthread_create_on_node+0x1e0/0x1e0
[ 1320.348386] INFO: task echo:1878 blocked for more than 120 seconds.
[ 1320.351463] Not tainted 4.4.0-1128-aws #142-Ubuntu
[ 1320.354182] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1320.358193] echo D ffff88008aaf3c28 0 1878 1761 0x00000000
[ 1320.358197] ffff88008aaf3c28 000000000000001b ffffffff81e13500 ffff880036a83800
[ 1320.358201] ffff88008aaf4000 ffffffff81e6aec4 ffff880036a83800 00000000ffffffff
[ 1320.358204] ffffffff81e6aec8 ffff88008aaf3c40 ffffffff81844555 ffffffff81e6aec0
[ 1320.358207] Call Trace:
[ 1320.358212] [<ffffffff81844555>] schedule+0x35/0x80
[ 1320.358218] [<ffffffff818448ae>] schedule_preempt_disabled+0xe/0x10
[ 1320.358226] [<ffffffff81846767>] __mutex_lock_slowpath+0xb7/0x130
[ 1320.358230] [<ffffffff810ddf79>] ? vprintk_default+0x29/0x40
[ 1320.358233] [<ffffffff818467ff>] mutex_lock+0x1f/0x30
[ 1320.358239] [<ffffffff81193d70>] static_key_slow_inc+0x70/0xa0
[ 1320.358243] [<ffffffff810ab19d>] sched_cpu_active+0xdd/0xf0
[ 1320.358247] [<ffffffff810a659c>] notifier_call_chain+0x4c/0x70
[ 1320.358251] [<ffffffff810a66fe>] __raw_notifier_call_chain+0xe/0x10
[ 1320.358258] [<ffffffff81084913>] cpu_notify+0x23/0x40
[ 1320.358262] [<ffffffff810850c3>] _cpu_up+0x183/0x1b0
[ 1320.358265] [<ffffffff810851a2>] cpu_up+0xb2/0xd0
[ 1320.358268] [<ffffffff8155c5ed>] cpu_subsys_online+0x3d/0x90
[ 1320.358271] [<ffffffff81556f12>] device_online+0x62/0x90
[ 1320.358274] [<ffffffff81556fc9>] online_store+0x89/0xa0
[ 1320.358277] [<ffffffff81553e8b>] dev_attr_store+0x1b/0x30
[ 1320.358281] [<ffffffff8129a89c>] sysfs_kf_write+0x3c/0x50
[ 1320.358284] [<ffffffff8129a403>] kernfs_fop_write+0x143/0x180
[ 1320.358288] [<ffffffff8121833b>] __vfs_write+0x1b/0x40
[ 1320.358291] [<ffffffff81218ad9>] vfs_write+0xa9/0x1a0
[ 1320.358295] [<ffffffff812197bc>] SyS_write+0x5c/0xe0
[ 1320.358297] [<ffffffff8184905b>] entry_SYSCALL_64_fastpath+0x22/0xd0

This failure will cause the following cpuset_memory to fail. And leads to test interruption.

Po-Hsu Lin (cypressyew)
tags: added: ubuntu-ltp-controllers
Po-Hsu Lin (cypressyew)
description: updated
Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

I have cross checked -1128, -1158 with our LTP latest commit 3ebc2dfa85 and the older version ac1a3e40c5. This hang can be 100% reproduced.

And this is our first time running ubuntu_ltp_controllers test on X-aws.

description: updated
Revision history for this message
Po-Hsu Lin (cypressyew) wrote :
Po-Hsu Lin (cypressyew)
description: updated
Po-Hsu Lin (cypressyew)
description: updated
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.