That one completed one run of the test OK, but then crashed on the second one, when bringing CPU 15 back online, with the following dmesg output:
[ 160.596312] EDAC MC0: Giving out device to module sb_edac.c controller Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0 (INTERRUPT)
[ 160.596537] EDAC MC1: Giving out device to module sb_edac.c controller Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0 (INTERRUPT)
[ 160.596679] EDAC sbridge: Some needed devices are missing
[ 160.627089] EDAC MC: Removed device 0 for sb_edac.c Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0
[ 160.651100] EDAC MC: Removed device 1 for sb_edac.c Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0
[ 160.651271] EDAC sbridge: Couldn't find mci handler
[ 160.651422] EDAC sbridge: Couldn't find mci handler
[ 160.651572] EDAC sbridge: Failed to register device with error -19.
[ 161.099074] BUG: unable to handle kernel paging request at 0000000180040100
[ 161.099512] IP: __kmalloc_node+0x135/0x2a0
[ 161.099704] PGD 1ff1f01067
[ 161.099705] P4D 1ff1f01067
[ 161.099871] PUD 0
[ 161.100373] Oops: 0000 [#2] SMP
[ 161.100548] Modules linked in: nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp intel_cstate kvm_intel kvm irqbypass intel_rapl_perf joydev input_leds ipmi_ssif ipmi_si ipmi_devintf ipmi_msghandler mei_me mei shpchp lpc_ich acpi_pad mac_hid acpi_power_meter ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure scsi_transport_sas fnic crct10dif_pclmul crc32_pclmul mgag200 ghash_clmulni_intel ttm pcbc igb hid_generic drm_kms_helper aesni_intel dca syscopyarea i2c_algo_bit sysfillrect aes_x86_64 sysimgblt usbhid libfcoe crypto_simd fb_sys_fops ahci ptp glue_helper hid mxm_wmi libfc cryptd libahci
[ 161.102507] pps_core drm enic scsi_transport_fc megaraid_sas wmi
[ 161.102856] CPU: 2 PID: 3686 Comm: python3 Tainted: G D 4.13.0-13-generic #14~lp1733662Commit8d9d2235a82ea41
[ 161.103230] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[ 161.103624] task: ffff8f3de5989740 task.stack: ffffa3a7ce288000
[ 161.104024] RIP: 0010:__kmalloc_node+0x135/0x2a0
[ 161.104431] RSP: 0018:ffffa3a7ce28bc30 EFLAGS: 00010246
[ 161.104846] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000f95
[ 161.105274] RDX: 0000000000000f94 RSI: 0000000000000000 RDI: 000000000001f3e0
[ 161.105705] RBP: ffffa3a7ce28bc70 R08: ffff8f3dffc9f3e0 R09: ffff8f3dff807c00
[ 161.106148] R10: ffffffffbb017760 R11: ffff8f5df8fa21f2 R12: 00000000014080c0
[ 161.106599] R13: 0000000000000008 R14: 0000000180040100 R15: ffff8f3dff807c00
[ 161.107057] FS: 00007f7849b98700(0000) GS:ffff8f3dffc80000(0000) knlGS:0000000000000000
[ 161.107530] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 161.108014] CR2: 0000000180040100 CR3: 0000001ff6e6e000 CR4: 00000000001406e0
[ 161.108509] Call Trace:
[ 161.109012] ? alloc_cpumask_var_node+0x1f/0x30
[ 161.109523] ? on_each_cpu_cond+0x160/0x160
[ 161.110036] alloc_cpumask_var_node+0x1f/0x30
[ 161.110558] zalloc_cpumask_var_node+0xf/0x20
[ 161.111084] smpcfd_prepare_cpu+0x64/0xc0
[ 161.111615] cpuhp_invoke_callback+0x84/0x3b0
[ 161.112151] cpuhp_up_callbacks+0x36/0xc0
[ 161.112690] _cpu_up+0x87/0xd0
[ 161.113235] do_cpu_up+0x8b/0xb0
[ 161.113785] cpu_up+0x13/0x20
[ 161.114342] cpu_subsys_online+0x3d/0x90
[ 161.114881] device_online+0x4a/0x90
[ 161.115422] online_store+0x89/0xa0
[ 161.115951] dev_attr_store+0x18/0x30
[ 161.116472] sysfs_kf_write+0x37/0x40
[ 161.116994] kernfs_fop_write+0x11c/0x1a0
[ 161.117510] __vfs_write+0x18/0x40
[ 161.118029] vfs_write+0xb1/0x1a0
[ 161.118544] SyS_write+0x55/0xc0
[ 161.119062] entry_SYSCALL_64_fastpath+0x1e/0xa9
[ 161.119581] RIP: 0033:0x7f78497784a0
[ 161.120081] RSP: 002b:00007fff6e69ed48 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 161.120602] RAX: ffffffffffffffda RBX: 0000000001ea8410 RCX: 00007f78497784a0
[ 161.121129] RDX: 0000000000000002 RSI: 0000000001fbe400 RDI: 0000000000000003
[ 161.121666] RBP: 0000000000a3e020 R08: 0000000000000000 R09: 0000000000000001
[ 161.122202] R10: 0000000000000100 R11: 0000000000000246 R12: 0000000000000003
[ 161.122720] R13: 0000000000501520 R14: 00007fff6e69f1b0 R15: 00007f7848690240
[ 161.123226] Code: 89 cf 4c 89 4d c0 e8 0b 7f 01 00 49 89 c7 4c 8b 4d c0 4d 85 ff 0f 85 47 ff ff ff 45 31 f6 eb 3c 49 63 47 20 49 8b 3f 48 8d 4a 01 <49> 8b 1c 06 4c 89 f0 65 48 0f c7 0f 0f 94 c0 84 c0 0f 84 20 ff
[ 161.124251] RIP: __kmalloc_node+0x135/0x2a0 RSP: ffffa3a7ce28bc30
[ 161.124738] CR2: 0000000180040100
[ 161.125220] ---[ end trace 1246d63efc5b2bf0 ]---
Rather than hang, as has happened before, the script crashed ("Killed" was displayed and I was dropped back to a bash prompt). The system behaved unreliably and I was forced to reboot it via its BMC.
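For anyone trying to reproduce this: the call trace above (SyS_write, online_store, cpu_subsys_online, cpu_up) is consistent with a write to a CPU's sysfs "online" file from the python3 process. The following is only a minimal sketch of that kind of offline/online cycle, not the actual test script. The function names and the `root` parameter are hypothetical; `root` exists so the logic can be tried against a scratch directory instead of the real `/sys/devices/system/cpu`.

```python
from pathlib import Path

# Standard Linux sysfs location for per-CPU hotplug control files.
SYSFS_CPU_ROOT = Path("/sys/devices/system/cpu")


def cpu_online_path(cpu: int, root: Path = SYSFS_CPU_ROOT) -> Path:
    """Return the control file for one CPU, e.g. .../cpu15/online."""
    return root / f"cpu{cpu}" / "online"


def set_cpu_online(cpu: int, online: bool, root: Path = SYSFS_CPU_ROOT) -> None:
    """Write "1" (online) or "0" (offline); needs root on a real system."""
    cpu_online_path(cpu, root).write_text("1" if online else "0")


def is_cpu_online(cpu: int, root: Path = SYSFS_CPU_ROOT) -> bool:
    """Read the control file back and report the CPU's state."""
    return cpu_online_path(cpu, root).read_text().strip() == "1"


def cycle_cpu(cpu: int, root: Path = SYSFS_CPU_ROOT) -> bool:
    """Take one CPU offline, bring it back, and report whether it came back."""
    set_cpu_online(cpu, False, root)
    set_cpu_online(cpu, True, root)
    return is_cpu_online(cpu, root)
```

On a real machine, `cycle_cpu(15)` run as root would correspond to the step that failed here, the write of "1" that brings CPU 15 back online. Running it on a kernel with this bug may leave the system unstable.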