System hang with Linux kernel due to mainline commit 24247aeeabe
- Bionic (18.04)
- Bug #1733662
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
linux (Ubuntu) | Fix Released | High | Joseph Salisbury | |
Artful | Fix Released | High | Joseph Salisbury | |
Bionic | Fix Committed | High | Joseph Salisbury | |
Bug Description
== SRU Justification ==
The following mainline commit introduced a regression in v4.14-rc1:
24247aeeabe9 ("x86/intel_
This commit made its way into Artful via Launchpad bug 1591609 as Artful commit
ac2fc5adab0f4b.
This bug was causing regression tests to hang about one in four
times when running cpu_offlining tests.
The patch that fixes this regression was just submitted to mainline, so it is also
needed in Bionic.
== Fix ==
commit d47924417319e3b
Author: Thomas Gleixner <email address hidden>
Date: Tue Jan 16 19:59:59 2018 +0100
x86/
== Regression Potential ==
Low. This patch fixes an existing regression, which is a use-after-free.
### Original Bug Description ###
While doing Ubuntu 17.10 regression testing, we encountered one computer (boldore, a Cisco UCS C240 M4 [VIC]) that hangs about one in four times when running our cpu_offlining test. This test takes all the CPU cores offline except one, then brings them back online. The test ran successfully on boldore with previous releases, but with 17.10 the system hangs in about one in four runs. After reverting to Ubuntu 16.04.3, I found no problems; but when I upgraded the 16.04.3 installation to linux-image-
I initiated this bug report from an Ubuntu 16.04.3 installation running a 4.10 kernel; but as I said, this applies to the 4.13 kernel.
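As a rough illustration of what a cpu_offlining-style test does, here is a minimal Python sketch. It is not the certification script itself: it assumes the standard Linux sysfs hotplug interface (`/sys/devices/system/cpu/cpuN/online`), and all function names are hypothetical. CPUs without an `online` file (commonly CPU 0 on x86) are skipped, which matches the "all except one" behavior described above.

```python
from pathlib import Path
from typing import List

# Assumption: standard Linux CPU hotplug sysfs layout. Pass a different
# root to exercise the logic against a scratch directory instead.
SYSFS_CPU_ROOT = Path("/sys/devices/system/cpu")


def hotpluggable_cpus(root: Path = SYSFS_CPU_ROOT) -> List[int]:
    """Return the numbers of CPUs that expose an 'online' control file."""
    cpus = []
    for entry in root.glob("cpu[0-9]*"):
        if (entry / "online").exists():
            cpus.append(int(entry.name[3:]))
    return sorted(cpus)


def set_online(cpu: int, online: bool, root: Path = SYSFS_CPU_ROOT) -> None:
    """Write 1 or 0 to the CPU's 'online' file (needs root on real sysfs)."""
    (root / f"cpu{cpu}" / "online").write_text("1" if online else "0")


def cycle_cpus(root: Path = SYSFS_CPU_ROOT) -> List[int]:
    """Take every hot-pluggable CPU offline, then bring each back online."""
    cpus = hotpluggable_cpus(root)
    for cpu in cpus:
        set_online(cpu, False, root)
    for cpu in cpus:
        set_online(cpu, True, root)
    return cpus
```

On a real system, `cycle_cpus()` must run as root. On an affected kernel a single cycle may not trigger the hang, so the test would need to be repeated several times.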
ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-
ProcVersionSign
Uname: Linux 4.10.0-38-generic x86_64
ApportVersion: 2.20.1-0ubuntu2.10
Architecture: amd64
Date: Tue Nov 21 17:36:06 2017
ProcEnviron:
TERM=xterm-
PATH=(custom, no user)
XDG_RUNTIME_
LANG=en_US.UTF-8
SHELL=/bin/bash
SourcePackage: linux-hwe
UpgradeStatus: No upgrade log present (probably fresh install)
Rod Smith (rodsmith) wrote : | #1 |
- dmesg output with a 4.10 kernel (8.6 KiB, text/plain)
- Dependencies.txt (2.3 KiB, text/plain; charset="utf-8")
- JournalErrors.txt (3.1 KiB, text/plain; charset="utf-8")
- ProcCpuinfoMinimal.txt (1.0 KiB, text/plain; charset="utf-8")
tags: | added: hwcert-server |
Rod Smith (rodsmith) wrote : | #2 |
Rod Smith (rodsmith) wrote : | #3 |
Rod Smith (rodsmith) wrote : | #4 |
- Another dmesg output from feebas (7.2 KiB, text/plain)
Here's the dmesg output from another run on feebas. In this case, the system has become unresponsive via SSH, although the console remains active.
Rod Smith (rodsmith) wrote : | #5 |
- dmesg output from three runs with the 4.15.0-041500rc1 kernel (6.5 KiB, application/x-tar)
I've tried upgrading to the latest development kernel, from http://
* run1.txt -- In this run, the cpu_offlining script successfully shut
down all CPU nodes (except node 0, of course), but when bringing
them up again, the system segfaulted after bringing up several
nodes. Thereafter, any remotely substantive command (top or
shutdown, for instance) hung, although bash remained responsive
and I could take file listings with ls.
* run2.txt -- In this run, the cpu_offlining script segfaulted
when taking CPU nodes offline. The system then became unreliable
in the same way as with run 1.
* run3.txt -- In this run, the script seemed to complete successfully,
but the dmesg output includes errors associated with bringing up
several nodes. The system SEEMED TO operate normally thereafter,
but my testing was limited.
Rod Smith (rodsmith) wrote : | #6 |
- dmesg outputs from several kernels (100.4 KiB, application/x-tar)
Here are some more test runs on boldore, using different kernels, mostly from http://
* 4.10.0-38-generic: No hang or misbehavior; verbose dmesg output.
* 4.11.0-
* 4.12.0-
more verbose and includes multiple "error -22" messages.
* 4.13.0-
now "error -19".
* 4.13.16-
output has no errors and is much shorter.
* 4.14.0-
thereafter; dmesg has multiple "error -19" messages and multiple
general protection fault dumps.
Changed in linux (Ubuntu): | |
importance: | Undecided → High |
Joseph Salisbury (jsalisbury) wrote : | #7 |
When you have a chance, could you also test the current mainline kernel:
http://
This will tell us whether to perform a regular bisect to find the offending commit or, if the bug is already fixed in mainline, a "reverse" bisect to find the commit that fixes it.
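The difference between a regular and a "reverse" bisect can be sketched in a few lines of Python. This is only an illustration of the search idea, not the tooling used for this bug. The function name and the commit labels are hypothetical, and the sketch assumes the property being tested flips exactly once along the history.

```python
from typing import Callable, Sequence


def first_changed(commits: Sequence[str], changed: Callable[[str], bool]) -> str:
    """Binary-search an ordered history for the first commit where
    `changed` is true. Assumes it is false for commits[0], true for
    commits[-1], and flips only once in between."""
    lo, hi = 0, len(commits) - 1  # lo: known unchanged, hi: known changed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if changed(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]
```

For a regular bisect, `changed` answers "does this kernel show the bug?" and the result is the commit that introduced it. For a reverse bisect, `changed` answers "is the bug gone in this kernel?" and the result is the commit that fixed it.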
tags: | added: kernel-da-key performing-bisect |
Joseph Salisbury (jsalisbury) wrote : | #8 |
I see you already tested 4.15-rc1, but it's worthwhile to also test -rc4.
Changed in linux (Ubuntu Artful): | |
status: | New → Triaged |
Changed in linux (Ubuntu Bionic): | |
status: | New → Triaged |
Changed in linux (Ubuntu Artful): | |
importance: | Undecided → High |
assignee: | nobody → Joseph Salisbury (jsalisbury) |
Changed in linux (Ubuntu Bionic): | |
assignee: | nobody → Joseph Salisbury (jsalisbury) |
Rod Smith (rodsmith) wrote : | #9 |
Joseph, I've just tested 4.15-rc4. The script crashed when bringing CPU 9 back up, and the system became responsive to only the simplest commands. This was accompanied by the following dmesg output:
[ 166.722460] Hardware name: Cisco Systems Inc UCSC-C240-
[ 166.722540] RIP: 0010:__
[ 166.722578] RSP: 0000:ffffb75e8c
[ 166.722615] RAX: 0000000000000000 RBX: 43ea0882f873c0e8 RCX: 00000000000001bf
[ 166.722663] RDX: 00000000000001be RSI: 0000000000000000 RDI: 0000000000021040
[ 166.722711] RBP: ffffb75e8c7cbb40 R08: ffff9cc35d341eaa R09: ffff9ca3ff807c00
[ 166.722757] R10: ffffb75e8c7cbd08 R11: bc159441a547de42 R12: ffff9cc35d341eaa
[ 166.722805] R13: 00000000014000c0 R14: 0000000000000007 R15: ffff9ca3ff807c00
[ 166.722852] FS: 000000000000000
[ 166.722905] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 166.722945] CR2: 0000000000000000 CR3: 0000001be7e09001 CR4: 00000000001606e0
[ 166.722992] Call Trace:
[ 166.723020] ? idr_alloc_
[ 166.723051] ? kstrdup_
[ 166.723081] kstrdup+0x31/0x60
[ 166.723107] kstrdup_
[ 166.723137] __kernfs_
[ 166.723168] kernfs_
[ 166.723197] kernfs_
[ 166.723229] sysfs_create_
[ 166.723261] kobject_
[ 166.723294] kobject_
[ 166.723323] ? device_
[ 166.723356] device_
[ 166.723385] cpu_device_
[ 166.723418] ? __slab_
[ 166.723449] ? _cond_resched+
[ 166.723481] cacheinfo_
[ 166.723515] ? get_cpu_
[ 166.723549] cpuhp_invoke_
[ 166.723587] ? padata_
[ 166.725151] cpuhp_thread_
[ 166.726682] smpboot_
[ 166.728221] kthread+0x11e/0x140
[ 166.729701] ? sort_range+
[ 166.731145] ? kthread_
[ 166.732551] ret_from_
[ 166.733906] Code: 4d 01 e0 4d 8b 18 4d 33 99 40 01 00 00 4c 89 c3 4c 31 db 65 48 0f c7 0f 0f 94 c0 84 c0 74 ac 4d 39 d8 74 14 49 63 41 20 48 01 c3 <48> 33 1b 49 33 99 40 01 00 00 0f 18 0b 41 f7 c5 00 80 00 00 0f
[ 166.736776] RIP: __kmalloc_
[ 166.738188] ---[ end trace 39ce10746b0f4324 ]---
If you want direct access to the affected hardware, that can be arranged. (If you've already got access to the certification network in 1SS, the affected system on which I've been doing most of the testing is boldore.) I'm also happy to run tests using test kernels that you give me.
Joseph Salisbury (jsalisbury) wrote : | #10 |
Thanks for testing mainline. The stack trace looks the same as prior kernels. We should perform a regular kernel bisect to identify the commit that introduced this regression.
It sounds like none of the upstream kernels exhibit this bug, per comment #6. Is that correct?
If that is the case, it may be due to an Ubuntu SAUCE patch. Can you give an early 17.10 kernel a test:
https:/
Rod Smith (rodsmith) wrote : | #11 |
The upstream 4.14.0 kernel DOES segfault, but none of the 4.13-series kernels does. Some of the 4.13-series kernels do have "error -19" or "error -22" messages in their dmesg output, though.
I've tried the kernel at https:/
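The "error -19" and "error -22" messages mentioned above can be decoded with a short Python helper, assuming they follow the usual kernel convention of reporting a negated errno value. The helper name is hypothetical. On Linux, 19 is ENODEV and 22 is EINVAL.

```python
import errno
import os


def describe_kernel_error(code: int) -> str:
    """Map a negative kernel-style return code to its errno name and text.

    Assumption: the code is a negated errno value, e.g. -19 for ENODEV.
    """
    number = -code
    name = errno.errorcode.get(number, "unknown")
    return f"{name}: {os.strerror(number)}"
```

For example, `describe_kernel_error(-19)` returns a string starting with `ENODEV`, and `describe_kernel_error(-22)` one starting with `EINVAL`.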
Joseph Salisbury (jsalisbury) wrote : | #12 |
Thanks for testing. So we now know that 4.13.0-16 has the bug but 4.13.0-10 does not.
Can you next try 4.13.0-14:
https:/
Rod Smith (rodsmith) wrote : | #13 |
4.13.0-14 failed when offlining CPU 9:
[ 104.500965] ------------[ cut here ]------------
[ 104.500968] kernel BUG at /build/
[ 104.501256] invalid opcode: 0000 [#1] SMP
[ 104.501422] Modules linked in: nls_iso8859_1 kvm_intel kvm irqbypass joydev input_leds ipmi_ssif ipmi_si ipmi_devintf ipmi_msghandler acpi_pad ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_
[ 104.503659] CPU: 9 PID: 63 Comm: cpuhp/9 Not tainted 4.13.0-14-generic #15-Ubuntu
[ 104.504019] Hardware name: Cisco Systems Inc UCSC-C240-
[ 104.504537] task: ffff9a9838b6ae80 task.stack: ffffb7e90c7b8000
[ 104.504827] RIP: 0010:kfree+
[ 104.505003] RSP: 0018:ffffb7e90c
[ 104.505311] RAX: ffffd9d77eff0020 RBX: ffff9a9800000000 RCX: 00000001802a001a
[ 104.505617] RDX: 0000000000000000 RSI: ffffd9d77fe02400 RDI: 000065a740000000
[ 104.505938] RBP: ffffb7e90c7bbd78 R08: ffff9a9838091ec0 R09: 00000001802a001a
[ 104.506255] R10: ffffd9d77f000000 R11: 0000000000000000 R12: ffffffff87798960
[ 104.506763] R13: ffffffff869dd4f0 R14: 0000000000000009 R15: 0000000000000001
[ 104.507216] FS: 000000000000000
[ 104.507638] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 104.507884] CR2: 00007ffdd8f1bff8 CR3: 00000016ff209000 CR4: 00000000001406e0
[ 104.508188] Call Trace:
[ 104.508311] kfree_const+
[ 104.508468] kobject_
[ 104.508626] device_
[ 104.508796] cpu_cache_
[ 104.508971] ? free_cache_
[ 104.509201] cacheinfo_
[ 104.509401] cpuhp_invoke_
[ 104.509616] cpuhp_down_
[ 104.509812] cpuhp_thread_
[ 104.509997] smpboot_
[ 104.510182] kthread+0x125/0x140
[ 104.510322] ? sort_range+
[ 104.510491] ? kthread_
[ 104.510706] ret_from_
[ 104.510870] Code: 08 49 83 c4 18 48 89 da 4c 89 ee ff d0 49 8b 04 24 48 85 c0 75 e6 e9 0e ff ff ff 49 8b 02 f6 c4 80 75 0a 49 8b 42 20 a8 01 75 02 <0f> 0b 49 8b 02 31 f6 f6 c4 80 74 04 41 8b 72 6c 4c 89 d7 e8 2c
[ 104.511761] RIP: kfree+0x11c/0x160 RSP: ffffb7e90c7bbd60
[ 104.512003] ---[ end trace 2290fcc444ad32ff ]---
Bash remained active, but I couldn't issue any significant commands.
Joseph Salisbury (jsalisbury) wrote : | #14 |
Can you next try 4.13.0-12:
https:/
Rod Smith (rodsmith) wrote : | #15 |
4.13.0-12 seems to be OK; I ran it seven or eight times without a failure.
Joseph Salisbury (jsalisbury) wrote : | #16 |
There was no version 4.13.0-13, so I'll start a bisect between 4.13.0-12 and 4.13.0-14. I'll build a test kernel and post it shortly.
Changed in linux (Ubuntu Artful): | |
status: | Triaged → In Progress |
Changed in linux (Ubuntu Bionic): | |
status: | Triaged → In Progress |
Joseph Salisbury (jsalisbury) wrote : | #17 |
Hmm, now that I looked at the commits between 4.13.0-12 and 4.13.0-14, bug 1734327 looks similar. I built a test kernel already for that bug, and was wondering if you could test it.
The test kernel can be downloaded from:
http://
Can you test that kernel and report back if it has the bug or not?
Rod Smith (rodsmith) wrote : | #18 |
That one failed (the script stopped running after taking CPU 9 offline) with the following dmesg output:
[ 119.360953] ------------[ cut here ]------------
[ 119.360955] kernel BUG at /home/jsalisbur
[ 119.361405] invalid opcode: 0000 [#1] SMP
[ 119.361586] Modules linked in: nls_iso8859_1 kvm_intel kvm irqbypass joydev input_leds ipmi_ssif ipmi_si ipmi_devintf ipmi_msghandler acpi_pad ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_
[ 119.363826] CPU: 9 PID: 63 Comm: cpuhp/9 Not tainted 4.13.0-19-generic #22~lp1731031Tw
[ 119.364209] Hardware name: Cisco Systems Inc UCSC-C240-
[ 119.364687] task: ffff98cff8b49740 task.stack: ffffb3274c7b8000
[ 119.364973] RIP: 0010:kfree+
[ 119.365133] RSP: 0018:ffffb3274c
[ 119.365356] RAX: fffff57a3bff0020 RBX: ffff98cf00000000 RCX: 0000000000000490
[ 119.365663] RDX: 0000000000000000 RSI: ffff98cfff25f4a0 RDI: 0000676f80000000
[ 119.365964] RBP: ffffb3274c7bbd78 R08: 000000000001f4a0 R09: ffffffffbb5dcf6a
[ 119.366262] R10: fffff57a3c000000 R11: 0000000000000000 R12: ffffffffbbf98e60
[ 119.366552] R13: ffffffffbb1dd820 R14: 0000000000000009 R15: 0000000000000001
[ 119.366844] FS: 000000000000000
[ 119.367176] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 119.367412] CR2: 000055cc84772018 CR3: 0000000e48e09000 CR4: 00000000001406e0
[ 119.367706] Call Trace:
[ 119.367824] kfree_const+
[ 119.367975] kobject_
[ 119.368134] device_
[ 119.368311] cpu_cache_
[ 119.368486] ? free_cache_
[ 119.368709] cacheinfo_
[ 119.368897] cpuhp_invoke_
[ 119.369082] cpuhp_down_
[ 119.369253] cpuhp_thread_
[ 119.369433] smpboot_
[ 119.369598] kthread+0x125/0x140
[ 119.369732] ? sort_range+
[ 119.369882] ? kthread_
[ 119.370075] ret_from_
[ 119.370233] Code: 08 49 83 c4 18 48 89 da 4c 89 ee ff d0 49 8b 04 24 48 85 c0 75 e6 e9 0e ff ff ff 49 8b 02 f6 c4 80 75 0a 49 8b 42 20 a8 01 75 02 <0f> 0b 49 8b 02 31 f6 f6 c4 80 74 04 41 8b 72 6c 4c 89 d7 e8 1c
[ 119.371052] RIP: kfree+0x11c/0x160 RSP: ffffb3274c7bbd60
[ 119.371313] ---[ end trace edef5d0868ec0d2a ]---
The system continued to run, and I was able to issue other commands (ifconfig, efibootmgr), but I rebooted just to be safe.
Joseph Salisbury (jsalisbury) wrote : | #19 |
I started a kernel bisect between v4.13.0-12 and v4.13.0-14. The kernel bisect will require testing of about 7-10 test kernels.
I built the first test kernel, up to the following commit:
1c8d41925cff579
The test kernel can be downloaded from:
http://
Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.
Thanks in advance
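The "7-10 test kernels" estimate is consistent with a binary search, which needs roughly log2(N) builds to isolate one commit among N. The thread does not give the actual number of commits between the two versions, so the counts below are illustrative only and the function name is hypothetical.

```python
def bisect_steps(n_commits: int) -> int:
    """Upper bound on test builds needed to isolate one commit among
    n_commits candidates with a binary search (ceil(log2(n)))."""
    if n_commits < 1:
        raise ValueError("need at least one candidate commit")
    return (n_commits - 1).bit_length()
```

Under this bound, about 128 candidate commits would take 7 test kernels and about 1024 would take 10.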
Rod Smith (rodsmith) wrote : | #20 |
There's nothing at the URL you posted, Joseph. Do I just need to give it more time to build, or is something wrong?
Joseph Salisbury (jsalisbury) wrote : | #21 |
Sorry, the packages should be there now. You should only need the linux-image and linux-image-extra .deb files.
Rod Smith (rodsmith) wrote : | #22 |
OK, I've run tests now. The system did not crash or otherwise misbehave, but the dmesg output was quite verbose, and included "error -19" messages. Here's a sample (apparently for just one CPU core; this sequence was repeated quite a few times):
[ 439.341956] smpboot: Booting Node 1 Processor 31 APIC 0x1f
[ 439.354783] EDAC sbridge: Seeking for: PCI ID 8086:2fa0
[ 439.354795] EDAC sbridge: Seeking for: PCI ID 8086:2fa0
[ 439.354814] EDAC sbridge: Seeking for: PCI ID 8086:2fa0
[ 439.354836] EDAC sbridge: Seeking for: PCI ID 8086:2f60
[ 439.354849] EDAC sbridge: Seeking for: PCI ID 8086:2fa8
[ 439.354853] EDAC sbridge: Seeking for: PCI ID 8086:2fa8
[ 439.354859] EDAC sbridge: Seeking for: PCI ID 8086:2fa8
[ 439.354866] EDAC sbridge: Seeking for: PCI ID 8086:2f71
[ 439.354870] EDAC sbridge: Seeking for: PCI ID 8086:2f71
[ 439.354876] EDAC sbridge: Seeking for: PCI ID 8086:2f71
[ 439.354882] EDAC sbridge: Seeking for: PCI ID 8086:2faa
[ 439.354886] EDAC sbridge: Seeking for: PCI ID 8086:2faa
[ 439.354892] EDAC sbridge: Seeking for: PCI ID 8086:2faa
[ 439.354898] EDAC sbridge: Seeking for: PCI ID 8086:2fab
[ 439.354902] EDAC sbridge: Seeking for: PCI ID 8086:2fab
[ 439.354909] EDAC sbridge: Seeking for: PCI ID 8086:2fab
[ 439.354915] EDAC sbridge: Seeking for: PCI ID 8086:2fac
[ 439.354919] EDAC sbridge: Seeking for: PCI ID 8086:2fac
[ 439.354925] EDAC sbridge: Seeking for: PCI ID 8086:2fac
[ 439.354931] EDAC sbridge: Seeking for: PCI ID 8086:2fad
[ 439.354936] EDAC sbridge: Seeking for: PCI ID 8086:2fad
[ 439.354942] EDAC sbridge: Seeking for: PCI ID 8086:2fad
[ 439.354948] EDAC sbridge: Seeking for: PCI ID 8086:2f68
[ 439.354953] EDAC sbridge: Seeking for: PCI ID 8086:2f68
[ 439.354960] EDAC sbridge: Seeking for: PCI ID 8086:2f68
[ 439.354965] EDAC sbridge: Seeking for: PCI ID 8086:2f79
[ 439.354978] EDAC sbridge: Seeking for: PCI ID 8086:2f6a
[ 439.354991] EDAC sbridge: Seeking for: PCI ID 8086:2f6b
[ 439.355003] EDAC sbridge: Seeking for: PCI ID 8086:2f6c
[ 439.355016] EDAC sbridge: Seeking for: PCI ID 8086:2f6d
[ 439.355029] EDAC sbridge: Seeking for: PCI ID 8086:2ffc
[ 439.355033] EDAC sbridge: Seeking for: PCI ID 8086:2ffc
[ 439.355039] EDAC sbridge: Seeking for: PCI ID 8086:2ffc
[ 439.355046] EDAC sbridge: Seeking for: PCI ID 8086:2ffd
[ 439.355049] EDAC sbridge: Seeking for: PCI ID 8086:2ffd
[ 439.355055] EDAC sbridge: Seeking for: PCI ID 8086:2ffd
[ 439.355062] EDAC sbridge: Seeking for: PCI ID 8086:2fbd
[ 439.355067] EDAC sbridge: Seeking for: PCI ID 8086:2fbd
[ 439.355073] EDAC sbridge: Seeking for: PCI ID 8086:2fbd
[ 439.355079] EDAC sbridge: Seeking for: PCI ID 8086:2fbf
[ 439.355084] EDAC sbridge: Seeking for: PCI ID 8086:2fbf
[ 439.355090] EDAC sbridge: Seeking for: PCI ID 8086:2fbf
[ 439.355095] EDAC sbridge: Seeking for: PCI ID 8086:2fb9
[ 439.355101] EDAC sbridge: Seeking for: PCI ID 8086:2fb9
[ 439.355107] EDAC sbridge: Seeking for: PCI ID 8086:2fb9
[ 439.355112] EDAC sbridge: Seeking for: PCI ID 8086:2fbb
[ 439.355117] EDAC sbridge: Seeking for: PCI ID 8086:2fbb
[ 439.355123] EDAC sbridge: Seeking for: PCI ID 8086:2fbb
[ 439.355355] EDAC MC0: Giving out device to module sb_eda...
Joseph Salisbury (jsalisbury) wrote : | #23 |
I built the next test kernel, up to the following commit:
8d9d2235a82ea41
The test kernel can be downloaded from:
http://
Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.
Thanks in advance
Rod Smith (rodsmith) wrote : | #24 |
That one completed one run of the test OK, but then crashed on the second one, when bringing CPU 15 back online, with the following dmesg output:
[ 160.596312] EDAC MC0: Giving out device to module sb_edac.c controller Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0 (INTERRUPT)
[ 160.596537] EDAC MC1: Giving out device to module sb_edac.c controller Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0 (INTERRUPT)
[ 160.596679] EDAC sbridge: Some needed devices are missing
[ 160.627089] EDAC MC: Removed device 0 for sb_edac.c Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0
[ 160.651100] EDAC MC: Removed device 1 for sb_edac.c Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0
[ 160.651271] EDAC sbridge: Couldn't find mci handler
[ 160.651422] EDAC sbridge: Couldn't find mci handler
[ 160.651572] EDAC sbridge: Failed to register device with error -19.
[ 161.099074] BUG: unable to handle kernel paging request at 0000000180040100
[ 161.099512] IP: __kmalloc_
[ 161.099704] PGD 1ff1f01067
[ 161.099705] P4D 1ff1f01067
[ 161.099871] PUD 0
[ 161.100373] Oops: 0000 [#2] SMP
[ 161.100548] Modules linked in: nls_iso8859_1 intel_rapl x86_pkg_
[ 161.102507] pps_core drm enic scsi_transport_fc megaraid_sas wmi
[ 161.102856] CPU: 2 PID: 3686 Comm: python3 Tainted: G D 4.13.0-13-generic #14~lp1733662Co
[ 161.103230] Hardware name: Cisco Systems Inc UCSC-C240-
[ 161.103624] task: ffff8f3de5989740 task.stack: ffffa3a7ce288000
[ 161.104024] RIP: 0010:__
[ 161.104431] RSP: 0018:ffffa3a7ce
[ 161.104846] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000f95
[ 161.105274] RDX: 0000000000000f94 RSI: 0000000000000000 RDI: 000000000001f3e0
[ 161.105705] RBP: ffffa3a7ce28bc70 R08: ffff8f3dffc9f3e0 R09: ffff8f3dff807c00
[ 161.106148] R10: ffffffffbb017760 R11: ffff8f5df8fa21f2 R12: 00000000014080c0
[ 161.106599] R13: 0000000000000008 R14: 0000000180040100 R15: ffff8f3dff807c00
[ 161.107057] FS: 00007f7849b9870
[ 161.107530] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 161.108014] CR2: 0000000180040100 CR3: 0000001ff6e6e000 CR4: 00000000001406e0
[ 161.108509] Call Trace:
[ 161.109012] ? alloc_cpumask_
[ 161.109523] ? on_each_
[ 161.110036] alloc_cpumask_
...
Joseph Salisbury (jsalisbury) wrote : | #25 |
I built the next test kernel, up to the following commit:
83d4a97746e5fac
The test kernel can be downloaded from:
http://
Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.
Thanks in advance
Rod Smith (rodsmith) wrote : | #26 |
The build from http://
Joseph Salisbury (jsalisbury) wrote : | #27 |
Thanks for testing. I'll mark that kernel as good. I think it's safe to ignore the "error -19" messages during the bisect. We just need to tell the bisect whether the kernel exhibits the original bug or not.
I built the next test kernel, up to the following commit:
97327adfdaf5d72
The test kernel can be downloaded from:
http://
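The good/bad decision described in this comment could be automated along these lines. This is a minimal sketch, not part of the actual bisect workflow: the marker list is taken from the dmesg excerpts in this thread, the function name is hypothetical, and "error -19" lines are deliberately not treated as failures.

```python
# Crash signatures seen in the dmesg excerpts in this thread.
FATAL_MARKERS = (
    "kernel BUG at",
    "general protection fault",
    "BUG: unable to handle kernel paging request",
)


def bisect_verdict(dmesg_text: str) -> str:
    """Return 'bad' if any crash signature appears in the log, else 'good'.

    Lines such as 'Failed to register device with error -19' are ignored
    on purpose, since they also occur on kernels without the hang.
    """
    for line in dmesg_text.splitlines():
        if any(marker in line for marker in FATAL_MARKERS):
            return "bad"
    return "good"
```

A hard hang that leaves nothing in the log would still be reported as "good" by this check, so it only complements watching the test run itself.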
Rod Smith (rodsmith) wrote : | #28 |
That one hung much like the others, with the system responding only to very basic commands (mostly bash internals), although the dmesg output continued further after the kernel bug message. Here's the dmesg output:
[ 107.652875] EDAC MC0: Giving out device to module sb_edac.c controller Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0 (INTERRUPT)
[ 107.652995] EDAC MC1: Giving out device to module sb_edac.c controller Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0 (INTERRUPT)
[ 107.653010] EDAC sbridge: Some needed devices are missing
[ 107.675559] EDAC MC: Removed device 0 for sb_edac.c Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0
[ 107.703606] EDAC MC: Removed device 1 for sb_edac.c Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0
[ 107.703639] EDAC sbridge: Couldn't find mci handler
[ 107.704195] EDAC sbridge: Couldn't find mci handler
[ 107.704618] EDAC sbridge: Failed to register device with error -19.
[ 108.163612] smpboot: Booting Node 1 Processor 8 APIC 0x10
[ 108.189804] intel_rapl: Found RAPL domain package
[ 108.189810] intel_rapl: Found RAPL domain dram
[ 108.189812] intel_rapl: DRAM domain energy unit 15300pj
[ 108.190389] ------------[ cut here ]------------
[ 108.190390] kernel BUG at /home/jsalisbur
[ 108.191016] invalid opcode: 0000 [#1] SMP
[ 108.191511] Modules linked in: nls_iso8859_1 intel_rapl x86_pkg_
[ 108.195174] libahci pps_core enic scsi_transport_fc megaraid_sas wmi
[ 108.195756] CPU: 8 PID: 302 Comm: kworker/8:3 Not tainted 4.13.0-13-generic #14~lp1733662Co
[ 108.196353] Hardware name: Cisco Systems Inc UCSC-C240-
[ 108.196971] Workqueue: events cpuset_
[ 108.197583] task: ffff8e3432fcae80 task.stack: ffffb5fb4e104000
[ 108.198236] RIP: 0010:kfree+
[ 108.198861] RSP: 0000:ffffb5fb4e
[ 108.199485] RAX: fffffb0ffeff0020 RBX: ffff8e3400000000 RCX: 000000018020001d
[ 108.200121] RDX: 0000000000000000 RSI: fffffb0fffd33600 RDI: 0000720b40000000
[ 108.200764] RBP: ffffb5fb4e107ce0 R08: ffff8e3434cd8c00 R09: 000000018020001d
[ 108.201405] R10: fffffb0fff000000 R11: 0000000000000000 R12: ffff8e343254f058
[ 108.202053] R13: ffffffff876ce3d3 R14: ffff8e34382b6d10 R15: 0000000000000000
[ 108.202703] FS: 000000000000000
[ 108.203367] CS: 0010 DS: 0000 ES: 0000 CR0: 0000...
Joseph Salisbury (jsalisbury) wrote : | #29 |
I built the next test kernel, up to the following commit:
646779c79c8ab13
The test kernel can be downloaded from:
http://
Rod Smith (rodsmith) wrote : | #30 |
That one ran our test script half a dozen times without failure, albeit with the "Error -19" messages in the dmesg output.
Note that I'm about to EOD, so I probably won't get to the next one until next year. Have a good holiday, Joseph!
Joseph Salisbury (jsalisbury) wrote : | #31 |
I hope you had a good holiday, Rod. I started up the bisect again.
I built the next test kernel, up to the following commit:
9ebf47f152918cc
The test kernel can be downloaded from:
http://
Rod Smith (rodsmith) wrote : | #32 |
Thanks, Joseph. My break was good; I hope yours was, too!
That latest version you posted completed half a dozen runs of the test script without incident, aside from the "error -19" messages.
Joseph Salisbury (jsalisbury) wrote : | #33 |
The bisect should only require testing about 2 or 3 more kernels.
I built the next test kernel, up to the following commit:
aa0998e265482fd
The test kernel can be downloaded from:
http://
Rod Smith (rodsmith) wrote : | #34 |
Joseph, that one also completed six runs with no problems except the "error -19" messages.
Joseph Salisbury (jsalisbury) wrote : | #35 |
I built the next test kernel, up to the following commit:
e6108d5475696d0
The test kernel can be downloaded from:
http://
Rod Smith (rodsmith) wrote : | #36 |
That one completed its first run, but then crashed when bringing CPU 14 back online, with the following dmesg output:
[ 163.176945] ------------[ cut here ]------------
[ 163.176949] kernel BUG at /home/jsalisbur
[ 163.178043] invalid opcode: 0000 [#1] SMP
[ 163.178995] Modules linked in: nls_iso8859_1 intel_rapl x86_pkg_
[ 163.186785] drm pps_core enic scsi_transport_fc megaraid_sas wmi
[ 163.188025] CPU: 14 PID: 93 Comm: cpuhp/14 Not tainted 4.13.0-13-generic #14~lp1733662Co
[ 163.189294] Hardware name: Cisco Systems Inc UCSC-C240-
[ 163.190606] task: ffff8dbaf809c5c0 task.stack: ffffae2acc8a8000
[ 163.191926] RIP: 0010:kfree+
[ 163.193255] RSP: 0000:ffffae2acc
[ 163.194600] RAX: fffff9cb3bff0020 RBX: ffff8dba00000000 RCX: ffffae2acc8abb60
[ 163.195954] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000728480000000
[ 163.197311] RBP: ffffae2acc8abb98 R08: ffffae2acc8abaec R09: 0000000000000002
[ 163.198703] R10: fffff9cb3c000000 R11: 0000000000000000 R12: ffff8d9aff94beb0
[ 163.200096] R13: ffffffffa6f2034b R14: ffff8dbaf27e4318 R15: ffff8dbaf27e4200
[ 163.201497] FS: 000000000000000
[ 163.202919] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 163.204351] CR2: 0000000000000000 CR3: 000000101aa09000 CR4: 00000000001406e0
[ 163.205802] Call Trace:
[ 163.207253] acpi_ns_
[ 163.208704] ? kernfs_
[ 163.210183] ? down_timeout+
[ 163.211644] ? acpi_os_
[ 163.213098] acpi_ns_
[ 163.214550] ? acpi_ns_
[ 163.216016] acpi_get_
[ 163.217486] acpi_has_
[ 163.218932] acpi_processor_
[ 163.220391] ? wrmsrl_
[ 163.221870] acpi_processor_
[ 163.223354] __intel_
[ 163.224835] ? intel_pstate_
[ 163.226323] intel_pstate_
[ 163.227819] cpufreq_
[ 163.229301] ? cpufreq_
[ 163.230781] cpuhp_cpufreq_
[ 163.232262] cpuhp_invoke_
[ 163.233758] cpuhp_up_
[ 163.235254] cpuhp_thr...
Joseph Salisbury (jsalisbury) wrote : | #37 |
I built the next test kernel, up to the following commit:
ac2fc5adab0f4b8
The test kernel can be downloaded from:
http://
Rod Smith (rodsmith) wrote : | #38 |
That one completed two runs, but on the second run, dmesg included the following message at one point:
[ 240.841694] kernel BUG at /home/jsalisbur
[ 240.842765] invalid opcode: 0000 [#1] SMP
[ 240.843718] Modules linked in: nls_iso8859_1 intel_rapl x86_pkg_
[ 240.851457] drm pps_core megaraid_sas scsi_transport_fc enic wmi
[ 240.852693] CPU: 8 PID: 2724 Comm: irqbalance Not tainted 4.13.0-13-generic #14~lp1733662Co
[ 240.853965] Hardware name: Cisco Systems Inc UCSC-C240-
[ 240.855281] task: ffff9b62a76645c0 task.stack: ffffb973cf6fc000
[ 240.856603] RIP: 0010:kfree+
[ 240.857937] RSP: 0018:ffffb973cf
[ 240.859280] RAX: fffff8803cff0020 RBX: ffff9b6200000000 RCX: 0000000000000000
[ 240.860632] RDX: 0000000000000000 RSI: ffff9b62b0eb5348 RDI: 000064dcc0000000
[ 240.861995] RBP: ffffb973cf6ffa20 R08: ffff9b62b22f70f0 R09: 0000000180220021
[ 240.863367] R10: fffff8803d000000 R11: 0000000000000001 R12: ffff9b62b1648780
[ 240.864756] R13: ffffffffb65dd4e0 R14: ffff9b62a872f0d8 R15: ffff9b62a872fac0
[ 240.866145] FS: 00007ff8c4d0674
[ 240.867562] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 240.868986] CR2: 00007fff9ef860f8 CR3: 0000003fe7876000 CR4: 00000000001406e0
[ 240.870438] Call Trace:
[ 240.871882] kfree_const+
[ 240.873328] kernfs_
[ 240.874778] kernfs_
[ 240.876218] __dentry_
[ 240.877644] shrink_
[ 240.879078] d_invalidate+
[ 240.880526] lookup_
[ 240.881968] ? dput.part.
[ 240.883393] walk_component+
[ 240.884811] ? kernfs_
[ 240.886253] link_path_
[ 240.887690] ? path_init+
[ 240.889105] path_lookupat+
[ 240.890529] filename_
[ 240.891964] ? sprintf+0x51/0x70
[ 240.893387] ? __check_
[ 240.894822] ? strncpy_
[ 240.896240] user_path_
[ 240.897673] ? user_path_
[ 240.899101] vfs_statx+0x76/0xe0
[ 240.900517] SYSC_newstat+
[ 240.901934] ? ____fput+0xe/0x10
[ 240.903365] ? task_work_
[ 240.904783] ? exit_to_usermode...
Joseph Salisbury (jsalisbury) wrote : | #39 |
The bisect reported the following as the first bad commit:
commit ac2fc5adab0f4b8
Author: Vikas Shivappa <email address hidden>
Date: Tue Aug 15 18:00:43 2017 -0700
x86/
I built a test kernel with a revert of ac2fc5adab0.
The test kernel can be downloaded from:
http://
Rod Smith (rodsmith) wrote : | #40 |
I'm afraid that one fails, too, on the second run when bringing CPU10 back online. Here's the dmesg output:
[ 154.987312] smpboot: Booting Node 1 Processor 10 APIC 0x14
[ 154.992953] BUG: unable to handle kernel paging request at 0000317865646e69
[ 154.993932] IP: __kmalloc_
[ 154.994847] PGD 0
[ 154.994848] P4D 0
[ 154.997397] Oops: 0000 [#1] SMP
[ 154.998250] Modules linked in: nls_iso8859_1 intel_rapl x86_pkg_
[ 155.005714] libahci pps_core scsi_transport_fc enic megaraid_sas wmi
[ 155.006913] CPU: 10 PID: 69 Comm: cpuhp/10 Not tainted 4.13.0-13-generic #14~lp1733662Co
[ 155.008154] Hardware name: Cisco Systems Inc UCSC-C240-
[ 155.009427] task: ffff91c7b8785d00 task.stack: ffffa8760c7e8000
[ 155.010718] RIP: 0010:__
[ 155.012014] RSP: 0000:ffffa8760c
[ 155.013308] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000000014b9
[ 155.014618] RDX: 00000000000014b8 RSI: 0000000000000000 RDI: 000000000001f3e0
[ 155.015946] RBP: ffffa8760c7ebc80 R08: ffff91c7bf29f3e0 R09: ffff91a7bf807c00
[ 155.017284] R10: ffffa8760c7ebce0 R11: 0000000000000006 R12: 0000317865646e69
[ 155.018620] R13: 00000000014000c0 R14: 0000000000000007 R15: ffff91a7bf807c00
[ 155.019965] FS: 000000000000000
[ 155.021329] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 155.022710] CR2: 0000317865646e69 CR3: 0000000ec6c09000 CR4: 00000000001406e0
[ 155.024101] Call Trace:
[ 155.025490] ? kvasprintf_
[ 155.026906] kvasprintf+
[ 155.028304] kvasprintf_
[ 155.029703] kobject_
[ 155.031101] cpu_device_
[ 155.032485] ? smp_call_
[ 155.033891] cacheinfo_
[ 155.035295] ? get_cpu_
[ 155.036709] cpuhp_invoke_
[ 155.038101] cpuhp_up_
[ 155.039513] cpuhp_thread_
[ 155.040923] smpboot_
[ 155.042319] kthread+0x125/0x140
[ 155.043706] ? sort_range+
[ 155.045107] ? kthread_
[ 155.046515] ret_from_
[ 155.047906] Code: 08 65 4c 03 05 ab e5 7d 5b 49 83 78 10 00 4d 8b 20 0f 84 ef 00 00 00 4d 85 e4 0f 84 e6 00 00 00 49 63 41 20 4...
Joseph Salisbury (jsalisbury) wrote : | #41 |
The uname output suggests you may still be running the kernel from comment #37. The test kernel with the revert should have a name like:
linux-image-
The string "Revert" should be in the uname output.
Rod Smith (rodsmith) wrote : | #42 |
You're right. (I've got too many kernels installed on that system!) When I tested again, it got through eight runs without problems, beyond the "error -19" message. Here's the uname information, just to be sure:
$ uname -a
Linux oil-boldore 4.13.0-21-generic #24~lp1733662Revert SMP Mon Jan 8 15:35:41 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
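To avoid testing the wrong kernel again, the "Revert" check could be scripted. The sketch below is only an illustration with hypothetical function names. It assumes the tag appears in the build-version field that `uname -v` reports, as it does in the output above.

```python
import os


def is_revert_build(version_field: str) -> bool:
    """True if a kernel build-version string carries the 'Revert' tag."""
    return "Revert" in version_field


def running_kernel_is_revert_build() -> bool:
    """Check the kernel that is currently booted (Unix only)."""
    return is_revert_build(os.uname().version)
```

With the output above, the version field is `#24~lp1733662Revert SMP Mon Jan 8 15:35:41 UTC 2018`, which passes the check, whereas the earlier `#14~lp1733662Co` build string does not.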
Joseph Salisbury (jsalisbury) wrote : | #43 |
Thanks for the update. I'll ping the author of mainline commit 24247aeeabe99eab to get some feedback.
Before I do that, can you confirm the bug still exists with the latest mainline kernel:
http://
Rod Smith (rodsmith) wrote : | #44 |
Yes, it still exists. To confirm the kernel version:
$ uname -a
Linux oil-boldore 4.15.0-
The system hung bringing CPU 11 back online, with the following dmesg output:
[ 101.179624] smpboot: Booting Node 1 Processor 11 APIC 0x16
[ 101.727507] general protection fault: 0000 [#1] SMP PTI
[ 101.727812] Modules linked in: nls_iso8859_1 intel_rapl sb_edac x86_pkg_
[ 101.730450] cryptd libfc libahci i2c_algo_bit drm scsi_transport_fc enic megaraid_sas wmi
[ 101.730883] CPU: 6 PID: 3205 Comm: python3 Not tainted 4.15.0-
[ 101.731319] Hardware name: Cisco Systems Inc UCSC-C240-
[ 101.731773] RIP: 0010:__
[ 101.732224] RSP: 0018:ffffa7d0cf
[ 101.732682] RAX: 0000000000000000 RBX: 3b37355eb8b32f18 RCX: 0000000000000349
[ 101.733146] RDX: 0000000000000348 RSI: 0000000000000000 RDI: 0000000000027040
[ 101.733609] RBP: ffffa7d0cf86bc20 R08: ffff94818ede9cdc R09: ffff9461bf807c00
[ 101.734075] R10: ffffffffaaa16cc0 R11: c4c8a1df366db3c4 R12: 00000000014080c0
[ 101.734547] R13: 0000000000000008 R14: ffff94818ede9cdc R15: ffff9461bf807c00
[ 101.735023] FS: 00007f8b0a2c270
[ 101.735510] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 101.735997] CR2: 000056075d0c11a8 CR3: 0000001fe0b32003 CR4: 00000000001606e0
[ 101.736491] Call Trace:
[ 101.736988] ? alloc_cpumask_
[ 101.737488] ? on_each_
[ 101.737986] alloc_cpumask_
[ 101.738489] zalloc_
[ 101.738988] smpcfd_
[ 101.739493] cpuhp_invoke_
[ 101.740012] ? init_idle+
[ 101.740515] _cpu_up+0xb1/0x180
[ 101.741017] do_cpu_up+0x8b/0xb0
[ 101.741515] cpu_up+0x13/0x20
[ 101.742012] cpu_subsys_
[ 101.742510] device_
[ 101.743010] online_
[ 101.743506] dev_attr_
[ 101.744003] sysfs_kf_
[ 101.744501] kernfs_
[ 101.744998] __vfs_write+
[ 101.745494] ? common_
[ 101.745994] ? apparmor_
[ 101.746495] ? security_
[ 101.746993] ? _cond_resched...
Launchpad Janitor (janitor) wrote : | #45 |
Status changed to 'Confirmed' because the bug affects multiple users.
Changed in linux-hwe (Ubuntu Artful): | |
status: | New → Confirmed |
Changed in linux-hwe (Ubuntu): | |
status: | New → Confirmed |
Joseph Salisbury (jsalisbury) wrote : [REGRESSION][v4.14.y][v4.15] x86/intel_rdt/cqm: Improve limbo list processing | #47 |
Hi Vikas,
A kernel bug report was opened against Ubuntu [0]. After a kernel
bisect, it was found that reverting the following commit resolved this bug:
commit 24247aeeabe99ea
Author: Vikas Shivappa <email address hidden>
Date: Tue Aug 15 18:00:43 2017 -0700
x86/
The regression was introduced as of v4.14-rc1 and still exists with
current mainline. The trace with v4.15-rc7 is in comment #44[1].
I was hoping to get your feedback, since you are the patch author. Do
you think gathering any additional data will help diagnose this issue,
or would it be best to submit a revert request?
Thanks,
Joe
[0] http://
[1]
https:/
summary: |
- System hang with Linux kernel 4.13, not with 4.10
+ System hang with Linux kernel due to mainline commit 24247aeeabe |
tglx (tglx) wrote : | #48 |
On Fri, 12 Jan 2018, Joseph Salisbury wrote:
> Hi Vikas,
>
> A kernel bug report was opened against Ubuntu [0]. After a kernel
> bisect, it was found that reverting the following commit resolved this bug:
>
> commit 24247aeeabe99ea
> Author: Vikas Shivappa <email address hidden>
> Date: Tue Aug 15 18:00:43 2017 -0700
>
> x86/intel_rdt/cqm: Improve limbo list processing
>
>
> The regression was introduced as of v4.14-rc1 and still exists with
> current mainline. The trace with v4.15-rc7 is in comment #44[1].
>
> I was hoping to get your feedback, since you are the patch author. Do
> you think gathering any additional data will help diagnose this issue,
> or would it be best to submit a revert request?
That stinks like a use after free. Can you run with KASAN enabled?
Thanks,
tglx
Joseph Salisbury (jsalisbury) wrote : | #49 |
Hi Rod,
I built an Artful test kernel with KASAN enabled.
The test kernel can be downloaded from:
http://
Can you test this kernel as requested by upstream?
tglx (tglx) wrote : | #50 |
Vikas, Fenghua can you please look at that ASAP?
On Sun, 14 Jan 2018, Thomas Gleixner wrote:
> On Fri, 12 Jan 2018, Joseph Salisbury wrote:
>
> > Hi Vikas,
> >
> > A kernel bug report was opened against Ubuntu [0]. After a kernel
> > bisect, it was found that reverting the following commit resolved this bug:
> >
> > commit 24247aeeabe99ea
> > Author: Vikas Shivappa <email address hidden>
> > Date: Tue Aug 15 18:00:43 2017 -0700
> >
> > x86/intel_rdt/cqm: Improve limbo list processing
> >
> >
> > The regression was introduced as of v4.14-rc1 and still exists with
> > current mainline. The trace with v4.15-rc7 is in comment #44[1].
> >
> > I was hoping to get your feedback, since you are the patch author. Do
> > you think gathering any additional data will help diagnose this issue,
> > or would it be best to submit a revert request?
>
> That stinks like a use after free. Can you run with KASAN enabled?
>
> Thanks,
>
> tglx
Rod Smith (rodsmith) wrote : | #51 |
Joseph,
The first run of your latest kernel completed; however, I noticed the following in the dmesg output:
[ 426.281083] =======
[ 426.286615] BUG: KASAN: use-after-free in find_first_
[ 426.291841] Read of size 8 at addr ffff883ff7c1e780 by task cpuhp/31/195
[ 426.302209] CPU: 31 PID: 195 Comm: cpuhp/31 Not tainted 4.13.0-25-generic #29~lp1733662KA
[ 426.302213] Hardware name: Cisco Systems Inc UCSC-C240-
[ 426.302215] Call Trace:
[ 426.302233] dump_stack+
[ 426.302241] ? dma_virt_
[ 426.302252] ? show_regs_
[ 426.302263] print_address_
[ 426.302269] kasan_report+
[ 426.302276] ? find_first_
[ 426.302288] __asan_
[ 426.302295] find_first_
[ 426.302306] has_busy_
[ 426.302314] intel_rdt_
[ 426.302321] ? clear_closid_
[ 426.302333] ? sysfs_remove_
[ 426.302339] ? clear_closid_
[ 426.302351] cpuhp_invoke_
[ 426.302360] ? cpuhp_kick_
[ 426.302372] ? __schedule+
[ 426.302377] ? cpuhp_kick_
[ 426.302385] ? firmware_
[ 426.302395] ? migrate_
[ 426.302402] ? firmware_
[ 426.302407] ? migrate_
[ 426.302414] ? schedule+0xd8/0x2a0
[ 426.302421] ? __schedule+
[ 426.302427] ? default_
[ 426.302439] ? __wake_
[ 426.302446] cpuhp_down_
[ 426.302453] cpuhp_thread_
[ 426.302459] ? cpu_up+0x20/0x20
[ 426.302468] smpboot_
[ 426.302474] ? sort_range+
[ 426.302482] kthread+0x1b7/0x1e0
[ 426.302488] ? sort_range+
[ 426.302493] ? kthread_
[ 426.302500] ret_from_
[ 426.307683] Allocated by task 56:
[ 426.312817] save_stack_
[ 426.312824] save_stack+
[ 426.312829] kasan_kmalloc+
[ 426.312834] __kmalloc+
[ 426.312840] intel_rdt_
[ 426.312846] cpuhp_invoke_
[ 426.312850] cpuhp_thread_
[ 426.312856] smpboot_
[ 426.312861] kthread+0x1b7/0x1e0
[ 426.312866] ret_from_
[ 426.317887] Freed by task 195:
[ 426.322879] save_stack_
[ 426.322887] save_stack+
[ 426.322891] kasan_slab_
[ 426.322896] kfree+0x94/0x1a0
[ 426.322902] intel_rdt_
[ 426.322908] cpuhp_invoke_
[ 426.322912] cpuhp_down_
[ 426.322917] cpuhp_thread_
[ 426.322925] smpboot_
[ 426.322929] kthread+0x1b7/0x1e0
[ 426.322935] ret_from_
[ 426.327837] The buggy address belongs to the object at ffff883ff7c1e780
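A report like the one above is easy to miss when the test run itself completes. This is a minimal sketch for picking such reports out of a log, assuming the log text arrives on standard input (for example from `dmesg`). `kasan_bug_types` is a hypothetical helper name, and the pattern relies only on the "BUG: KASAN: <type> in <function>" header format visible in the report above.

```shell
#!/bin/sh
# kasan_bug_types: read kernel log text on stdin and print the bug type
# from every "BUG: KASAN: <type> in <function>" header line.
# Lines that do not match the header format are ignored.
kasan_bug_types() {
    sed -n 's/.*BUG: KASAN: \([^ ]*\) in .*/\1/p'
}
```

Running `dmesg | kasan_bug_types | sort | uniq -c` after a test run lists each bug type with a count. It prints nothing if no KASAN report was logged.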
Joseph Salisbury (jsalisbury) wrote : | #52 |
On 01/16/2018 08:32 AM, Shankar, Ravi V wrote:
> Vikas on vacation until end of the month. Fenghua will look into this
> issue.
>
> On Jan 16, 2018, at 5:09 AM, Thomas Gleixner <<email address hidden>
> <mailto:<email address hidden>>> wrote:
>
>>
>> Vikas, Fenghua can you please look at that ASAP?
>>
>> On Sun, 14 Jan 2018, Thomas Gleixner wrote:
>>
>>> On Fri, 12 Jan 2018, Joseph Salisbury wrote:
>>>
>>>> Hi Vikas,
>>>>
>>>> A kernel bug report was opened against Ubuntu [0]. After a kernel
>>>> bisect, it was found that reverting the following commit resolved
>>>> this bug:
>>>>
>>>> commit 24247aeeabe99ea
>>>> Author: Vikas Shivappa <<email address hidden>
>>>> <mailto:<email address hidden>>>
>>>> Date: Tue Aug 15 18:00:43 2017 -0700
>>>>
>>>> x86/intel_rdt/cqm: Improve limbo list processing
>>>>
>>>>
>>>> The regression was introduced as of v4.14-rc1 and still exists with
>>>> current mainline. The trace with v4.15-rc7 is in comment #44[1].
>>>>
>>>> I was hoping to get your feedback, since you are the patch author. Do
>>>> you think gathering any additional data will help diagnose this issue,
>>>> or would it be best to submit a revert request?
>>>
>>> That stinks like a use after free. Can you run with KASAN enabled?
>>>
>>> Thanks,
>>>
>>> tglx
Here is some data with KASAN enabled:
https:/
Are there any specific logs you would like to see, or specific actions
executed?
Thanks,
Joe
tglx (tglx) wrote : | #53 |
On Tue, 16 Jan 2018, Joseph Salisbury wrote:
> On 01/16/2018 08:32 AM, Shankar, Ravi V wrote:
> > Vikas on vacation until end of the month. Fenghua will look into this
> > issue.
> >
> > On Jan 16, 2018, at 5:09 AM, Thomas Gleixner <<email address hidden>
> > <mailto:<email address hidden>>> wrote:
> >
> >>
> >> Vikas, Fenghua can you please look at that ASAP?
> >>
> >> On Sun, 14 Jan 2018, Thomas Gleixner wrote:
> >>
> >>> On Fri, 12 Jan 2018, Joseph Salisbury wrote:
> >>>
> >>>> Hi Vikas,
> >>>>
> >>>> A kernel bug report was opened against Ubuntu [0]. After a kernel
> >>>> bisect, it was found that reverting the following commit resolved
> >>>> this bug:
> >>>>
> >>>> commit 24247aeeabe99ea
> >>>> Author: Vikas Shivappa <<email address hidden>
> >>>> <mailto:<email address hidden>>>
> >>>> Date: Tue Aug 15 18:00:43 2017 -0700
> >>>>
> >>>> x86/intel_rdt/cqm: Improve limbo list processing
> >>>>
> >>>>
> >>>> The regression was introduced as of v4.14-rc1 and still exists with
> >>>> current mainline. The trace with v4.15-rc7 is in comment #44[1].
> >>>>
> >>>> I was hoping to get your feedback, since you are the patch author. Do
> >>>> you think gathering any additional data will help diagnose this issue,
> >>>> or would it be best to submit a revert request?
> >>>
> >>> That stinks like a use after free. Can you run with KASAN enabled?
> >>>
> >>> Thanks,
> >>>
> >>> tglx
>
>
> Here is some data with KASAN enabled:
> https:/
>
> Are there any specific logs you would like to see, or specific actions
> executed?
No, the KASAN output is pretty clear where the issue is.
Thanks,
tglx
Fenghua Yu (fyu) wrote : | #54 |
> From: Thomas Gleixner [mailto:<email address hidden>]
> On Tue, 16 Jan 2018, Joseph Salisbury wrote:
> > On 01/16/2018 08:32 AM, Shankar, Ravi V wrote:
> > > Vikas on vacation until end of the month. Fenghua will look into
> > > this issue.
> > >
> > > On Jan 16, 2018, at 5:09 AM, Thomas Gleixner <<email address hidden>
> > > <mailto:<email address hidden>>> wrote:
> > >
> > >>
> > >> Vikas, Fenghua can you please look at that ASAP?
> > >>
> > >> On Sun, 14 Jan 2018, Thomas Gleixner wrote:
> > >>
> > >>> On Fri, 12 Jan 2018, Joseph Salisbury wrote:
> > >>>
> > >>>> Hi Vikas,
> > >>>>
> > >>>> A kernel bug report was opened against Ubuntu [0]. After a
> > >>>> kernel bisect, it was found that reverting the following commit
> > >>>> resolved this bug:
> > >>>>
> > >>>> commit 24247aeeabe99ea
> > >>>> Author: Vikas Shivappa <<email address hidden>
> > >>>> <mailto:<email address hidden>>>
> > >>>> Date: Tue Aug 15 18:00:43 2017 -0700
> > >>>>
> > >>>> x86/intel_rdt/cqm: Improve limbo list processing
> > >>>>
> > >>>>
> > >>>> The regression was introduced as of v4.14-rc1 and still exists
> > >>>> with current mainline. The trace with v4.15-rc7 is in comment #44[1].
> > >>>>
> > >>>> I was hoping to get your feedback, since you are the patch
> > >>>> author. Do you think gathering any additional data will help
> > >>>> diagnose this issue, or would it be best to submit a revert request?
> > >>>
> > >>> That stinks like a use after free. Can you run with KASAN enabled?
> > >>>
> > >>> Thanks,
> > >>>
> > >>> tglx
> >
> >
> > Here is some data with KASAN enabled:
> > https:/
> hwe/+bug/
> > nts/51
> >
> > Are there any specific logs you would like to see, or specific actions
> > executed?
>
> No, the KASAN output is pretty clear where the issue is.
>
> Thanks,
>
> tglx
Is this a Haswell specific issue?
I ran the following test in an endless loop without issue on Broadwell and 4.15.0-rc6 with rdt mounted:
for ((;;)) do
for ((i=1;i<88;i++)) do
echo 0 >/sys/devices/
done
echo "online cpus:"
grep processor /proc/cpuinfo |wc
for ((i=1;i<88;i++)) do
echo 1 >/sys/devices/
done
echo "online cpus:"
grep processor /proc/cpuinfo|wc
done
I'm looking for a Haswell machine to reproduce the issue.
Thanks.
-Fenghua
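For readers who want to repeat this kind of test, here is a minimal sketch of one offline/online pass in POSIX shell. It assumes the standard Linux sysfs CPU hotplug layout, in which each hot-pluggable CPU has an `online` control file in its own per-CPU directory. On a typical system the root is `/sys/devices/system/cpu`; the path in the script above is truncated, so that value is an assumption rather than a reconstruction. The function names are hypothetical. CPU 0 is left alone, matching the loops above, which start at 1.

```shell
#!/bin/sh
# set_cpus ROOT VALUE
# Write VALUE (0 = offline, 1 = online) to ROOT/cpuN/online for every
# CPU except cpu0. ROOT is a parameter so the function can be tried
# against a scratch directory before touching the real sysfs tree.
set_cpus() {
    root=$1
    value=$2
    for ctl in "$root"/cpu[0-9]*/online; do
        [ -e "$ctl" ] || continue          # glob matched nothing
        if [ "$ctl" != "$root/cpu0/online" ]; then
            echo "$value" > "$ctl" || echo "failed to write $ctl" >&2
        fi
    done
}

# hotplug_pass ROOT
# Take every CPU except cpu0 offline, then bring them all back online.
hotplug_pass() {
    set_cpus "$1" 0
    set_cpus "$1" 1
}
```

Run as root, `hotplug_pass /sys/devices/system/cpu` performs a single pass; wrapping it in a loop and checking `dmesg` between passes approximates the test described in this thread.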
tglx (tglx) wrote : | #55 |
On Tue, 16 Jan 2018, Yu, Fenghua wrote:
> > From: Thomas Gleixner [mailto:<email address hidden>]
> Is this a Haswell specific issue?
>
> I run the following test forever without issue on Broadwell and 4.15.0-rc6 with rdt mounted:
> for ((;;)) do
> for ((i=1;i<88;i++)) do
> echo 0 >/sys/devices/
> done
> echo "online cpus:"
> grep processor /proc/cpuinfo |wc
> for ((i=1;i<88;i++)) do
> echo 1 >/sys/devices/
> done
> echo "online cpus:"
> grep processor /proc/cpuinfo|wc
> done
>
> I'm finding a Haswell to reproduce the issue.
Come on. This is crystal clear from the KASAN trace. And the fix is simple enough.
You simply do not run into it because on your machine
is_llc_
Thanks,
tglx
8<-----
diff --git a/arch/
index 88dcf8479013.
--- a/arch/
+++ b/arch/
@@ -525,10 +525,6 @@ static void domain_
*/
if (static_
rmdir_mondata_
- kfree(d->ctrl_val);
- kfree(d-
- kfree(d-
- kfree(d-
list_del(&d->list);
if (is_mbm_enabled())
cancel_
@@ -545,6 +541,10 @@ static void domain_
cancel_
}
+ kfree(d->ctrl_val);
+ kfree(d-
+ kfree(d-
+ kfree(d-
kfree(d);
return;
}
Joseph Salisbury (jsalisbury) wrote : | #56 |
On 01/16/2018 01:59 PM, Thomas Gleixner wrote:
> On Tue, 16 Jan 2018, Yu, Fenghua wrote:
>>> From: Thomas Gleixner [mailto:<email address hidden>]
>> Is this a Haswell specific issue?
>>
>> I run the following test forever without issue on Broadwell and 4.15.0-rc6 with rdt mounted:
>> for ((;;)) do
>> for ((i=1;i<88;i++)) do
>> echo 0 >/sys/devices/
>> done
>> echo "online cpus:"
>> grep processor /proc/cpuinfo |wc
>> for ((i=1;i<88;i++)) do
>> echo 1 >/sys/devices/
>> done
>> echo "online cpus:"
>> grep processor /proc/cpuinfo|wc
>> done
>>
>> I'm finding a Haswell to reproduce the issue.
> Come on. This is crystal clear from the KASAN trace. And the fix is simple enough.
>
> You simply do not run into it because on your machine
>
> is_llc_
>
> Thanks,
>
> tglx
>
> 8<-----
>
> diff --git a/arch/
> index 88dcf8479013.
> --- a/arch/
> +++ b/arch/
> @@ -525,10 +525,6 @@ static void domain_
> */
> if (static_
> rmdir_mondata_
> - kfree(d->ctrl_val);
> - kfree(d-
> - kfree(d-
> - kfree(d-
> list_del(&d->list);
> if (is_mbm_enabled())
> cancel_
> @@ -545,6 +541,10 @@ static void domain_
> cancel_
> }
>
> + kfree(d->ctrl_val);
> + kfree(d-
> + kfree(d-
> + kfree(d-
> kfree(d);
> return;
> }
Thanks, Thomas. I'll build some test kernels and have your patch tested
out.
Thanks,
Joe
Joseph Salisbury (jsalisbury) wrote : | #57 |
I built Artful and mainline test kernels with the patch from tglx. The test kernels can be downloaded from:
Artful: http://
mainline: http://
Can you test these kernels out and see if they resolve the bug?
Rod Smith (rodsmith) wrote : | #58 |
That seems to have fixed it! I've run the test script six or seven times on both kernels, with nary a hiccup (aside from the "error -19" messages with the 4.13 kernel). Below is the reported kernel information from both your builds, just to be sure I booted the correct kernels.
$ uname -a
Linux oil-boldore 4.13.0-25-generic #29~lp1733662Pa
$ uname -a
Linux oil-boldore 4.15.0-
Joseph Salisbury (jsalisbury) wrote : | #59 |
On 01/16/2018 01:59 PM, Thomas Gleixner wrote:
> On Tue, 16 Jan 2018, Yu, Fenghua wrote:
>>> From: Thomas Gleixner [mailto:<email address hidden>]
>> Is this a Haswell specific issue?
>>
>> I run the following test forever without issue on Broadwell and 4.15.0-rc6 with rdt mounted:
>> for ((;;)) do
>> for ((i=1;i<88;i++)) do
>> echo 0 >/sys/devices/
>> done
>> echo "online cpus:"
>> grep processor /proc/cpuinfo |wc
>> for ((i=1;i<88;i++)) do
>> echo 1 >/sys/devices/
>> done
>> echo "online cpus:"
>> grep processor /proc/cpuinfo|wc
>> done
>>
>> I'm finding a Haswell to reproduce the issue.
> Come on. This is crystal clear from the KASAN trace. And the fix is simple enough.
>
> You simply do not run into it because on your machine
>
> is_llc_
>
> Thanks,
>
> tglx
>
> 8<-----
>
> diff --git a/arch/
> index 88dcf8479013.
> --- a/arch/
> +++ b/arch/
> @@ -525,10 +525,6 @@ static void domain_
> */
> if (static_
> rmdir_mondata_
> - kfree(d->ctrl_val);
> - kfree(d-
> - kfree(d-
> - kfree(d-
> list_del(&d->list);
> if (is_mbm_enabled())
> cancel_
> @@ -545,6 +541,10 @@ static void domain_
> cancel_
> }
>
> + kfree(d->ctrl_val);
> + kfree(d-
> + kfree(d-
> + kfree(d-
> kfree(d);
> return;
> }
Hi Thomas,
Testing shows that your patch resolves the bug. Thanks
for the assistance! Is this something you could submit to mainline?
Thanks,
Joe
tglx (tglx) wrote : | #60 |
On Wed, 17 Jan 2018, Joseph Salisbury wrote:
> On 01/16/2018 01:59 PM, Thomas Gleixner wrote:
>
> Testing of your patch shows that your patch resolves the bug. Thanks
> for the assistance! Is this something you could submit to mainline?
Already there :)
Tagged for stable.
Thanks,
tglx
Joseph Salisbury (jsalisbury) wrote : | #61 |
On 01/17/2018 05:55 PM, Thomas Gleixner wrote:
> On Wed, 17 Jan 2018, Joseph Salisbury wrote:
>> On 01/16/2018 01:59 PM, Thomas Gleixner wrote:
>>
>> Testing of your patch shows that your patch resolves the bug. Thanks
>> for the assistance! Is this something you could submit to mainline?
> Already there :)
>
> https:/
>
> Tagged for stable.
>
> Thanks,
>
> tglx
Thanks so much!
no longer affects: | linux-hwe (Ubuntu) |
no longer affects: | linux-hwe (Ubuntu Artful) |
no longer affects: | linux-hwe (Ubuntu Bionic) |
Joseph Salisbury (jsalisbury) wrote : | #62 |
I built one last Artful test kernel with the patch tglx submitted to mainline. The test kernel can be downloaded from:
http://
Can you test this kernel and confirm it resolves the bug?
Rod Smith (rodsmith) wrote : | #63 |
I ran it half a dozen times with your latest kernel and it seemed fine, aside from the usual "error -19" messages. To be sure it's the right one, here's the kernel version information:
ubuntu@
Linux oil-boldore 4.13.0-25-generic #29~lp1733662Pa
Joseph Salisbury (jsalisbury) wrote : | #64 |
SRU request submitted for Artful and Bionic.
https:/
description: | updated |
Changed in linux (Ubuntu Bionic): | |
status: | In Progress → Fix Committed |
Changed in linux (Ubuntu Artful): | |
status: | In Progress → Fix Committed |
Per Allansson (per-allansson) wrote : | #65 |
I have similar issues on 16.04.4 with latest HWE kernel - and when double-checking against the source code I can see that this fix is now AWOL from:
linux-image-
Changed in linux (Ubuntu Artful): | |
status: | Fix Committed → In Progress |
Changed in linux (Ubuntu Artful): | |
status: | In Progress → Fix Committed |
Stefan Bader (smb) wrote : | #66 |
This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-
If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.
See https:/
tags: | added: verification-needed-artful |
Rod Smith (rodsmith) wrote : | #67 |
I've tested kernel 4.13.0-38-generic #43-Ubuntu from artful-proposed and the problem does not occur with that kernel.
tags: |
added: verification-done-artful
removed: verification-needed-artful |
Launchpad Janitor (janitor) wrote : | #68 |
This bug was fixed in the package linux - 4.13.0-38.43
---------------
linux (4.13.0-38.43) artful; urgency=medium
* linux: 4.13.0-38.43 -proposed tracker (LP: #1755762)
* Servers going OOM after updating kernel from 4.10 to 4.13 (LP: #1748408)
- i40e: Fix memory leak related filter programming status
- i40e: Add programming descriptors to cleaned_count
* [SRU] Lenovo E41 Mic mute hotkey is not responding (LP: #1753347)
- platform/x86: ideapad-laptop: Increase timeout to wait for EC answer
* fails to dump with latest kpti fixes (LP: #1750021)
- kdump: write correct address of mem_section into vmcoreinfo
* headset mic can't be detected on two Dell machines (LP: #1748807)
- ALSA: hda/realtek - Support headset mode for ALC215/
- ALSA: hda - Fix headset mic detection problem for two Dell machines
- ALSA: hda - Fix a wrong FIXUP for alc289 on Dell machines
* CIFS SMB2/SMB3 does not work for domain based DFS (LP: #1747572)
- CIFS: make IPC a regular tcon
- CIFS: use tcon_ipc instead of use_ipc parameter of SMB2_ioctl
- CIFS: dump IPC tcon in debug proc file
* i2c-thunderx: erroneous error message "unhandled state: 0" (LP: #1754076)
- i2c: octeon: Prevent error message on bus error
* hisi_sas: Add disk LED support (LP: #1752695)
- scsi: hisi_sas: directly attached disk LED feature for v2 hw
* EDAC, sb_edac: Backport 1 patch to Ubuntu 17.10 (Fix missing DIMM sysfs
entries with KNL SNC2/SNC4 mode) (LP: #1743856)
- EDAC, sb_edac: Fix missing DIMM sysfs entries with KNL SNC2/SNC4 mode
* [regression] Colour banding and artefacts appear system-wide on an Asus
Zenbook UX303LA with Intel HD 4400 graphics (LP: #1749420)
- drm/edid: Add 6 bpc quirk for CPT panel in Asus UX303LA
* DVB Card with SAA7146 chipset not working (LP: #1742316)
- vmalloc: fix __GFP_HIGHMEM usage for vmalloc_32 on 32b systems
* [Asus UX360UA] battery status in unity-panel is not changing when battery is
being charged (LP: #1661876) // AC adapter status not detected on Asus
ZenBook UX410UAK (LP: #1745032)
- ACPI / battery: Add quirk for Asus UX360UA and UX410UAK
* ASUS UX305LA - Battery state not detected correctly (LP: #1482390)
- ACPI / battery: Add quirk for Asus GL502VSK and UX305LA
* support thunderx2 vendor pmu events (LP: #1747523)
- perf pmu: Extract function to get JSON alias map
- perf pmu: Pass pmu as a parameter to get_cpuid_str()
- perf tools arm64: Add support for get_cpuid_str function.
- perf pmu: Add helper function is_pmu_core to detect PMU CORE devices
- perf vendor events arm64: Add ThunderX2 implementation defined pmu core
events
- perf pmu: Add check for valid cpuid in perf_pmu_
* lpfc.ko module doesn't work (LP: #1746970)
- scsi: lpfc: Fix loop mode target discovery
* Ubuntu 17.10 crashes on vmalloc.c (LP: #1739498)
- powerpc/
- powerpc/mm/slb: Move comment next to the code it's referring to
- powerpc/mm/hash64: Make vmalloc 56T on hash
* ethtool -p fails to light NIC LED on HiSilicon D05 systems (LP: #1748567)
- net...
Changed in linux (Ubuntu Artful): | |
status: | Fix Committed → Fix Released |
Changed in linux (Ubuntu): | |
status: | Fix Committed → Fix Released |
tags: | removed: hwcert-server |
I've discovered what may be the same bug on another system -- feebas, a Cisco UCS C220 M4 (Intel Series v3), with the same CPU type (Intel Xeon E5-2640 v3). I'm attaching dmesg output from it, but on this particular run, the computer did not hang indefinitely, although it did become unresponsive for a few seconds.