[Ubuntu 1804][boston][ixgbe] EEH causes kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/drivers/pci/msi.c:352 (i2S)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
The Ubuntu-power-systems project | Fix Released | High | Canonical Kernel Team |
linux (Ubuntu) | Fix Released | High | Canonical Kernel Team |
Bionic | Fix Released | High | Canonical Kernel Team |
Bug Description
== Comment: #0 - ABDUL HALEEM <> - 2018-02-16 08:26:15 ==
Problem:
------------
Injecting error multiple times causes kernel crash.
echo 0x0:1:4:
EEH: PHB#0 failure detected, location: N/A
EEH: PHB#0-PE#0 has failed 6 times in the
last hour and has been permanently disabled.
EEH: Unable to recover from failure from PHB#0-PE#0.
Please try reseating or replacing it
ixgbe 0000:01:00.1: Adapter removed
kernel BUG at /build/
Oops: Exception in kernel mode, sig: 5 [#1]
LE SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: rpcsec_gss_krb5 nfsv4 nfs fscache joydev input_leds mac_hid idt_89hpesx ofpart ipmi_powernv cmdlinepart ipmi_devintf ipmi_msghandler at24 powernv_flash mtd opal_prd ibmpowernv uio_pdrv_genirq vmx_crypto uio sch_fq_codel nfsd auth_rpcgss nfs_acl lockd grace sunrpc ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_
CPU: 28 PID: 972 Comm: eehd Not tainted 4.15.0-10-generic #11-Ubuntu
NIP: c00000000077f080 LR: c00000000077f070 CTR: c0000000000aac30
REGS: c000000ff1deb5a0 TRAP: 0700 Not tainted (4.15.0-10-generic)
MSR: 9000000000029033 <SF,HV,
CFAR: c00000000018bddc SOFTE: 1
GPR00: c00000000077f070 c000000ff1deb820 c0000000016ea600 c000000fbb5fac00
GPR04: 00000000000002c5 0000000000000000 0000000000000000 0000000000000000
GPR08: c000000fbb5fac00 0000000000000001 c000000fec617a00 c000000fdfd86488
GPR12: 0000000000000040 c000000007a33400 c000000000138be8 c000000ff90ec1c0
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000f48d10
GPR24: c000000000f48ce8 c000200e4fcf4000 c000000fc6900b18 c000200e4fcf4000
GPR28: c000200e4fcf4288 c008000010624480 0000000000000000 c000000fbb633ea0
NIP [c00000000077f080] free_msi_
LR [c00000000077f070] free_msi_
Call Trace:
[c000000ff1deb820] [c00000000077f070] free_msi_
[c000000ff1deb880] [c00000000077fa68] pci_disable_
[c000000ff1deb8c0] [c00800001060b5c8] ixgbe_reset_
[c000000ff1deb8f0] [c0080000105d52f4] ixgbe_remove+
[c000000ff1deb990] [c0000000007670ec] pci_device_
[c000000ff1deb9d0] [c00000000085d194] device_
[c000000ff1deba20] [c00000000075b398] pci_stop_
[c000000ff1deba60] [c00000000075b588] pci_stop_
[c000000ff1deba90] [c00000000005e1d0] pci_hp_
[c000000ff1debb20] [c00000000005e184] pci_hp_
[c000000ff1debbb0] [c00000000003ec04] eeh_handle_
[c000000ff1debc60] [c00000000003f160] eeh_handle_
[c000000ff1debd10] [c00000000003f830] eeh_event_
[c000000ff1debdc0] [c000000000138d88] kthread+0x1a8/0x1b0
[c000000ff1debe30] [c00000000000b528] ret_from_
Instruction dump:
419effe0 3bc00000 4800000c 60420000 807f0010 7c7e1a14 78630020 4ba0cd3d
60000000 e9430158 312affff 7d295110 <0b090000> 813f0014 395e0001 7d5e07b4
---[ end trace 23c446a470e60864 ]---
ixgbe 0000:01:00.0: Adapter removed
Sending IPI to other CPUs
OPAL: Switch to big-endian OS
OPAL: Switch to little-endian OS
PHB#0000[0:0]: eeh_freeze_clear on fenced PHB
---uname output---
Linux ltciofvtr-bostonlc1 4.15.0-10-generic #11-Ubuntu SMP Tue Feb 13 18:21:52 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
Machine Type = Boston-LC
0000:00:00.0 PCI bridge [0604]: IBM Device [1014:04c1]
0000:01:00.0 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
0000:01:00.1 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
# ethtool -i enp1s0f0
driver: ixgbe
version: 5.1.0-k
firmware-version: 0x800006da
expansion-
bus-info: 0000:01:00.0
supports-
supports-test: yes
supports-
supports-
supports-
Userspace tool common name: EEH
== Comment: #6 - Mauro Rodrigues <> - 2018-03-19 11:54:03 ==
Although it probably will not be accepted as is, I'll send a solution upstream.
The long story short: we add ixgbe_free_irq right before the ixgbe_clear_
That created a side effect: the change targets the hotplug remove path, but with the patch applied, the usual removal path (for instance, an unbind via sysfs) frees the interrupt twice.
To avoid that, I'll send a patch that integrates the free_irq into the clear interrupt scheme code path.
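
The trade-off described in this comment can be illustrated with a small toy model. This is only a sketch under stated assumptions, not driver code and not the patch that was submitted: `ToyAdapter`, `remove_original`, `remove_naive_fix`, `remove_integrated_fix`, and `outcome` are hypothetical names, and the "handler still attached" check is my assumption about what the BUG at msi.c:352 asserts. The guard used in `remove_integrated_fix` is likewise an illustrative choice rather than the merged implementation.

```python
class ToyAdapter:
    """Hypothetical stand-in for a NIC adapter; not the real ixgbe structures."""

    def __init__(self, num_vectors=4):
        # Vectors that currently have an interrupt handler attached.
        self.handlers = set(range(num_vectors))
        self.msix_enabled = True

    def free_irq(self):
        # Strict on purpose: freeing when nothing is attached is treated
        # as a double free, mirroring the side effect described above.
        if not self.handlers:
            raise RuntimeError("double free of IRQ handlers")
        self.handlers.clear()

    def disable_msix(self):
        # Assumed model of the check that fired: MSI-X must not be torn
        # down while any vector still has a handler attached.
        if self.handlers:
            raise RuntimeError("BUG: vector still has a handler attached")
        self.msix_enabled = False

    def close(self):
        # A normal interface shutdown frees the handlers itself.
        self.free_irq()


def remove_original(adapter):
    # Before any fix: tear down the interrupt scheme without freeing IRQs.
    adapter.disable_msix()


def remove_naive_fix(adapter):
    # First attempt: always free the IRQs right before the teardown.
    adapter.free_irq()
    adapter.disable_msix()


def remove_integrated_fix(adapter):
    # Illustrative guard: free only if handlers are still attached, so
    # both removal paths can share the same teardown code.
    if adapter.handlers:
        adapter.free_irq()
    adapter.disable_msix()


def outcome(remove, closed_first):
    """Run one removal variant; closed_first=True models the sysfs unbind path."""
    adapter = ToyAdapter()
    if closed_first:
        adapter.close()
    try:
        remove(adapter)
    except RuntimeError as exc:
        return str(exc)
    return "ok"
```

In this model, `outcome(remove_original, closed_first=False)` reproduces the crash on the error-recovery path, `outcome(remove_naive_fix, closed_first=True)` reproduces the double free on the usual removal path, and `remove_integrated_fix` succeeds in both cases.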
== Comment: #8 - Mauro Rodrigues <> - 2018-04-18 12:23:34 ==
Waiting for upstream feedback at:
http://
which reads "ixgbe: Fix free irq process when removing device due to PCI Errors"
== Comment: #9 - Mauro Rodrigues <> - 2018-05-03 11:56:49 ==
v3 of the patch is going through Intel's queue for further testing:
http://
which reads: "ixgbe/ixgbevf: Free IRQ when PCI error recovery removes the device"
== Comment: #11 - Mauro Rodrigues <> - 2018-06-11 10:06:35 ==
This was merged into Torvalds' tree last week; I didn't notice until now.
https:/
which reads:
"ixgbe/ixgbevf: Free IRQ when PCI error recovery removes the device"
I'll submit it to the Canonical ML today.
Changed in ubuntu-power-systems: | |
status: | New → Triaged |
importance: | Undecided → High |
assignee: | nobody → Canonical Kernel Team (canonical-kernel-team) |
tags: | added: triage-g |
Changed in linux (Ubuntu): | |
status: | New → Triaged |
importance: | Undecided → High |
Changed in linux (Ubuntu): | |
assignee: | Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Canonical Kernel Team (canonical-kernel-team) |
Changed in linux (Ubuntu): | |
status: | Triaged → Fix Committed |
Changed in linux (Ubuntu Bionic): | |
status: | New → Fix Committed |
Changed in ubuntu-power-systems: | |
status: | Triaged → Fix Committed |
Changed in linux (Ubuntu Bionic): | |
assignee: | nobody → Canonical Kernel Team (canonical-kernel-team) |
importance: | Undecided → High |
Changed in ubuntu-power-systems: | |
status: | Fix Committed → Fix Released |
tags: | added: cscc |