mlx4 not recovering from EEH in Ubuntu 15.04 (Mellanox)
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
linux (Ubuntu) | Fix Released | High | Leann Ogasawara | |
Bug Description
---Problem Description---
EEH recovery does not work with the mlx4 driver: when the driver tries to recover, it hits another EEH error.
---uname output---
Linux ubuntu 3.18.0-12-generic #13 SMP Mon Feb 9 16:31:42 CST 2015 ppc64le ppc64le ppc64le GNU/Linux
---Additional Hardware Info---
Requires a Mellanox adapter such as a ConnectX-3.
Machine Type = P8
---Steps to Reproduce---
Inject an EEH error into the mlx4 device.
Stack trace output:
During EEH recovery, the kernel hits the following:
[ 188.747571] EEH: Collect temporary log
[ 188.748330] EEH: of node=/pci@
[ 188.748339] EEH: PCI device/vendor: 100715b3
[ 188.748361] EEH: PCI cmd/status register: 00100146
[ 188.748362] EEH: PCI-E capabilities and status follow:
[ 188.748459] EEH: PCI-E 00: 00020010 10008e02 0001200e 0843f483
[ 188.748537] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
[ 188.748539] EEH: PCI-E 20: 00000000
[ 188.748540] EEH: PCI-E AER capability register set follows:
[ 188.748625] EEH: PCI-E AER 00: 00020001 00000000 00000000 00062010
[ 188.748704] EEH: PCI-E AER 10: 00002000 00002000 000001e0 00000000
[ 188.748783] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 188.748805] EEH: PCI-E AER 30: 00000000 00000000
[ 188.748813] EEH: Reset without hotplug activity
[ 193.833245] EEH: Notify device drivers the completion of reset
[ 193.833257] mlx4_core: Initializing 0001:00:03.0
[ 193.833317] mlx4_core 0001:00:03.0: BAR 0: can't reserve [mem 0x170b0000000-
[ 193.833321] mlx4_core 0001:00:03.0: Couldn't get PCI resources, aborting
[ 193.833395] EEH: Not recovered
[ 193.833397] EEH: Unable to recover from failure from PHB#1-PE#1.
Please try reseating or replacing it
[ 193.834531] EEH: of node=/pci@
[ 193.834547] EEH: PCI device/vendor: 100715b3
[ 193.834580] EEH: PCI cmd/status register: 00100142
[ 193.834582] EEH: PCI-E capabilities and status follow:
[ 193.834728] EEH: PCI-E 00: 00020010 10008e02 0000200e 0843f483
[ 193.834846] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
[ 193.834849] EEH: PCI-E 20: 00000000
[ 193.834850] EEH: PCI-E AER capability register set follows:
[ 193.834981] EEH: PCI-E AER 00: 00020001 00000000 00000000 00062010
[ 193.835101] EEH: PCI-E AER 10: 00002000 00002000 000001e0 00000000
[ 193.835219] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 193.835252] EEH: PCI-E AER 30: 00000000 00000000
[ 193.835289] Unable to handle kernel paging request for data at address 0x00000388
[ 193.835356] Faulting instruction address: 0xd000000001f3231c
[ 193.835415] Oops: Kernel access of bad area, sig: 11 [#1]
[ 193.835460] SMP NR_CPUS=2048 NUMA pSeries
[ 193.835509] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_
[ 193.835886] CPU: 6 PID: 50 Comm: eehd Not tainted 3.18.0-12-generic #13
[ 193.835942] task: c0000003f72ca880 ti: c0000003f707c000 task.ti: c0000003f707c000
[ 193.836009] NIP: d000000001f3231c LR: d000000001f32790 CTR: d000000001f32760
[ 193.836076] REGS: c0000003f707f790 TRAP: 0300 Not tainted (3.18.0-12-generic)
[ 193.836141] MSR: 8000000100009033 <SF,EE,
[ 193.836302] CFAR: c0000000000a7be0 DAR: 0000000000000388 DSISR: 40000000 SOFTE: 1
GPR00: d000000001f32790 c0000003f707fa10 d000000001f66310 c0000003fe0ad000
GPR04: 0000000000000003 0000000000000000 0000000000000000 c0000003fd000000
GPR08: 0000000000000001 d000000001f32760 00000000fffffffa 0000000100001001
GPR12: d000000001f32760 c00000000fb83600 c0000000000d9118 c0000003f90e56c0
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000c4ab90
GPR24: c000000000c4ab68 0000000000100100 c0000003fe068580 c0000003fe068580
GPR28: c0000003fe0ad000 c0000003fe0685e0 d000000001f5da50 0000000000000000
[ 193.837205] NIP [d000000001f3231c] mlx4_unload_
[ 193.837269] LR [d000000001f32790] mlx4_pci_
[ 193.837336] Call Trace:
[ 193.837361] [c0000003f707fa10] [c0000003fe068580] 0xc0000003fe068580 (unreliable)
[ 193.837447] [c0000003f707faa0] [d000000001f32790] mlx4_pci_
[ 193.837528] [c0000003f707fae0] [c00000000003ac64] eeh_report_
[ 193.837606] [c0000003f707fb10] [c0000000000393b4] eeh_pe_
[ 193.837685] [c0000003f707fba0] [c00000000003b148] eeh_handle_
[ 193.837764] [c0000003f707fc20] [c00000000003b6b4] eeh_handle_
[ 193.837832] [c0000003f707fcd0] [c00000000003bae4] eeh_event_
[ 193.837911] [c0000003f707fd80] [c0000000000d9220] kthread+0x110/0x130
[ 193.837980] [c0000003f707fe30] [c000000000009568] ret_from_
[ 193.838057] Instruction dump:
[ 193.838094] fb41ffd0 fb61ffd8 fb81ffe0 fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ff71
[ 193.838217] 7c7c1b78 48000008 e8410018 ebfc0138 <813f0388> 2f890000 409e020c e93f0008
[ 193.838341] ---[ end trace 7cd21329722bcbd1 ]---
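To make the oops easier to read, here is a minimal sketch that decodes the faulting instruction word marked in the instruction dump (`<813f0388>`) and the word just before it (`ebfc0138`). It assumes the standard Power ISA D-form (`lwz`) and DS-form (`ld`) field layouts; the helper names are hypothetical and handle only these two opcodes. The register values are copied from the GPR dump above.

```python
# Decode the load instructions around the fault in the oops above.
# Field layout follows the Power ISA D-form (lwz) and DS-form (ld) encodings.
# Helper names are illustrative only; they are not part of any kernel tool.

def sign_extend_16(value):
    """Interpret a 16-bit field as a signed displacement."""
    return value - 0x10000 if value & 0x8000 else value

def decode_load(word):
    """Decode lwz (opcode 32) or ld (opcode 58 with low two bits 00)."""
    opcode = word >> 26
    rt = (word >> 21) & 0x1F   # target register
    ra = (word >> 16) & 0x1F   # base register
    if opcode == 32:
        return ("lwz", rt, ra, sign_extend_16(word & 0xFFFF))
    if opcode == 58 and (word & 0x3) == 0:
        return ("ld", rt, ra, sign_extend_16(word & 0xFFFC))
    raise ValueError(f"unsupported instruction word {word:#010x}")

# Values copied from the oops: GPR31 from the register dump, DAR from the
# CFAR/DAR line, and the two instruction words from the instruction dump.
GPR31 = 0x0000000000000000
DAR = 0x0000000000000388

prev_insn = decode_load(0xEBFC0138)   # word before the marked one
fault_insn = decode_load(0x813F0388)  # word marked <...> in the dump

_, _, base_reg, disp = fault_insn
effective_address = GPR31 + disp      # base register of the faulting load is r31

print("previous:", prev_insn)
print("faulting:", fault_insn)
print("effective address:", hex(effective_address),
      "matches DAR:", effective_address == DAR)
```

Under these assumptions the two words decode as `ld r31, 0x138(r28)` followed by `lwz r9, 0x388(r31)`. With GPR31 equal to zero in the register dump, the second load targets address 0x388, which is the DAR value reported. That is consistent with a NULL pointer being read from offset 0x138 of the structure in r28 and then dereferenced, though the log alone does not say which structure field is involved.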
There is a series of patches at this link that should resolve this issue:
http://
I applied these to the upstream kernel and recovery works there, but let me double-check against the Ubuntu 15.04 kernel whether these are the patches we need to resolve this bug.
I used the Ubuntu 15.04 kernel 3.18.0-13.14.
To make EEH work, just to be able to apply the first two patches of that series, I first had to apply all of these patches:
From ca9f9f703950e5c
From: Amir Vadai <email address hidden>
Date: Tue, 25 Feb 2014 18:17:52 +0200
Subject: net/mlx4_en: Fix bad use of dev_id
From adbc7ac5c15eb5e
From: Saeed Mahameed <email address hidden>
Date: Mon, 27 Oct 2014 11:37:37 +0200
Subject: net/mlx4_core: Introduce ACCESS_REG CMD and eth_prot_ctrl dev cap
From a53e3e8c1db5479
From: Saeed Mahameed <email address hidden>
Date: Mon, 27 Oct 2014 11:37:38 +0200
Subject: net/mlx4_core: Add ethernet backplane autoneg device capability
From d475c95b4bcff98
From: Matan Barak <email address hidden>
Date: Sun, 2 Nov 2014 16:26:17 +0200
Subject: net/mlx4_core: Add retrieval of CONFIG_DEV parameters
From dd65beac48a5259
From: Shani Michaeli <email address hidden>
Date: Sun, 9 Nov 2014 13:51:52 +0200
Subject: net/mlx4_en: Extend usage of napi_gro_frags
From f8c6455bb04b944
From: Shani Michaeli <email address hidden>
Date: Sun, 9 Nov 2014 13:51:53 +0200
Subject: net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE
From ffc39f6d6fff287
From: Matan Barak <email address hidden>
Date: Thu, 13 Nov 2014 14:45:29 +0200
Subject: net/mlx4_core: Refactor mlx4_cmd_init and mlx4_cmd_cleanup
From a0eacca948d2d45
From: Matan Barak <email address hidden>
Date: Thu, 13 Nov 2014 14:45:30 +0200
Subject: net/mlx4_core: Refactor mlx4_load_one
From e8c4265bea8437f
From: Matan Barak <email address hidden>
Date: Thu, 13 Nov 2014 14:45:31 +0200
Subject: net/mlx4_core: Add QUERY_FUNC firmware command
From 7ae0e400cd9396c
From: Matan Barak <email address hidden>
Date: Thu, 13 Nov 2014 14:45:32 +0200
Subject: net/mlx4_core: Flexible (asymmetric) allocation of EQs and MSI-X vectors for PF/VFs
From da315679e806350
From: Matan Barak <email address hidden>
Date: Sun, 14 Dec 2014 16:18:04 +0200
Subject: net/mlx4_core: Fixed memory leak and incorrect refcount in
With those patches in place, I can apply the following from the series I pointed to:
==> 0001-net-
From 872bf2fb69d90e3
From: Yishai Hadas <email address hidden>
Date: Sun, 25 Jan 2015 16:59:35 +0200
Subject: net/mlx4_core: Maintain a persistent memory for mlx4 device
==> 0002-net-
From dd0eefe3abbf474
From: Yishai Hadas <email address hidden>
Date: Sun, 25 Jan 2015 16:59:36 +0200
Subject: net/mlx4_core: Set device configuration data to be persistent across reset
==> 0003-net-
From ad9a0bf08ffbf32
From: Yishai Hadas <email address hidden>
Date: Sun, 25 Jan 2015 16:59:37 +0200
Subject: net/mlx4_core: Refactor the catas flow to work per device
==> 0004-net-
From f6bc11e42646e66
From: Yishai Hadas <email address hidden>
Date: Sun, 25 Jan 2015 16:59:38 +0200
Subject: net/mlx4_core: Enhance the catas flow to support device reset
==> 0005-net-
From f5aef5aa35063f2
From: Yishai Hadas <email address hidden>
Date: Sun, 25 Jan 2015 16:59:39 +0200
Subject: net/mlx4_core: Activate reset flow upon fatal command cases
==> 0006-net-
From c69453e294c9f16
From: Yishai Hadas <email address hidden>
Date: Sun, 25 Jan 2015 16:59:40 +0200
Subject: net/mlx4_core: Manage interface state for Reset flow cases
==> 0007-net-
From 2ba5fbd62b25343
From: Yishai Hadas <email address hidden>
Date: Sun, 25 Jan 2015 16:59:41 +0200
Subject: net/mlx4_core: Handle AER flow properly
But to apply the whole series, including SRIOV EEH, I need these extra patches:
==> 0008-g-mlx4.patch <==
From 225c6c8c6bbbc32
From: Matan Barak <email address hidden>
Date: Thu, 13 Nov 2014 14:45:28 +0200
Subject: net/mlx4_core: Use correct variable type for mlx4_slave_cap
==> 0008-l-mlx4.patch <==
From de966c5928026b1
From: Matan Barak <email address hidden>
Date: Thu, 13 Nov 2014 14:45:33 +0200
Subject: net/mlx4_core: Support more than 64 VFs
==> 0008-m-mlx4.patch <==
From 383677da43fa83b
From: Or Gerlitz <email address hidden>
Date: Thu, 11 Dec 2014 10:57:52 +0200
Subject: net/mlx4_core: Mask out host side virtualization features for guests
==> 0008-net-
From 55ad359225b2232
From: Yishai Hadas <email address hidden>
Date: Sun, 25 Jan 2015 16:59:42 +0200
Subject: net/mlx4_core: Enable device recovery flow with SRIOV
==> 0008-n-mlx4.patch <==
From ddae0349fdb78bc
From: Eugenia Emantayev <email address hidden>
Date: Thu, 11 Dec 2014 10:57:54 +0200
Subject: net/mlx4: Change QP allocation scheme
==> 0008-o-mlx4.patch <==
From 431df8c7e970843
From: Matan Barak <email address hidden>
Date: Thu, 11 Dec 2014 10:57:59 +0200
Subject: net/mlx4: Refactor QUERY_PORT
==> 0008-p-mlx4.patch <==
From ab256e5ad02b369
From: Dotan Barak <email address hidden>
Date: Thu, 11 Dec 2014 10:57:55 +0200
Subject: net/mlx4: Add a check if there are too many reserved QPs
==> 0008-r-mlx4.patch <==
From d57febe1a47801e
From: Matan Barak <email address hidden>
Date: Thu, 11 Dec 2014 10:57:57 +0200
Subject: net/mlx4: Add A0 hybrid steering
==> 0008-s-mlx4.patch <==
From 7d077cd34eabb2f
From: Matan Barak <email address hidden>
Date: Thu, 11 Dec 2014 10:58:00 +0200
Subject: net/mlx4: Add support for A0 steering
==> 0008-z-mlx4.patch <==
From 7a89399ffad7b7c
From: Matan Barak <email address hidden>
Date: Thu, 11 Dec 2014 10:57:56 +0200
Subject: net/mlx4: Add mlx4_bitmap zone allocator
Then I can apply these:
From 55ad359225b2232
From: Yishai Hadas <email address hidden>
Date: Sun, 25 Jan 2015 16:59:42 +0200
Subject: net/mlx4_core: Enable device recovery flow with SRIOV
==> 0009-net-
From 0cd9302734111ab
From: Yishai Hadas <email address hidden>
Date: Sun, 25 Jan 2015 16:59:43 +0200
Subject: net/mlx4_core: Reset flow activation upon SRIOV fatal command cases
So basically, applying the series will require a lot of patches, and the driver will probably need to be retested.
tags: | added: architecture-ppc64le bugnameltc-121681 severity-high targetmilestone-inin1504 |
affects: | ubuntu → linux (Ubuntu) |
tags: | added: kernel-da-key |
Changed in linux (Ubuntu): | |
assignee: | nobody → Leann Ogasawara (leannogasawara) |
importance: | Undecided → High |
status: | New → In Progress |
Changed in linux (Ubuntu): | |
status: | In Progress → Fix Committed |
Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.
To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1422481/+editstatus and add the package name in the text box next to the word Package.
[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]