Intel E810 NICs driver in causing hangs when booting and bonds configured

Bug #2004262 reported by Jeff Hillman
This bug affects 10 people
Affects         Status        Importance  Assigned to               Milestone
linux (Ubuntu)  Confirmed     High        Heitor Alves de Siqueira
Jammy           Fix Released  High        Heitor Alves de Siqueira
Kinetic         Fix Released  High        Heitor Alves de Siqueira
Lunar           Confirmed     High        Heitor Alves de Siqueira

Bug Description

[Impact]
  * Intel E810-family NICs cause system hangs when booting with bonding enabled
  * This happens due to the driver unplugging auxiliary devices
  * The unplug event happens under RTNL lock context, which causes a deadlock: the RDMA driver waits for the RTNL lock in order to complete removal

[Test Plan]
  * Users have reported that after setting up bonding on switch and server side, the system will hang when starting network services

[Fix]
  * The upstream patch defers unplugging/re-plugging of the auxiliary device, so that it's not performed under the RTNL lock context.
  * Fix was introduced by commit:
      248401cb2c46 ice: avoid bonding causing auxiliary plug/unplug under RTNL lock
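
The deferral described in [Fix] can be illustrated with a small user-space model. This is only a sketch, not the driver code: `rtnl`, `deferred`, `unplug_aux_dev()`, `bonding_event_deferred()` and `service_task()` are hypothetical Python stand-ins, and the real change is the one in commit 248401cb2c46. It assumes the removal path needs the same lock that the bonding event already holds, which is the deadlock described in [Impact].

```python
import queue
import threading

rtnl = threading.Lock()      # stand-in for the kernel's RTNL lock (assumption)
deferred = queue.Queue()     # stand-in for a deferred-work queue (assumption)
events = []                  # records what happened, in order


def unplug_aux_dev():
    """Hypothetical unplug: the removal path needs the lock itself."""
    # Called while the caller still holds `rtnl`, this acquire could never
    # succeed; the timeout here only keeps the model from hanging forever.
    if not rtnl.acquire(timeout=1.0):
        events.append("unplug: timed out waiting for rtnl")
        return
    try:
        events.append("unplug: done")
    finally:
        rtnl.release()


def bonding_event_deferred():
    """Fixed shape: under the lock, only record that an unplug is needed."""
    with rtnl:
        events.append("bonding event: unplug requested")
        deferred.put(unplug_aux_dev)


def service_task():
    """Worker that runs deferred jobs later, without holding the lock."""
    while True:
        job = deferred.get()
        if job is None:      # sentinel: stop the worker
            break
        job()


worker = threading.Thread(target=service_task)
worker.start()
bonding_event_deferred()
deferred.put(None)
worker.join()
print(events)
```

If `bonding_event_deferred()` called `unplug_aux_dev()` directly while still holding `rtnl`, the acquire would time out in this model; with no timeout, as in the kernel, it would hang. Queuing the job and running it from the worker lets the lock be released first.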

[Regression Potential]
  * Regressions would manifest in devices that support RDMA functionality and
    have been added to a bond
  * We should look out for auxiliary devices that haven't been properly
    unplugged, or that cause further issues with
    ice_plug_aux_dev()/ice_unplug_aux_dev()

[Original Description]
jammy 22.04.1
linux-image-generic 5.15.0-58-generic
Intel E810-XXV Dual Port NICs in Dell PowerEdge R650

- 5.15 in jammy -> reproducible
- 5.19 in hwe-edge -> reproducible
- 6.2.rc6 in the mainline build -> works
- Intel's ice driver 1.10.1.2.2 -> works

After bonding is enabled on both the switch and the server side, the system hangs while Ubuntu is initializing. The kernel loads, but around the point where Network Services start the system can hang, sometimes for 5 minutes and in other cases indefinitely.

The message of:

"echo 0 > /proc/sys/kernel/hung_task_timeout_secs"  systemd-resolve blocked for more than 120 seconds

appears, and eventually Network Services just keeps attempting to start and never does. This happens with or without DHCP enabled.
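
To pull such hung-task reports out of a saved log, something like the following may help. It is a minimal sketch that assumes the usual shape of the kernel's report line ("task NAME:PID blocked for more than N seconds"); `find_hung_tasks()` is a hypothetical helper, and the sample text, including the PID and timestamps, is made up for illustration.

```python
import re

# Shape of the kernel's hung-task report line; the task name and PID vary.
HUNG_TASK = re.compile(
    r"task (?P<name>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return (task name, pid, seconds) for each hung-task line in log_text."""
    return [
        (m.group("name"), int(m.group("pid")), int(m.group("secs")))
        for m in HUNG_TASK.finditer(log_text)
    ]


# Illustrative sample only: the PID and timestamps are invented.
sample = """\
[  245.1] INFO: task systemd-resolve:1234 blocked for more than 120 seconds.
[  245.1] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
"""
print(find_hung_tasks(sample))
```

The input could be the kernel log of an affected boot, for example the saved output of `journalctl -k -b -1`, assuming persistent journaling was enabled on that machine.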

Tried this same setup with the hwe-22.04, hwe-20.04, hwe-22.04-edge and linux-oem kernels, and all exhibit the same failure.

To work around this, installing version 1.10.1.2.2 of the Intel 'ice' driver works. The system doesn't even remotely hang at startup and all networking functions remain working (ping, DNS, general accessibility).

The driver can be found at https://downloadmirror.intel.com/763930/ice-1.10.1.2.2.tar.gz
---
ProblemType: Bug
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jan 31 13:08 seq
 crw-rw---- 1 root audio 116, 33 Jan 31 13:08 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.11-0ubuntu82.3
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: N/A
CasperMD5json:
 {
   "result": "skip"
 }
DistroRelease: Ubuntu 22.04
InstallationDate: Installed on 2023-01-27 (3 days ago)
InstallationMedia: Ubuntu-Server 22.04.1 LTS "Jammy Jellyfish" - Release amd64 (20220809)
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
MachineType: Dell Inc. PowerEdge R650
Package: linux (not installed)
PciMultimedia:

ProcFB: 0 mgag200drmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-58-generic root=UUID=668aab7c-abe9-434b-a810-acc6eab76cbc ro fsck.mode=skip
ProcVersionSignature: Ubuntu 5.15.0-58.64-generic 5.15.74
RelatedPackageVersions:
 linux-restricted-modules-5.15.0-58-generic N/A
 linux-backports-modules-5.15.0-58-generic N/A
 linux-firmware 20220329.git681281e4-0ubuntu3.9
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
Tags: jammy uec-images
Uname: Linux 5.15.0-58-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: N/A
_MarkForUpload: True
dmi.bios.date: 09/14/2022
dmi.bios.release: 1.8
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 1.8.2
dmi.board.name: 0PJ7YJ
dmi.board.vendor: Dell Inc.
dmi.board.version: A01
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr1.8.2:bd09/14/2022:br1.8:svnDellInc.:pnPowerEdgeR650:pvr:rvnDellInc.:rn0PJ7YJ:rvrA01:cvnDellInc.:ct23:cvr:skuSKU=0912;ModelName=PowerEdgeR650:
dmi.product.family: PowerEdge
dmi.product.name: PowerEdge R650
dmi.product.sku: SKU=0912;ModelName=PowerEdge R650
dmi.sys.vendor: Dell Inc.

Revision history for this message
Jeff Hillman (jhillman) wrote :

subscribed field-high

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 2004262

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Dimitri John Ledkov (xnox) wrote : Re: Intel E810 NICs driver in causinghangs when booting and bonds configured

Which driver is used by untainted generic kernels for the device in question? Can you attach sosreport, dmesg, etc?

Revision history for this message
Jeff Hillman (jhillman) wrote :

Testing the mainline kernel, 6.2.rc6, it has no issues and appears to boot fine after several attempts.

Revision history for this message
Jeff Hillman (jhillman) wrote : CurrentDmesg.txt

apport information

tags: added: apport-collected jammy uec-images
description: updated
Revision history for this message
Jeff Hillman (jhillman) wrote : HookError_ubuntu.txt

apport information

Revision history for this message
Jeff Hillman (jhillman) wrote : Lspci.txt

apport information

Revision history for this message
Jeff Hillman (jhillman) wrote : Lspci-vt.txt

apport information

Revision history for this message
Jeff Hillman (jhillman) wrote : Lsusb.txt

apport information

Revision history for this message
Jeff Hillman (jhillman) wrote : Lsusb-t.txt

apport information

Revision history for this message
Jeff Hillman (jhillman) wrote : Lsusb-v.txt

apport information

Revision history for this message
Jeff Hillman (jhillman) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
Jeff Hillman (jhillman) wrote : ProcCpuinfoMinimal.txt

apport information

Revision history for this message
Jeff Hillman (jhillman) wrote : ProcEnviron.txt

apport information

Revision history for this message
Jeff Hillman (jhillman) wrote : ProcInterrupts.txt

apport information

Revision history for this message
Jeff Hillman (jhillman) wrote : ProcModules.txt

apport information

Revision history for this message
Jeff Hillman (jhillman) wrote : UdevDb.txt

apport information

Revision history for this message
Jeff Hillman (jhillman) wrote : WifiSyslog.txt

apport information

Revision history for this message
Jeff Hillman (jhillman) wrote : acpidump.txt

apport information

Revision history for this message
Jeff Hillman (jhillman) wrote : Re: Intel E810 NICs driver in causinghangs when booting and bonds configured

Dmesg doesn't show the driver version, and modinfo doesn't give one either. I did run the apport command, but with the newer driver installed.

I can say that in the latest focal kernel it is version 0.8.7-k, for what that's worth.

modinfo ice from a different jammy machine that is untainted, with the same kernel and no bonding:

filename: /lib/modules/5.15.0-58-generic/kernel/drivers/net/ethernet/intel/ice/ice.ko
firmware: intel/ice/ddp/ice.pkg
license: GPL v2
description: Intel(R) Ethernet Connection E800 Series Linux Driver
author: Intel Corporation, <email address hidden>
srcversion: 33C12CE56C291547C82D414
alias: pci:v00008086d0000151Dsv*sd*bc*sc*i*
alias: pci:v00008086d0000124Fsv*sd*bc*sc*i*
alias: pci:v00008086d0000124Esv*sd*bc*sc*i*
alias: pci:v00008086d0000124Dsv*sd*bc*sc*i*
alias: pci:v00008086d0000124Csv*sd*bc*sc*i*
alias: pci:v00008086d0000189Asv*sd*bc*sc*i*
alias: pci:v00008086d00001899sv*sd*bc*sc*i*
alias: pci:v00008086d00001898sv*sd*bc*sc*i*
alias: pci:v00008086d00001897sv*sd*bc*sc*i*
alias: pci:v00008086d00001894sv*sd*bc*sc*i*
alias: pci:v00008086d00001893sv*sd*bc*sc*i*
alias: pci:v00008086d00001892sv*sd*bc*sc*i*
alias: pci:v00008086d00001891sv*sd*bc*sc*i*
alias: pci:v00008086d00001890sv*sd*bc*sc*i*
alias: pci:v00008086d0000188Esv*sd*bc*sc*i*
alias: pci:v00008086d0000188Dsv*sd*bc*sc*i*
alias: pci:v00008086d0000188Csv*sd*bc*sc*i*
alias: pci:v00008086d0000188Bsv*sd*bc*sc*i*
alias: pci:v00008086d0000188Asv*sd*bc*sc*i*
alias: pci:v00008086d0000159Bsv*sd*bc*sc*i*
alias: pci:v00008086d0000159Asv*sd*bc*sc*i*
alias: pci:v00008086d00001599sv*sd*bc*sc*i*
alias: pci:v00008086d00001593sv*sd*bc*sc*i*
alias: pci:v00008086d00001592sv*sd*bc*sc*i*
alias: pci:v00008086d00001591sv*sd*bc*sc*i*
depends:
retpoline: Y
intree: Y
name: ice
vermagic: 5.15.0-58-generic SMP mod_unload modversions
sig_id: PKCS#7
signer: Build time autogenerated kernel key
sig_key: 22:65:46:2C:C7:1D:FA:BF:DE:CC:81:3A:37:95:6F:B4:CF:9A:8C:B8
sig_hashalgo: sha512
signature: 67:A2:9C:EA:C4:AD:FE:8B:DF:E3:12:15:18:80:70:FD:1A:13:2C:CE:
  F6:D5:08:F2:10:3B:94:7A:4B:8D:FC:FF:A4:4B:76:32:2E:F8:EC:32:
  BA:36:F7:31:E2:28:B2:5F:0C:7B:BC:82:6F:A9:7D:D4:57:6D:93:A8:
  7D:E7:25:17:29:63:F1:E0:7A:3E:49:38:B7:F9:C9:D8:3F:ED:D9:B5:
  F8:A0:BE:B1:14:3C:75:5E:C6:56:71:37:8F:1D:DC:0F:D2:77:76:2B:
  0A:A3:9B:AA:3D:58:86:C9:6D:79:30:E5:46:8A:8E:4C:51:BF:2A:8A:
  33:18:5D:50:AB:3D:2A:75:19:B9:A5:E3:85:5D:35:81:60:8C:1D:A3:
  4B:98:FA:92:98:02:EB:0D:85:6B:BE:AD:B3:D1:1D:0C:50:B8:5B:ED:
  54:83:46:DD:5E:E1:37:1D:11:09:96:5F:71:04:AF:BA:02:61:54:27:
  22:51:36:36:35:E4:82:29:C3:09:C8:3F:41:0A:2E:3F:1A:60:FC:49:
  72:87:6D:A0:51:DA:2B:41:FD:7F:ED:FD:1D:05:86:B2:36:6A:9E:19:
  BF:D6:8D:47:DC:04:BD:04:32:60:0E:36:38:78:AF:57:4A:7B:A1:C2:
  33:72:FA:42:FC:42:BD:BA:26:9A:11:60:5E:8D:F5:01:B5:C2:29:12:
  6E:B8:66:46:B3:5F:EA:BF:23:29:F2:E1:B3:2E:7D:34:CA...


Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

Hi,

The attached dmesg seems to be from a clean reboot and doesn't contain any information about the hangs. Are there any logs available from an affected boot that provide more information than the systemd-resolved hung task message? The systemd log messages would also help (e.g. 'journalctl -u systemd-resolved').

Michael Reed (mreed8855)
summary: - Intel E810 NICs driver in causinghangs when booting and bonds configured
+ Intel E810 NICs driver in causing hangs when booting and bonds
+ configured
Revision history for this message
Jay Vosburgh (jvosburgh) wrote :

https://<email address hidden>/T/#u

A possible fix for this problem. The patch was posted on intel-wired-lan a couple weeks ago and just hit netdev today.

Nobuto Murata (nobuto)
description: updated
Revision history for this message
Jeff Hillman (jhillman) wrote :

Using the patch that Jay mentioned above, and following his directions for getting the kernel compiled, the issue appears to be resolved after 5 reboots.

From watching this system reboot many times with both the stock kernel and the upstream Intel driver, I can see that the patched kernel acts more like the latter.

With the stock kernel, whether or not it would continue to boot, there was also a hang when starting the network.

With this patch, just like with the upstream Intel driver, there is no blip. It simply boots.

Once the system was up, I could verify connectivity on the bonds by reaching both adjacent endpoints as well as upstream targets.

The process resulted in the 5.15.60+ kernel version.

Revision history for this message
Jeff Lane  (bladernr) wrote :

Update from upstream:

The patch (v2) was dropped because a customer reported it did not fix their issue. I just reached out to the developer for an ETA to fix it and will keep you posted.

Additionally there is ongoing discussion around the V2 of that patch here:
https://lore.kernel.org/intel-wired-lan/ygay1oxikvo.fsf@localhost/T/#m31a3f2847d013a41104524bd81e7f2fa481b4b97

James Troup (elmo)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Bartosz Woronicz (mastier1) wrote :

This seems to be closely related to the bonding driver.
Here is another case that might be related:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2008781

description: updated
Changed in linux (Ubuntu Jammy):
status: New → Confirmed
Changed in linux (Ubuntu Kinetic):
status: New → Confirmed
Changed in linux (Ubuntu Jammy):
assignee: nobody → Heitor Alves de Siqueira (halves)
Changed in linux (Ubuntu Kinetic):
assignee: nobody → Heitor Alves de Siqueira (halves)
Changed in linux (Ubuntu Lunar):
assignee: nobody → Heitor Alves de Siqueira (halves)
Changed in linux (Ubuntu Jammy):
importance: Undecided → High
Changed in linux (Ubuntu Kinetic):
importance: Undecided → High
Changed in linux (Ubuntu Lunar):
importance: Undecided → High
Stefan Bader (smb)
Changed in linux (Ubuntu Kinetic):
status: Confirmed → Fix Committed
Changed in linux (Ubuntu Jammy):
status: Confirmed → Fix Committed
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/5.19.0-41.42 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-kinetic' to 'verification-done-kinetic'. If the problem still exists, change the tag 'verification-needed-kinetic' to 'verification-failed-kinetic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-kinetic-linux verification-needed-kinetic
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/5.15.0-72.79 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux verification-needed-jammy
Revision history for this message
Olivier FAURAX (olivier-faurax) wrote :

5.15.0-72 boots OK on affected hardware.

root@m3-small-x86-01:~# lspci|grep E810
01:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
01:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
root@m3-small-x86-01:~# uname -a
Linux m3-small-x86-01 5.15.0-72-generic #79-Ubuntu SMP Wed Apr 19 08:22:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
root@m3-small-x86-01:~# ping 8.8.8.8 -c 4
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=0.640 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=119 time=0.708 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=119 time=0.736 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=119 time=0.734 ms

--- 8.8.8.8 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3062ms
rtt min/avg/max/mdev = 0.640/0.704/0.736/0.038 ms

tags: added: verification-done-jammy
removed: verification-needed-jammy
Revision history for this message
Olivier FAURAX (olivier-faurax) wrote :

On current 23.04 (lunar):
* 6.2.0-20 doesn't work
* 6.2.0-21 doesn't work

Revision history for this message
Olivier FAURAX (olivier-faurax) wrote :

Works for 5.19.0-42:

root@m3-small-x86-01:~# uname -a
Linux m3-small-x86-01 5.19.0-42-generic #43-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 18 18:21:28 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
root@m3-small-x86-01:~# lspci|grep E810
01:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
01:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
root@m3-small-x86-01:~# ping 8.8.8.8 -c 4
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=0.712 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=119 time=0.690 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=119 time=0.706 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=119 time=0.767 ms

--- 8.8.8.8 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3054ms
rtt min/avg/max/mdev = 0.690/0.718/0.767/0.029 ms

tags: added: verification-done-kinetic
removed: verification-needed-kinetic
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-nvidia-5.19/5.19.0-1010.10 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-nvidia-5.19 verification-needed-jammy
removed: verification-done-jammy
Revision history for this message
Olivier FAURAX (olivier-faurax) wrote :

5.19.0-1010.10 works OK

root@m3-small-x86-01:~# uname -a
Linux m3-small-x86-01 5.19.0-1010-nvidia #10-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 25 23:39:48 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
root@m3-small-x86-01:~# lspci|grep E810
01:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
01:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
root@m3-small-x86-01:~# ping 8.8.8.8 -c 4
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=0.706 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=119 time=0.774 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=119 time=0.729 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=119 time=0.703 ms

--- 8.8.8.8 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3076ms
rtt min/avg/max/mdev = 0.703/0.728/0.774/0.028 ms

tags: added: verification-done-jammy
removed: verification-needed-jammy
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.15.0-72.79

---------------
linux (5.15.0-72.79) jammy; urgency=medium

  * jammy/linux: 5.15.0-72.79 -proposed tracker (LP: #2016548)

  * Add split lock detection for EMR (LP: #2015855)
    - x86/split_lock: Enumerate architectural split lock disable bit

  * selftest: fib_tests: Always cleanup before exit (LP: #2015956)
    - selftest: fib_tests: Always cleanup before exit

  * Add support for intel EMR cpu (LP: #2015372)
    - platform/x86: intel-uncore-freq: add Emerald Rapids support
    - perf/x86/intel/cstate: Add Emerald Rapids
    - perf/x86/rapl: Add support for Intel Emerald Rapids
    - intel_idle: add Emerald Rapids Xeon support
    - tools/power/x86/intel-speed-select: Add Emerald Rapid quirk
    - tools/power turbostat: Introduce support for EMR
    - powercap: intel_rapl: add support for Emerald Rapids
    - EDAC/i10nm: Add Intel Emerald Rapids server support

  * Kernel livepatch ftrace graph fix (LP: #2013603)
    - kprobes: treewide: Remove trampoline_address from
      kretprobe_trampoline_handler()
    - kprobes: treewide: Make it harder to refer kretprobe_trampoline directly
    - kprobes: Add kretprobe_find_ret_addr() for searching return address
    - s390/unwind: recover kretprobe modified return address in stacktrace
    - s390/unwind: fix fgraph return address recovery

  * Jammy update: v5.15.98 upstream stable release (LP: #2015600)
    - Linux 5.15.98

  * Jammy update: v5.15.97 upstream stable release (LP: #2015599)
    - ionic: refactor use of ionic_rx_fill()
    - Fix XFRM-I support for nested ESP tunnels
    - arm64: dts: rockchip: drop unused LED mode property from rk3328-roc-cc
    - ARM: dts: rockchip: add power-domains property to dp node on rk3288
    - HID: elecom: add support for TrackBall 056E:011C
    - ACPI: NFIT: fix a potential deadlock during NFIT teardown
    - btrfs: send: limit number of clones and allocated memory size
    - ASoC: rt715-sdca: fix clock stop prepare timeout issue
    - IB/hfi1: Assign npages earlier
    - neigh: make sure used and confirmed times are valid
    - HID: core: Fix deadloop in hid_apply_multiplier.
    - x86/cpu: Add Lunar Lake M
    - staging: mt7621-dts: change palmbus address to lower case
    - bpf: bpf_fib_lookup should not return neigh in NUD_FAILED state
    - net: Remove WARN_ON_ONCE(sk->sk_forward_alloc) from sk_stream_kill_queues().
    - vc_screen: don't clobber return value in vcs_read
    - scripts/tags.sh: Invoke 'realpath' via 'xargs'
    - scripts/tags.sh: fix incompatibility with PCRE2
    - usb: dwc3: pci: add support for the Intel Meteor Lake-M
    - USB: serial: option: add support for VW/Skoda "Carstick LTE"
    - usb: gadget: u_serial: Add null pointer check in gserial_resume
    - USB: core: Don't hold device lock while reading the "descriptors" sysfs file
    - Linux 5.15.97

  * Jammy update: v5.15.96 upstream stable release (LP: #2015595)
    - drm/etnaviv: don't truncate physical page address
    - wifi: rtl8xxxu: gen2: Turn on the rate control
    - drm/edid: Fix minimum bpc supported with DSC1.2 for HDMI sink
    - clk: mxl: Switch from direct readl/writel based IO to regmap based IO
    - ...

Changed in linux (Ubuntu Jammy):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.19.0-42.43

---------------
linux (5.19.0-42.43) kinetic; urgency=medium

  * kinetic/linux: 5.19.0-42.43 -proposed tracker (LP: #2016503)

  * selftest: fib_tests: Always cleanup before exit (LP: #2015956)
    - selftest: fib_tests: Always cleanup before exit

  * Debian autoreconstruct Fix restoration of execute permissions (LP: #2015498)
    - [Debian] autoreconstruct - fix restoration of execute permissions

  * Kinetic update: upstream stable patchset 2023-04-10 (LP: #2015812)
    - drm/etnaviv: don't truncate physical page address
    - wifi: rtl8xxxu: gen2: Turn on the rate control
    - drm/edid: Fix minimum bpc supported with DSC1.2 for HDMI sink
    - clk: mxl: Switch from direct readl/writel based IO to regmap based IO
    - clk: mxl: Remove redundant spinlocks
    - clk: mxl: Add option to override gate clks
    - clk: mxl: Fix a clk entry by adding relevant flags
    - powerpc: dts: t208x: Mark MAC1 and MAC2 as 10G
    - clk: mxl: syscon_node_to_regmap() returns error pointers
    - random: always mix cycle counter in add_latent_entropy()
    - KVM: x86: Fail emulation during EMULTYPE_SKIP on any exception
    - KVM: SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't valid
    - can: kvaser_usb: hydra: help gcc-13 to figure out cmd_len
    - powerpc: dts: t208x: Disable 10G on MAC1 and MAC2
    - powerpc/vmlinux.lds: Ensure STRICT_ALIGN_SIZE is at least page aligned
    - powerpc/64s/radix: Fix RWX mapping with relocated kernel
    - uaccess: Add speculation barrier to copy_from_user()
    - wifi: mwifiex: Add missing compatible string for SD8787
    - audit: update the mailing list in MAINTAINERS
    - ext4: Fix function prototype mismatch for ext4_feat_ktype
    - Revert "net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo
      child qdiscs"
    - bpf: add missing header file include
    - wifi: ath11k: fix warning in dma_free_coherent() of memory chunks while
      recovery
    - sched/psi: Stop relying on timer_pending() for poll_work rescheduling
    - docs: perf: Fix PMU instance name of hisi-pcie-pmu
    - randstruct: disable Clang 15 support
    - ionic: refactor use of ionic_rx_fill()
    - Fix XFRM-I support for nested ESP tunnels
    - arm64: dts: rockchip: drop unused LED mode property from rk3328-roc-cc
    - ARM: dts: rockchip: add power-domains property to dp node on rk3288
    - HID: elecom: add support for TrackBall 056E:011C
    - ACPI: NFIT: fix a potential deadlock during NFIT teardown
    - btrfs: send: limit number of clones and allocated memory size
    - ASoC: rt715-sdca: fix clock stop prepare timeout issue
    - IB/hfi1: Assign npages earlier
    - neigh: make sure used and confirmed times are valid
    - HID: core: Fix deadloop in hid_apply_multiplier.
    - x86/cpu: Add Lunar Lake M
    - bpf: bpf_fib_lookup should not return neigh in NUD_FAILED state
    - net: Remove WARN_ON_ONCE(sk->sk_forward_alloc) from sk_stream_kill_queues().
    - vc_screen: don't clobber return value in vcs_read
    - scripts/tags.sh: fix incompatibility with PCRE2
    - usb: dwc3: pci: add support for the Intel Meteor Lake-M
    - USB: serial: option: add suppo...

Changed in linux (Ubuntu Kinetic):
status: Fix Committed → Fix Released
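
For anyone checking a fleet of machines against the releases above, a rough version check might look like this. It is only a sketch: `FIRST_FIXED`, `parse_release()` and `has_fix()` are hypothetical names, the version numbers are taken from the two "This bug was fixed in the package linux" entries in this thread, and it deliberately handles only `-generic` kernels because flavour kernels (nvidia, aws, azure and so on) use different ABI numbering.

```python
import re

# First released versions carrying the fix, per the changelog entries above.
# Lunar is omitted because this thread does not record a fixed version for it.
FIRST_FIXED = {
    (5, 15): (5, 15, 0, 72),   # Jammy:   linux 5.15.0-72.79
    (5, 19): (5, 19, 0, 42),   # Kinetic: linux 5.19.0-42.43
}


def parse_release(release):
    """Turn a `uname -r` string such as '5.15.0-58-generic' into a tuple."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in m.groups())


def has_fix(release):
    """Return True or False for known generic series, otherwise None."""
    if not release.endswith("-generic"):
        return None            # flavour kernels are numbered differently
    version = parse_release(release)
    first_fixed = FIRST_FIXED.get(version[:2])
    if first_fixed is None:
        return None            # series not covered by this table
    return version >= first_fixed


print(has_fix("5.15.0-58-generic"))
print(has_fix("5.15.0-72-generic"))
```

On a live system the release string could come from `platform.release()` in the standard library, which reports the same value as `uname -r`.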
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-gcp/5.19.0-1024.26 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-kinetic' to 'verification-done-kinetic'. If the problem still exists, change the tag 'verification-needed-kinetic' to 'verification-failed-kinetic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-kinetic-linux-gcp verification-needed-kinetic
removed: verification-done-kinetic
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-riscv-5.15/5.15.0-1034.38~20.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-focal-linux-riscv-5.15 verification-needed-focal
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-allwinner/5.19.0-1012.12 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-kinetic' to 'verification-done-kinetic'. If the problem still exists, change the tag 'verification-needed-kinetic' to 'verification-failed-kinetic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-kinetic-linux-allwinner
Revision history for this message
Jeff Hillman (jhillman) wrote : Re: [Bug 2004262] Re: Intel E810 NICs driver in causing hangs when booting and bonds configured

I am no longer in a position to recreate the scenario.


Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-intel-iotg/5.15.0-1031.36 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

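The tag change requested above is a mechanical rename of the `verification-needed-…` tag. A minimal Python sketch of that rule follows; the helper name `verification_tag` is hypothetical and is not part of any Launchpad tooling.

```python
def verification_tag(current_tag: str, fixed: bool) -> str:
    """Return the tag that should replace a 'verification-needed-<suffix>' tag.

    `fixed` is True when the -proposed kernel solved the problem,
    False when the problem still exists.
    """
    prefix = "verification-needed-"
    if not current_tag.startswith(prefix):
        raise ValueError(f"not a verification-needed tag: {current_tag!r}")
    suffix = current_tag[len(prefix):]       # e.g. "jammy"
    state = "done" if fixed else "failed"
    return f"verification-{state}-{suffix}"


jammy_result = verification_tag("verification-needed-jammy", fixed=True)
```

The suffix is carried over unchanged, so the same rule covers both series-only tags and longer kernel-specific ones.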
tags: added: kernel-spammed-jammy-linux-intel-iotg verification-needed-jammy
removed: verification-done-jammy
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-aws/5.19.0-1027.28 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-kinetic' to 'verification-done-kinetic'. If the problem still exists, change the tag 'verification-needed-kinetic' to 'verification-failed-kinetic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-kinetic-linux-aws
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-aws/5.15.0-1038.43 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-aws
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure/5.15.0-1040.47 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-azure
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure/5.19.0-1028.31 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-kinetic' to 'verification-done-kinetic'. If the problem still exists, change the tag 'verification-needed-kinetic' to 'verification-failed-kinetic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-kinetic-linux-azure
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-aws-5.15/5.15.0-1046.51~20.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal-linux-aws-5.15' to 'verification-done-focal-linux-aws-5.15'. If the problem still exists, change the tag 'verification-needed-focal-linux-aws-5.15' to 'verification-failed-focal-linux-aws-5.15'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-focal-linux-aws-5.15-v2 verification-needed-focal-linux-aws-5.15
Revision history for this message
Andre Ruiz (andre-ruiz) wrote :

I'm having a very similar issue with the same hardware. Do you think it might be the same problem? If it is, then it was not actually fixed in jammy (I'm using a kernel that supposedly already has the fix).

- same hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
- ubuntu 22.04: `5.15.0-83-generic #92-Ubuntu SMP Mon Aug 14 09:30:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux`
- using a bond over the two ports of the same card, at 25Gbps to two different switches
- bond is using LACP with hash layer3+4 and fast timeout
- machine installed by maas. No issues during installation, but at that time bond is not formed yet
- later, when the installed Linux is booted, the bond is up and working without issues
- it works fine for about 2 to 3 hours, then the issue starts (may or may not be related to network load, but it seems to be triggered by some tests that I run after openstack finishes installing)
- one of the legs of the bond freezes and everything that would go to that leg is discarded, in and out; pings to random external hosts start losing every second packet
- after some time you can see messages in the kernel log about "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack trace
- the switch does log that the bond is flapping

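A frozen or flapping leg like the one described above can usually also be seen from the host side in the bonding driver's status file. The Python sketch below parses the per-slave `MII Status` and `Link Failure Count` fields from that text. The sample input is illustrative only and was not taken from the reporter's machine, and the function name is hypothetical.

```python
def parse_bond_slaves(text: str) -> dict:
    """Map each slave interface to its MII status and link failure count."""
    slaves = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or line without a "key: value" shape
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
            slaves[current] = {}
        elif current is not None and key in ("MII Status", "Link Failure Count"):
            slaves[current][key] = value
    return slaves


# Illustrative sample only (values are made up for the example).
SAMPLE = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up

Slave Interface: enp161s0f0
MII Status: down
Link Failure Count: 3

Slave Interface: enp161s0f1
MII Status: up
Link Failure Count: 0
"""

status = parse_bond_slaves(SAMPLE)
flapping = [name for name, s in status.items()
            if int(s.get("Link Failure Count", "0")) > 0]
```

On a live system the input would be the contents of the bond's file under `/proc/net/bonding/`, for example `/proc/net/bonding/bond0` if the bond is named `bond0` (an assumption here).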
[ 6337.489648] ------------[ cut here ]------------
[ 6337.489653] NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out
[ 6337.489663] WARNING: CPU: 12 PID: 0 at net/sched/sch_generic.c:477 dev_watchdog+0x277/0x280
[ 6337.489669] Modules linked in: nf_conntrack_netlink geneve ip6_udp_tunnel udp_tunnel xt_CT dm_crypt scsi_transport_iscsi veth nfnetlink_cttimeout openvswitch nsh nf_conncount unix_diag nft_masq zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_vsock vmw_vsock_virtio_transport_common vhost vhost_iotlb vsock xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter nf_tables nfnetlink bridge sunrpc nvme_fabrics 8021q garp mrp stp llc bonding tls intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd ipmi_ssif binfmt_misc kvm_amd kvm dell_wmi ledtrig_audio sparse_keymap video nls_iso8859_1 rapl irdma dell_smbios dcdbas i40e wmi_bmof dell_wmi_descriptor ib_uverbs ib_core ccp ptdma k10temp acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ramoops reed_solomon
[ 6337.489754] pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor...

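When comparing several such traces, it can help to pull the interface, driver, and queue number out of the watchdog line. The following is a small Python sketch that assumes the line format shown in the log above; the function name is hypothetical, and other kernel versions may word the message differently.

```python
import re

# Matches lines such as:
# [ 6337.489653] NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out
WATCHDOG_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+NETDEV WATCHDOG: "
    r"(?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def parse_watchdog(line: str):
    """Return the fields of a NETDEV WATCHDOG line, or None if it does not match."""
    m = WATCHDOG_RE.match(line)
    if m is None:
        return None
    return {
        "timestamp": float(m["ts"]),
        "iface": m["iface"],
        "driver": m["driver"],
        "queue": int(m["queue"]),
    }


hit = parse_watchdog(
    "[ 6337.489653] NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out"
)
```

Lines that are not watchdog messages, such as the `cut here` marker, return `None` and can be skipped when scanning a full log.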
Revision history for this message
Andre Ruiz (andre-ruiz) wrote :

Switched to HWE kernel on jammy (6.2.0-32-generic #32~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 18 10:40:13 UTC 2 x86_64 x86_64 x86_64 GNU/Linux) and still basically the same issue:

[33219.508873] ------------[ cut here ]------------
[33219.508877] NETDEV WATCHDOG: enp161s0f1 (ice): transmit queue 35 timed out
[33219.508932] WARNING: CPU: 48 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x21f/0x230
[33219.508940] Modules linked in: sch_ingress nf_conntrack_netlink geneve ip6_udp_tunnel udp_tunnel xt_CT dm_crypt scsi_transport_iscsi veth nfnetlink_cttimeout openvswitch nsh nf_conncount unix_diag nft_masq zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_vsock vmw_vsock_virtio_transport_common vhost vhost_iotlb vsock xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink bridge sunrpc nvme_fabrics 8021q garp mrp stp llc bonding tls binfmt_misc ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd dell_wmi kvm_amd video ledtrig_audio nls_iso8859_1 irdma sparse_keymap kvm i40e irqbypass dell_smbios dcdbas ib_uverbs rapl dell_wmi_descriptor wmi_bmof ib_core ccp ptdma k10temp acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ramoops
[33219.509051] reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear cdc_ether usbnet mii mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea crct10dif_pclmul sysfillrect sysimgblt crc32_pclmul bcache polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 nvme aesni_intel crypto_simd nvme_core ahci xhci_pci cryptd ice tg3 libahci drm megaraid_sas i2c_piix4 xhci_pci_renesas nvme_common wmi
[33219.509114] CPU: 48 PID: 0 Comm: swapper/48 Tainted: P O 6.2.0-32-generic #32~22.04.1-Ubuntu
[33219.509116] Hardware name: Dell Inc. PowerEdge R7525/03WYW4, BIOS 2.12.4 07/26/2023
[33219.509118] RIP: 0010:dev_watchdog+0x21f/0x230
[33219.509122] Code: 00 e9 31 ff ff ff 4c 89 e7 c6 05 66 83 78 01 01 e8 56 00 f8 ff 44 89 f1 4c 89 e6 48 c7 c7 08 4f e4 b7 48 89 c2 e8 61 df 2b ff <0f> 0b e9 22 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
[33219.509123] RSP: 0018:ffffb42719fd0e70 EFLAGS: 00010246
[33219.509125] RAX: 0000000000000000 RBX: ffff9bd91b3e74c8 RCX: 0000000000000000
[33219.509126] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[33219.509127] RBP: ffffb42719fd0e98 R08: 0000000000000000 R09: 0000000000000000
[33219.509128] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9bd91b3e7000
[33219.509129] R13: ffff9bd91b3e741c R14: 0000000000000023 R15: 0000000000000000
[33219.509130] FS: 0000000000000000(0000) GS:ffff9b573de00000(0000) knlGS:0000000000000000
[33219.509132] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[33219.509133] CR2: 000055fd64034000 CR3: 0000010273ae2004 CR4: 0000000000770ee0
[33219.509135] PKRU: 55555554
[33219.5091...

Revision history for this message
Andre Ruiz (andre-ruiz) wrote :

This seems different enough that I'll open a separate report for it. Thanks and sorry for the noise.
