e1000e in 4.4.0-97-generic breaks 82574L under heavy load.
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
linux (Ubuntu) | Fix Released | Medium | Joseph Salisbury | |
Xenial | Fix Released | Medium | Joseph Salisbury | |
Zesty | Won't Fix | Medium | Joseph Salisbury | |
Artful | Fix Released | Medium | Joseph Salisbury | |
Bug Description
== SRU Justification ==
This issue was first reported on the netdev email list by Lennart Sorensen:
https://<email address hidden>
Commit 16ecba59bc333d6
"Unfortunately this commit changed the driver to assume
that the Other Causes interrupt can only mean link state change and
hence sets the flag that (unfortunately) means both link is down and link
state should be checked. Since this now happens 3000 times per second,
the chances of it happening while the watchdog_task is checking the link
state becomes pretty high, and if it does happen to coincide, then the
watchdog_task will reset the adapter, which causes a real loss of link."
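The race described in the quote can be sketched with a toy model. This is only an illustration under stated assumptions, not the driver's code: `ToyAdapter`, `watchdog_racy` and `watchdog_separate_signal` are hypothetical names, and the interleaving is forced by a parameter instead of real interrupts. The second watchdog variant shows one way to keep "please re-check" and "link is up" as separate signals, in the spirit of the fix title "Separate signaling for link check/link up".

```python
class ToyAdapter:
    """Toy stand-in for the driver state (hypothetical, not the real e1000e structs)."""

    def __init__(self):
        self.phy_link_up = True        # the physical link never actually drops
        self.get_link_status = False   # overloaded: "re-check link" AND "link is down"


def other_causes_interrupt(adapter):
    # Behaviour described in the quote: every Other Causes interrupt is
    # assumed to be a link state change, so the flag is set.
    adapter.get_link_status = True


def check_for_link(adapter):
    # Clear the flag when a re-check was requested and the link is up.
    # Returns the current (toy) link state.
    if adapter.get_link_status and adapter.phy_link_up:
        adapter.get_link_status = False
    return adapter.phy_link_up


def watchdog_racy(adapter, interrupt_mid_check):
    check_for_link(adapter)
    if interrupt_mid_check:
        other_causes_interrupt(adapter)   # fires inside the race window
    # The same flag is now read back as "link is down".
    return not adapter.get_link_status


def watchdog_separate_signal(adapter, interrupt_mid_check):
    link_up = check_for_link(adapter)     # link state travels in the return value
    if interrupt_mid_check:
        other_causes_interrupt(adapter)   # only requests another check later
    return link_up
```

With a fresh `ToyAdapter` whose flag has been set by one interrupt, `watchdog_racy(adapter, True)` reports no link even though `phy_link_up` is still true, while `watchdog_separate_signal(adapter, True)` reports link up and leaves the flag set for the next check.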
The original reporter experienced this issue on a Supermicro X7SPA-HF-D525 server board.
However, the bug is now seen on many servers running X9DBL-1F server boards.
This bug is fixed by commits 19110cfbb34 and 4aea7a5c5e9, which were both added
to mainline in v4.15-rc1.
The commit that introduced this bug, 16ecba5, was added to mainline in v4.5-rc1. However,
Xenial received this commit as well as commit 531ff577a. Bionic master-next does not need
these commits, since it got them via bug 1735843 and the 4.14.3 updates.
== Fixes ==
19110cfbb34 ("e1000e: Separate signaling for link check/link up")
4aea7a5c5e9 ("e1000e: Avoid receiver overrun interrupt bursts")
== Regression Potential ==
These commits are specific to the e1000e driver.
== Test Case ==
A test kernel was built with these patches and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.
== Original Bug Description ==
This issue was first reported on the netdev email list by Lennart Sorensen:
https://<email address hidden>
Commit 16ecba59bc333d6
"Unfortunately this commit changed the driver to assume
that the Other Causes interrupt can only mean link state change and
hence sets the flag that (unfortunately) means both link is down and link
state should be checked. Since this now happens 3000 times per second,
the chances of it happening while the watchdog_task is checking the link
state becomes pretty high, and if it does happen to coincide, then the
watchdog_task will reset the adapter, which causes a real loss of link."
A fix for this issue was accepted into the net-next branch, along with other e1000e/igb patches: https:/
The original reporter experienced this issue on a Supermicro X7SPA-HF-D525 server board. We see this issue on many servers running X9DBL-1F server boards. Both boards use the Intel 82574L for the network interfaces. We see messages like this under heavy load:
[Nov 6 15:42] e1000e: eth0 NIC Link is Down
[ +0.001670] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[Nov 6 16:10] e1000e: eth0 NIC Link is Down
[ +0.008505] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[Nov 7 00:49] e1000e: eth0 NIC Link is Down
[ +2.235111] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
We have confirmed that the connected switch also sees the link drops, so these are not false alarms from the e1000e driver.
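To quantify how often and for how long the link drops, log lines in the format above can be paired up with a short script. This is a minimal sketch, assuming relative-timestamp kernel log output exactly like the excerpt, where each "Link is Up" line directly follows its "Link is Down" line so the `+seconds` delta is the outage length. The function name `link_flaps` is hypothetical.

```python
import re

# Patterns match the message layout shown in the excerpt above.
DOWN = re.compile(r"^\[(?P<stamp>[^\]]+)\]\s+e1000e: (?P<iface>\S+) NIC Link is Down")
UP = re.compile(r"^\[\s*\+(?P<delta>\d+\.\d+)\]\s+e1000e: (?P<iface>\S+) NIC Link is Up")

SAMPLE = """\
[Nov 6 15:42] e1000e: eth0 NIC Link is Down
[ +0.001670] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[Nov 6 16:10] e1000e: eth0 NIC Link is Down
[ +0.008505] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[Nov 7 00:49] e1000e: eth0 NIC Link is Down
[ +2.235111] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
""".splitlines()


def link_flaps(lines):
    """Return (timestamp, interface, seconds_down) for each Down/Up pair."""
    flaps = []
    pending = None  # last unmatched "Link is Down" event
    for line in lines:
        line = line.strip()
        m = DOWN.match(line)
        if m:
            pending = (m["stamp"].strip(), m["iface"])
            continue
        m = UP.match(line)
        if m and pending and pending[1] == m["iface"]:
            flaps.append((pending[0], m["iface"], float(m["delta"])))
            pending = None
    return flaps


for stamp, iface, seconds in link_flaps(SAMPLE):
    print(f"{stamp}: {iface} down for {seconds:.6f}s")
```

On a real log, other messages may appear between the Down and Up lines, in which case the relative delta no longer equals the outage duration and absolute timestamps would be needed instead.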
# lsb_release -rd
Description: Ubuntu 16.04.2 LTS
Release: 16.04
I could not cleanly apply the net-next patch to 4.4.0, so I tested with just the following cherry-picked changes on the latest 4.4.0 kernel source package:
https:/
https:/
https:/
https:/
https:/
It is my understanding that the first two are the critical ones for the race condition. I have been running the patched e1000e kernel driver under network load for 7 days and no longer see the network interface drops.
Could we pull these changes into the Ubuntu 4.4.0 kernel?
Thanks
---
AlsaDevices:
total 0
crw-rw---- 1 root audio 116, 1 Jul 19 07:34 seq
crw-rw---- 1 root audio 116, 33 Jul 19 07:34 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.20.1-0ubuntu2.10
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 16.04
HibernationDevice: RESUME=
IwConfig: Error: [Errno 2] No such file or directory
Lsusb:
Bus 002 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 001 Device 003: ID 0557:2221 ATEN International Co., Ltd Winbond Hermon
Bus 001 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: Supermicro X9DBL-3F/X9DBL-iF
Package: linux (not installed)
PciMultimedia:
ProcEnviron:
TERM=xterm-
PATH=(custom, no user)
LANG=en_GB.UTF-8
SHELL=/bin/bash
ProcFB:
ProcKernelCmdLine: BOOT_IMAGE=
ProcVersionSign
RelatedPackageV
linux-
linux-
linux-firmware 1.157.11
RfKill: Error: [Errno 2] No such file or directory
Tags: xenial xenial
Uname: Linux 4.4.0-83-generic x86_64
UnreportableReason: The report belongs to a package that is not installed.
UpgradeStatus: Upgraded to xenial on 2016-12-05 (337 days ago)
UserGroups:
_MarkForUpload: False
dmi.bios.date: 12/28/2012
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 2.00
dmi.board.
dmi.board.name: X9DBL-3F/X9DBL-iF
dmi.board.vendor: Supermicro
dmi.board.version: 0123456789
dmi.chassis.
dmi.chassis.type: 3
dmi.chassis.vendor: Supermicro
dmi.chassis.
dmi.modalias: dmi:bvnAmerican
dmi.product.name: X9DBL-3F/X9DBL-iF
dmi.product.
dmi.sys.vendor: Supermicro
tags: | added: apport-collected |
description: | updated |
Changed in linux (Ubuntu): | |
status: | Incomplete → Confirmed |
Changed in linux (Ubuntu): | |
importance: | Undecided → Medium |
assignee: | nobody → Joseph Salisbury (jsalisbury) |
Changed in linux (Ubuntu Xenial): | |
status: | New → In Progress |
Changed in linux (Ubuntu): | |
status: | Confirmed → In Progress |
Changed in linux (Ubuntu Xenial): | |
importance: | Undecided → Medium |
assignee: | nobody → Joseph Salisbury (jsalisbury) |
Changed in linux (Ubuntu Zesty): | |
status: | In Progress → Won't Fix |
Changed in linux (Ubuntu Xenial): | |
status: | In Progress → Fix Committed |
Changed in linux (Ubuntu Artful): | |
status: | In Progress → Fix Committed |
This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:
apport-collect 1730550
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the Ubuntu Kernel Team.