In kernel ixgbevf driver locks up on AWS M4 instances(among others)
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
cloud-images |
Incomplete
|
Undecided
|
Stefan Bader |
Bug Description
Creating this bug to separate the issue from this bug report https:/
In that bug report people called out possible issues with the ixbgevf driver included in Ubuntu 14.04. I came across this while researching a customer issue, I was hoping to find known issues with the driver.
The issue is that after operating normally for some time. The AWS instance seems to lose all network connectivity. The console log shows now errors but we did note that after a reboot of the instance. DHCP was failing to renew it's lease (seen in /var/logs multiple attempts). Which would explain the connectivity loss, no IP address after lease expires.
Unfortunately I do not have access to the customers AMI/image or a working reproduction of this. AWS being predominately is an IaaS provider, which means this is typically the case for most issues I work on with our customers. Unless the customer is willing to provide me a copy of there data/image. Even if they did want to, I usually don't ask if I suspected it was workload related because simulating a customer workload is often complex/difficult. In this case I suspected it would be workload related/triggered, given the instances failed randomly after having worked fine for hours.
Regardless, the evidence of a problem was clear. The customer never had these problems before on older instance types which did not provide Intel SR-IOV using the same identical image/AMI. After moving to m4s they started to randomly get instances failing in their auto scaling groups(ASGs). After working with them the pattern of failures was clear(they had multiple ASGs), only the ASGs using m4s had these random instance failures and looking at the console log the main difference I noted was the change in network driver on the m4s. I provided them detailed instruction on compiling the latest out of tree driver(at the time this was 2.16.1 https:/
Given the in kernel driver has additional fixes now in the upstream source. The only thing I can suggest is looking there for clues on what was done to address it in later releases.
https:/
Specifically I think the issue is fixed in kernels which have version 2.12.1-k or later(as OSes running this seem fine). As per my post at the time Ubuntu 14.04 was running 2.11.3-k. This was the change in version from 2.11.3-k to 2.12.1-k in the Linux tree. Whether the fix for the issue is between these two version of one of the changes which landed since(without a version bump) I'm not sure.
https:/
If the in kernel driver version number tracked the out of tree driver version it would probably be easier to pinpoint. Though people from Intel told me the in Kernel driver is at parity with the latest out of tree version, the version number does not reflect this which is a little confusing.
Changed in ubuntu-on-ec2: | |
assignee: | nobody → Stefan Bader (smb) |
affects: | ubuntu-on-ec2 → cloud-images |
smb, Here is the new bug for the EC2 M4 work issue while using the in-tree ixgbevf driver.