We can trigger the issue with high load (benchmarking Ceph cluster with fio: 4 clients, 8 threads, iodepth 256, 100% random write, 64K block size).
Only when we use relatively large block size (64K) do we hit this problem. With 4K blocks we do not hit this issue. We haven't tested large random reads (that test is still to be done).
When using openvswitch port-channel (as we do) with jumbo frames ... this port-channel will not come back online after the reset. rmmod i40e / modprobe i40e does the trick though.
H there. I can confirm this problem still exists in newest kernels and with the latest intel drivers as of today:
Jan 19 16:05:19 osd9 kernel: [511271.581413] i40e 0000:02:00.1: TX driver issue detected, PF reset issued
Jan 19 16:09:08 osd9 kernel: [511500.919380] i40e 0000:02:00.0: TX driver issue detected, PF reset issued
driver: i40e-2.4.3 (and xenial / 4.13 shipped driver: 2.1.14-k)
kernel: 4.13.0-25-generic #29~16.04.2-Ubuntu SMP Tue Jan 9 12:16:39 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux. Kernel loaded with nopti noibrs noibpb (Meltdown / Spetre mitigation disabled).
We can trigger the issue with high load (benchmarking Ceph cluster with fio: 4 clients, 8 threads, iodepth 256, 100% random write, 64K block size).
Only when we use relatively large block size (64K) do we hit this problem. With 4K blocks we do not hit this issue. We haven't tested large random reads (that test is still to be done).
When using openvswitch port-channel (as we do) with jumbo frames ... this port-channel will not come back online after the reset. rmmod i40e / modprobe i40e does the trick though.