I have tried, unsuccessfully, to reproduce this issue internally. Details of my setup below.

1) I have a pair of Dell R210 servers racked (u072 and u073 below), each with a BCM57416 installed:

root@u072:~# lspci | grep BCM57416
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)

2) I've matched the firmware version to one that Nivedita reported in a bad system:

root@u072:~# ethtool -i enp1s0f0np0
driver: bnxt_en
version: 1.10.0
firmware-version: 214.0.253.1/pkg 21.40.25.31
expansion-rom-version:
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no

3) Matched the Ubuntu release and kernel version:

root@u072:~# lsb_release -dr
Description:    Ubuntu 18.04.3 LTS
Release:        18.04
root@u072:~# uname -a
Linux u072 5.0.0-37-generic #40~18.04.1-Ubuntu SMP Thu Nov 14 12:06:39 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

4) Configured the interfaces into an active-backup bond:

root@u072:~# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: enp1s0f1np1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: enp1s0f1np1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 00:0a:f7:a7:10:61
Slave queue ID: 0

Slave Interface: enp1s0f0np0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 00:0a:f7:a7:10:60
Slave queue ID: 0
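For reference, the bond on the repro hosts was brought up with an Ubuntu 18.04 netplan configuration along these lines. This is a minimal sketch rather than the exact file; the interface names and the 192.168.1.0/24 addressing follow the outputs in this comment, and the MII monitor interval matches the 100 ms shown in /proc/net/bonding/bond0:

# /etc/netplan/01-bond.yaml -- illustrative sketch only (shown for u072)
network:
  version: 2
  ethernets:
    enp1s0f0np0: {}
    enp1s0f1np1: {}
  bonds:
    bond0:
      interfaces: [enp1s0f0np0, enp1s0f1np1]
      addresses: [192.168.1.1/24]        # 192.168.1.2/24 on the peer, u073
      parameters:
        mode: active-backup
        mii-monitor-interval: 100        # matches the 100 ms MII polling above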
5) Ran the provided mtr and netperf test cases with the 1st port selected as active:

root@u072:~# ip l set enp1s0f1np1 down
root@u072:~# ip l set enp1s0f1np1 up
root@u072:~# cat /proc/net/bonding/bond0 | grep Active
Currently Active Slave: enp1s0f0np0

a) initiated on u072:

root@u072:~# mtr --no-dns --report --report-cycles 60 192.168.1.2
Start: 2020-02-13T20:48:01+0000
HOST: u072                Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.168.1.2        0.0%    60    0.2   0.2   0.2   0.2   0.0

root@u072:~# netperf -t TCP_RR -H 192.168.1.2 -- -r 1,1
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.2 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  131072 1        1       10.00    29040.91
16384  87380

root@u072:~# netperf -t TCP_RR -H 192.168.1.2 -- -r 64,64
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.2 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  131072 64       64      10.00    28633.36
16384  87380

root@u072:~# netperf -t TCP_RR -H 192.168.1.2 -- -r 128,8192
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.2 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  131072 128      8192    10.00    17469.30
16384  87380

b) initiated on u073:

root@u073:~# mtr --no-dns --report --report-cycles 60 192.168.1.1
Start: 2020-02-13T20:53:37+0000
HOST: u073                Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.168.1.1        0.0%    60    0.1   0.1   0.1   0.2   0.0

root@u073:~# netperf -t TCP_RR -H 192.168.1.1 -- -r 1,1
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    28514.93
16384  131072

root@u073:~# netperf -t TCP_RR -H 192.168.1.1 -- -r 64,64
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  64       64      10.00    27405.88
16384  131072

root@u073:~# netperf -t TCP_RR -H 192.168.1.1 -- -r 128,8192
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  128      8192    10.00    17342.42
16384  131072

6) Ran the provided mtr and netperf test cases with the 2nd port selected as active:

root@u072:~# ip l set enp1s0f0np0 down
root@u072:~# ip l set enp1s0f0np0 up
root@u072:~# cat /proc/net/bonding/bond0 | grep Active
Currently Active Slave: enp1s0f1np1

a) initiated on u072:

root@u072:~# mtr --no-dns --report --report-cycles 60 192.168.1.2
Start: 2020-02-13T21:07:36+0000
HOST: u072                Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.168.1.2        0.0%    60    0.2   0.2   0.1   0.2   0.0

root@u072:~# netperf -t TCP_RR -H 192.168.1.2 -- -r 1,1
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.2 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  131072 1        1       10.00    28649.85
16384  87380

root@u072:~# netperf -t TCP_RR -H 192.168.1.2 -- -r 64,64
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.2 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  131072 64       64      10.00    27053.55
16384  87380

root@u072:~# netperf -t TCP_RR -H 192.168.1.2 -- -r 128,8192
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.2 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  131072 128      8192    10.00    16706.59
16384  87380

b) initiated on u073:

root@u073:~# mtr --no-dns --report --report-cycles 60 192.168.1.1
Start: 2020-02-13T21:12:54+0000
HOST: u073                Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.168.1.1        0.0%    60    0.1   0.1   0.1   0.2   0.0

root@u073:~# netperf -t TCP_RR -H 192.168.1.1 -- -r 1,1
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    27782.73
16384  131072

root@u073:~# netperf -t TCP_RR -H 192.168.1.1 -- -r 64,64
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  64       64      10.00    26645.73
16384  131072

root@u073:~# netperf -t TCP_RR -H 192.168.1.1 -- -r 128,8192
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  128      8192    10.00    17499.00
16384  131072

As can be seen above, I don't see the same behavior. The big difference in my setup is obviously the host, but I would be surprised if that were a factor, since the issue has been seen on vastly different host hardware configurations above. Are there any other differences I could have missed between the "Bad" system above and mine?

I am somewhat concerned about the rx_stat_discards, but suspect they're down below the noise floor for this issue. Could you nevertheless please carefully capture more ethtool stats on the production system, from before and after the test, for each of the bond leg interfaces (4 captures; I'm interested in the deltas that are due to the test)?
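Something along these lines on the production host would do. This is just a sketch; the interface names are the ones from my repro setup, so substitute the actual bond legs on the production system:

# Before the test: snapshot NIC stats for each bond leg
for dev in enp1s0f0np0 enp1s0f1np1; do      # substitute the production bond legs
    ethtool -S "$dev" > "/tmp/${dev}.before"
done

# ... run the mtr/netperf test cases ...

# After the test: snapshot again and show only the counters that changed
for dev in enp1s0f0np0 enp1s0f1np1; do
    ethtool -S "$dev" > "/tmp/${dev}.after"
    diff -u "/tmp/${dev}.before" "/tmp/${dev}.after"
done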
Given this is a production system, what else is running on it that might have an influence? I don't see drops on the PCIe rings in the stats, so it doesn't look like the host is falling behind, but perhaps you could dump CPU utilization during the test too?
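For the CPU side, a periodic per-CPU sample over the duration of the netperf runs would be enough. A minimal sketch, assuming the sysstat package (mpstat) is installed; sar or vmstat would do just as well:

# Sample per-CPU utilization once a second for the duration of the test
mpstat -P ALL 1 > /tmp/mpstat.during-test.log &
MPSTAT_PID=$!

# ... run the mtr/netperf test cases ...

kill "$MPSTAT_PID"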