Random bond down on nodes with 6 interfaces during deployment

Bug #1657750 reported by Sergey Galkin
This bug affects 1 person
Affects             Status     Importance  Assigned to      Milestone
Fuel for OpenStack  Confirmed  High        Fuel Sustaining
  Nominated for Ocata by Oleksiy Molchanov
  Mitaka            Confirmed  High        Fuel Sustaining
  Newton            Confirmed  High        Fuel Sustaining

Bug Description

Steps to reproduce:
1. Install 9.0
2. Upgrade to 9.2 from http://mirror.fuel-infra.org/mos-repos/centos/mos9.0-centos7/snapshots/proposed-2017-01-13-184421/x86_64
3. Deploy a big cluster (~400 nodes in my case) with 2 types of nodes (4 interfaces and 6 interfaces), with bonding configured as shown in the screenshots

During deployment, random 6-interface nodes went offline. In the last case these were node-2168 and node-2265.

node-2168

root@node-2168# ip a show bond0
8: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue master br-fw-admin state DOWN group default qlen 1000

root@node-2168:~# cat /proc/net/bonding/bond0 | grep MII
MII Status: down
MII Polling Interval (ms): 1000
MII Status: up
MII Status: up
MII Status: up
MII Status: up

node-2265
root@node-2265:/var/log# ip a show bond0
8: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-fw-admin state UP group default qlen 1000

root@node-2265:/var/log# cat /proc/net/bonding/bond0 | grep MII
MII Status: up
MII Polling Interval (ms): 1000
MII Status: up
MII Status: up
MII Status: up
MII Status: up
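
In the grep output above, the first `MII Status` line belongs to the bond itself and the following ones to its slaves, so node-2168 shows a bond that is down while all four slaves report up. A minimal sketch that labels each status line, assuming the usual /proc/net/bonding layout where every slave section starts with a `Slave Interface:` line (the function name `bond_mii_summary` is made up for illustration):

```shell
# Label each "MII Status" line with the interface it belongs to.
# Reads /proc/net/bonding/<bond> content on stdin.
# $1: label to print for the bond itself (default: "bond").
bond_mii_summary() {
    awk -v name="${1:-bond}" '
        # Every slave section starts with its interface name.
        /^Slave Interface:/ { name = $3 }
        # The first status line (before any slave) is the bond itself.
        /^MII Status:/      { print name ": " $3 }
    '
}
```

It can be run as `bond_mii_summary bond0 < /proc/net/bonding/bond0`.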

Workaround: run the following commands on the offline node:
for i in eno1 eno2 enp3s0f0 enp3s0f1; do ifdown ${i}; done
for i in eno1 eno2 enp3s0f0 enp3s0f1; do ifup ${i}; done

We do not see any similar issues on nodes with 4 interfaces.

Fuel logs are available at
http://mos-scale-share.mirantis.com/fuel-9.2-2017-01-19-17-39-logs.tar.gz

Sergey Galkin (sgalkin) wrote :
Changed in fuel:
milestone: none → 9.3
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
importance: Undecided → High
Sergey Galkin (sgalkin) wrote :

Workaround in action

root@node-2312:~# ping 10.21.0.2
PING 10.21.0.2 (10.21.0.2) 56(84) bytes of data
From 10.21.1.78 icmp_seq=1 Destination Host Unreachable
From 10.21.1.78 icmp_seq=2 Destination Host Unreachable
From 10.21.1.78 icmp_seq=3 Destination Host Unreachable
^C
--- 10.21.0.2 ping statistics ---
4 packets transmitted, 0 received, +3 errors, 100% packet loss, time 3007ms

root@node-2312:~# for i in eno1 eno2 enp3s0f0 enp3s0f1; do ifdown ${i}; done
ifdown: interface eno2 not configured
ifdown: interface enp3s0f0 not configured
ifdown: interface enp3s0f1 not configured

root@node-2312:~# for i in eno1 eno2 enp3s0f0 enp3s0f1; do ifup ${i}; done
ntp stop/waiting
ntp start/running, process 149331
ntp stop/waiting
ntp start/running, process 149463
ntp stop/waiting
ntp start/running, process 149612
ntp stop/waiting
ntp start/running, process 149873

root@node-2312:~# ping 10.21.0.2
PING 10.21.0.2 (10.21.0.2) 56(84) bytes of data.
64 bytes from 10.21.0.2: icmp_seq=1 ttl=64 time=0.395 ms

Sergey Galkin (sgalkin) wrote :

The workaround does not actually work.
After running
for i in eno1 eno2 enp3s0f0 enp3s0f1; do ifdown ${i}; done; for i in eno1 eno2 enp3s0f0 enp3s0f1; do ifup ${i}; done
and restarting the deployment, the node where the workaround was applied went offline again.

Aleksandr Didenko (adidenko) wrote :

Please also provide the following information for the problem node:

- /etc/astute.yaml
- output of this command:
for i in eno1 eno2 enp3s0f0 enp3s0f1; do ethtool ${i}; done
- output of this command:
cat /proc/net/bonding/bond0
- output of this command:
lshw
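
Since the bond failure also cuts the node off from remote logging, it may help to gather all of the above into one directory locally. A rough sketch, not an official Fuel tool: the function name `collect_bond_info` and the output file names are assumptions, and every step is allowed to fail so that a missing file or tool does not stop the rest.

```shell
# Collect the requested diagnostics into one directory.
# Usage: collect_bond_info OUTDIR BOND IFACE...
collect_bond_info() {
    cbi_outdir=$1
    cbi_bond=$2
    shift 2
    mkdir -p "$cbi_outdir" || return 1

    # Deployment settings of the node; leave a marker file if they are gone.
    cp /etc/astute.yaml "$cbi_outdir/astute.yaml" 2>/dev/null \
        || echo "not found" > "$cbi_outdir/astute.yaml.missing"

    # Link settings for every bond member.
    for cbi_if in "$@"; do
        ethtool "$cbi_if" > "$cbi_outdir/ethtool-$cbi_if.txt" 2>&1 || true
    done

    # Bond state as seen by the kernel, then the hardware listing.
    cat "/proc/net/bonding/$cbi_bond" > "$cbi_outdir/bonding-$cbi_bond.txt" 2>&1 || true
    lshw > "$cbi_outdir/lshw.txt" 2>&1 || true
}
```

For the node in this report that would be `collect_bond_info /root/bond-debug bond0 eno1 eno2 enp3s0f0 enp3s0f1`, where the output directory is only an example.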

Leontiy Istomin (listomin) wrote :

Unfortunately, we removed the environment with the repro, so we can't provide /etc/astute.yaml.

Bond parameters were the following:
 - action: add-bond
   bond_properties:
     lacp_rate: fast
     mode: 802.3ad
     xmit_hash_policy: layer3+4
   bridge: br-fw-admin
   interface_properties:
     vendor_specific:
       disable_offloading: true
   interfaces:
   - eno1
   - eno2
   - enp3s0f0
   - enp3s0f1
   name: bond0

root@bootstrap:~# ethtool -i eno1
driver: i40e
version: 1.4.25-k
firmware-version: 4.41 0x80001863 16.5.20
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
root@bootstrap:~# ethtool -i eno2
driver: i40e
version: 1.4.25-k
firmware-version: 4.41 0x80001863 16.5.20
bus-info: 0000:01:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
root@bootstrap:~# ethtool -i enp3s0f0
driver: i40e
version: 1.4.25-k
firmware-version: 4.41 0x80001866 16.5.20
bus-info: 0000:03:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
root@bootstrap:~# ethtool -i enp3s0f1
driver: i40e
version: 1.4.25-k
firmware-version: 4.41 0x80001866 16.5.20
bus-info: 0000:03:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
root@bootstrap:~# lspci | grep Ether
01:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
01:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
03:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
03:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
05:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
05:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)

Leontiy Istomin (listomin) wrote :

Found the /etc/interfaces.d/ifcfg-bond0 config:
auto bond0
iface bond0 inet manual
bond-slaves eno1 eno2 enp3s0f0 enp3s0f1
bond-mode 802.3ad
bond-miimon 1000
bond-use-carrier 1
bond-lacp-rate fast
bond-updelay 30000
bond-downdelay 10000
bond-ad-select bandwidth
bond-xmit-hash-policy layer3+4

The delay parameters are multiplied by 10 because we tried the following patch: http://paste.openstack.org/show/596089/
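
For reference, the bonding driver documentation says that updelay and downdelay should be multiples of miimon, otherwise they are rounded down to the nearest multiple. The values above (30000 and 10000 with miimon 1000) satisfy that. A small sketch that checks an ifupdown-style stanza read from stdin; `check_bond_delays` is a hypothetical helper name, not part of any package:

```shell
# Verify that bond-updelay and bond-downdelay are multiples of bond-miimon.
# Reads an ifupdown stanza on stdin; exits non-zero on a mismatch.
check_bond_delays() {
    awk '
        $1 == "bond-miimon"    { miimon = $2 }
        $1 == "bond-updelay"   { delay["updelay"] = $2 }
        $1 == "bond-downdelay" { delay["downdelay"] = $2 }
        END {
            # Without a positive miimon the delays have no meaning.
            if (miimon + 0 <= 0) { print "miimon is not set"; exit 1 }
            bad = 0
            for (k in delay) {
                if (delay[k] % miimon != 0) {
                    printf "%s=%d is not a multiple of miimon=%d\n", k, delay[k], miimon
                    bad = 1
                }
            }
            exit bad
        }
    '
}
```

Usage would be `check_bond_delays < /etc/interfaces.d/ifcfg-bond0`, with the path taken from the comment above.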

Aleksandr Didenko (adidenko) wrote :

Unfortunately, when the bond fails, rsyslog can no longer send logs to the Fuel node, so the archive attached above contains nothing related to the bond0 failure. Logs from the failed node and the information I asked for previously are needed to investigate this issue further.

Sergey Galkin (sgalkin) wrote :

Logs from the failed node (Untitled (30:f0) - node-1990 - osscr01r12c31) are available at http://mos-scale-share.mirantis.com/var-log-node-bond-error.tar.gz

Sergey Galkin (sgalkin) wrote :

dmesg from failed node

[ 608.776981] bonding: Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
[ 608.787064] Bridge firewalling registered
[ 608.824705] 8021q: 802.1Q VLAN Support v1.8
[ 608.824720] 8021q: adding VLAN 0 to HW filter on device eno1
[ 608.824725] i40e 0000:01:00.0 eno1: adding f8:bc:12:06:c9:70 vid=0
[ 627.044017] device ovs-system entered promiscuous mode
[ 627.044227] netlink: Unknown key attribute (type=20, max=19).
[ 627.095276] device br-floating entered promiscuous mode
[ 627.183961] device p_ff798dba-0 entered promiscuous mode
[ 627.190043] br-ex: port 1(p_ff798dba-0) entered forwarding state
[ 627.190050] br-ex: port 1(p_ff798dba-0) entered forwarding state
[ 633.413407] i40e 0000:01:00.0 eno1: removing f8:bc:12:06:c9:70 vid=0
[ 633.413736] bonding: bond0: Adding slave eno1.
[ 633.413813] i40e 0000:01:00.0 eno1: already using mac address f8:bc:12:06:c9:70
[ 633.419025] 8021q: adding VLAN 0 to HW filter on device eno1
[ 633.419040] i40e 0000:01:00.0 eno1: adding f8:bc:12:06:c9:70 vid=0
[ 633.419242] bonding: bond0: enslaving eno1 as an active interface with an up link.
[ 633.443242] bonding: bond0: Adding slave eno2.
[ 633.443247] i40e 0000:01:00.1 eno2: set new mac address f8:bc:12:06:c9:70
[ 633.461615] i40e 0000:01:00.1 eno2: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[ 633.461788] 8021q: adding VLAN 0 to HW filter on device eno2
[ 633.461790] i40e 0000:01:00.1 eno2: adding f8:bc:12:06:c9:70 vid=0
[ 633.461947] bonding: bond0: enslaving eno2 as an active interface with an up link.
[ 633.487139] bonding: bond0: Adding slave enp3s0f0.
[ 633.487145] i40e 0000:03:00.0 enp3s0f0: set new mac address f8:bc:12:06:c9:70
[ 633.488363] i40e 0000:01:00.1 eno2: NIC Link is Down
[ 633.507526] i40e 0000:03:00.0 enp3s0f0: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[ 633.507639] 8021q: adding VLAN 0 to HW filter on device enp3s0f0
[ 633.507641] i40e 0000:03:00.0 enp3s0f0: adding f8:bc:12:06:c9:70 vid=0
[ 633.507799] bonding: bond0: enslaving enp3s0f0 as an active interface with an up link.
[ 633.532945] bonding: bond0: Adding slave enp3s0f1.
[ 633.532950] i40e 0000:03:00.1 enp3s0f1: set new mac address f8:bc:12:06:c9:70
[ 633.534239] i40e 0000:03:00.0 enp3s0f0: NIC Link is Down
[ 633.562735] i40e 0000:03:00.1 enp3s0f1: NIC Link is Up 10 Gbps Full Duplex, Flow Control: None
[ 633.562886] 8021q: adding VLAN 0 to HW filter on device enp3s0f1
[ 633.562888] i40e 0000:03:00.1 enp3s0f1: adding f8:bc:12:06:c9:70 vid=0
[ 633.563047] bonding: bond0: enslaving enp3s0f1 as an active interface with an up link.
[ 633.576316] bonding: bond0: Removing slave eno1.
[ 633.576568] bonding: bond0: releasing active interface eno1
[ 633.576572] bonding: bond0: Warning: the permanent HWaddr of eno1 - f8:bc:12:06:c9:70 - is still in use by bond0. Set the HWaddr of eno1 to a different address to avoid conflicts.
[ 633.588549] i40e 0000:03:00.1 enp3s0f1: NIC Link is Down
[ 633.590506] i40e 0000:01:00.0 eno1: removing f8:bc:12:06:c9:70 vid=0
[ 633.590518] i40e 0000:01:00.0 eno1: already using mac address f8:bc:12:06:c9:70
[ 633.590675] bonding: bond0: Removing slave eno2.
[ 633.590901...
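
In the excerpt above, all four slaves are added to bond0 within about 150 ms and then eno1 and eno2 are removed again right away. To see this kind of churn at a glance, the add/remove events can be counted per interface. A minimal sketch that reads dmesg output on stdin and assumes the message wording shown above; the name `bond_slave_churn` is made up:

```shell
# Count "Adding slave" / "Removing slave" bonding events per interface.
# Reads kernel log lines on stdin and prints one sorted line per interface.
bond_slave_churn() {
    awk '
        /bonding: .*: (Adding|Removing) slave / {
            iface = $NF              # last field, e.g. "eno1."
            sub(/\.$/, "", iface)    # drop the trailing period
            if ($0 ~ /Adding slave/) added[iface]++
            else                     removed[iface]++
            seen[iface] = 1
        }
        END {
            for (i in seen)
                printf "%s added=%d removed=%d\n", i, added[i], removed[i]
        }
    ' | sort
}
```

It can be fed directly with `dmesg | bond_slave_churn`. An interface whose counts keep growing during deployment points to the bond being reassembled repeatedly.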

Changed in fuel:
milestone: 9.x-updates → 11.0
status: New → Confirmed
tags: added: area-library
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/427123

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/427123
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=6fde41c44af5ce785053fdf40eddad65a1a1bff4
Submitter: Jenkins
Branch: master

commit 6fde41c44af5ce785053fdf40eddad65a1a1bff4
Author: Sergey Vasilenko <email address hidden>
Date: Mon Jan 30 17:53:13 2017 +0300

    Prevent a bond for unrequired re-assembles

    * Assemble bond members under bond in UP state
    * FIX lost bridge parameter
    * FIX lost 'use_carrier' bond property
    * waiting for bond UP after slaves added
    * additional flush IP addresses from slave while bond assemble

    Change-Id: I568d852e65dc5d5c246e11deb0740f4a608f5ecc
    Closes-bug: #1658981
    Related-bug: #1657750
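
One item in the commit message is waiting for the bond to come UP after the slaves are added. The actual change lives in fuel-library and is not reproduced here; the following is only a shell sketch of the same idea, polling an operstate file such as /sys/class/net/bond0/operstate with a timeout. The function name and the 60-second default are assumptions.

```shell
# Poll an operstate file until it reads "up" or the timeout expires.
# $1: path to the operstate file, $2: timeout in seconds (default 60).
# Returns 0 when the interface is up, 1 on timeout.
wait_for_operstate_up() {
    wfo_file=$1
    wfo_left=${2:-60}
    while :; do
        # The kernel writes a single word such as "up" or "down".
        [ "$(cat "$wfo_file" 2>/dev/null)" = "up" ] && return 0
        [ "$wfo_left" -le 0 ] && return 1
        sleep 1
        wfo_left=$((wfo_left - 1))
    done
}
```

For example, `wait_for_operstate_up /sys/class/net/bond0/operstate 30 || echo "bond0 still down"` could be run right after the slaves are enslaved.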

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (stable/newton)

Related fix proposed to branch: stable/newton
Review: https://review.openstack.org/430173

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (stable/mitaka)

Related fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/430174

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/430174
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=a848384706cdf69e7241ef4f51c9e2534805f586
Submitter: Jenkins
Branch: stable/mitaka

commit a848384706cdf69e7241ef4f51c9e2534805f586
Author: Sergey Vasilenko <email address hidden>
Date: Mon Jan 30 17:53:13 2017 +0300

    Prevent a bond for unrequired re-assembles

    * Assemble bond members under bond in UP state
    * FIX lost bridge parameter
    * FIX lost 'use_carrier' bond property
    * waiting for bond UP after slaves added
    * additional flush IP addresses from slave while bond assemble

    Change-Id: I568d852e65dc5d5c246e11deb0740f4a608f5ecc
    Closes-bug: #1658981
    Related-bug: #1657750
    (cherry picked from commit 6fde41c44af5ce785053fdf40eddad65a1a1bff4)

tags: added: in-stable-mitaka
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (stable/newton)

Reviewed: https://review.openstack.org/430173
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=dedeaa789c78f8f4a9282e87210468a90eb09b37
Submitter: Jenkins
Branch: stable/newton

commit dedeaa789c78f8f4a9282e87210468a90eb09b37
Author: Sergey Vasilenko <email address hidden>
Date: Mon Jan 30 17:53:13 2017 +0300

    Prevent a bond for unrequired re-assembles

    * Assemble bond members under bond in UP state
    * FIX lost bridge parameter
    * FIX lost 'use_carrier' bond property
    * waiting for bond UP after slaves added
    * additional flush IP addresses from slave while bond assemble

    Change-Id: I568d852e65dc5d5c246e11deb0740f4a608f5ecc
    Closes-bug: #1658981
    Related-bug: #1657750
    (cherry picked from commit 6fde41c44af5ce785053fdf40eddad65a1a1bff4)

tags: added: in-stable-newton