Bonding error on Dell R220

Bug #1471647 reported by Big Switch Networks
This bug affects 1 person

Affects              Status    Importance  Assigned to       Milestone
Fuel for OpenStack   Invalid   High        Stanislav Makar
  6.1.x              Invalid   High        Stanislav Makar

Bug Description

We are using Dell R220 servers to deploy a 3-controller cluster. We put eth0 and eth1 into an active-active bond0, and put the management, storage, public, and tenant networks on that bond. However, the 3-controller cluster becomes unstable. What we notice in syslog is the following.
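
For reference, the bonding mode a node actually ended up in and the per-slave link state can be checked from the kernel's bonding driver (a minimal sketch; the bond0/eth0 names come from the description above, the node-13 hostname from the logs below):

root@node-13:~# cat /proc/net/bonding/bond0              # bonding mode, slaves, MII/link status
root@node-13:~# cat /sys/class/net/bond0/bonding/mode    # current mode as name and number
root@node-13:~# ethtool eth0                             # negotiated speed/duplex of one slave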

<3>Jul 2 23:19:37 node-13 kernel: [ 27.018459] i8042: No controller found
<3>Jul 2 23:19:37 node-13 kernel: [ 44.170300] bonding: bond0: unable to update mode because interface is up.
<3>Jul 2 23:19:37 node-13 kernel: [ 45.191011] bnx2x 0000:01:00.0 eth0: Warning: Unqualified SFP+ module detected, Port 0 from 3M part number 1410-P17-00-3.00
<3>Jul 2 23:19:37 node-13 kernel: [ 46.059381] bnx2x 0000:01:00.1 eth1: Warning: Unqualified SFP+ module detected, Port 0 from 3M part number 1410-P17-00-3.00
<3>Jul 2 23:19:37 node-13 kernel: [ 47.319893] bnx2x 0000:01:00.1 eth1: Warning: Unqualified SFP+ module detected, Port 0 from 3M part number 1410-P17-00-3.00
<3>Jul 2 23:19:37 node-13 kernel: [ 48.492446] bnx2x 0000:01:00.0 eth0: Warning: Unqualified SFP+ module detected, Port 0 from 3M part number 1410-P17-00-3.00
<11>Jul 2 23:19:55 node-13 openhpid: ERROR: (init.c, 76, OpenHPI is not configured. See openhpi.conf file.)
<11>Jul 2 23:19:55 node-13 openhpid: ERROR: (openhpid.cpp, 270, There was an error initializing OpenHPI)
<27>Jul 2 23:19:57 node-13 ntpdate[3504]: Can't find host 0.pool.ntp.org: Name or service not known (-2)
<27>Jul 2 23:19:57 node-13 ntpdate[3504]: the NTP socket is in use, exiting
<27>Jul 2 23:20:44 node-13 ns_conntrackd(p_conntrackd)[7969]: ERROR: Device "conntrd" does not exist.
<129>Jul 2 23:20:56 node-13 haproxy[9248]: Server horizon/node-13 is DOWN, reason: Layer4 connection problem, info: "General socket error (Network is unreachable)", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Jul 2 23:20:56 node-13 haproxy[9248]: Server horizon/node-16 is DOWN, reason: Layer4 connection problem, info: "General socket error (Network is unreachable)", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Jul 2 23:20:56 node-13 haproxy[9248]: Server horizon/node-17 is DOWN, reason: Layer4 connection problem, info: "General socket error (Network is unreachable)", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<128>Jul 2 23:20:56 node-13 haproxy[9248]: proxy horizon has no server available!
<129>Jul 2 23:20:56 node-13 haproxy[9248]: Server keystone-1/node-13 is DOWN, reason: Layer4 connection problem, info: "General socket error (Network is unreachable)", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Jul 2 23:20:56 node-13 haproxy[9248]: Server keystone-1/node-16 is DOWN, reason: Layer4 connection problem, info: "General socket error (Network is unreachable)", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Jul 2 23:20:56 node-13 haproxy[9248]: Server keystone-1/node-17 is DOWN, reason: Layer4 connection problem, info: "General socket error (Network is unreachable)", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<128>Jul 2 23:20:56 node-13 haproxy[9248]: proxy keystone-1 has no server available!
<129>Jul 2 23:20:57 node-13 haproxy[9272]: Server neutron/node-13 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Jul 2 23:20:57 node-13 haproxy[9272]: Server neutron/node-16 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Jul 2 23:20:57 node-13 haproxy[9272]: Server neutron/node-17 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<128>Jul 2 23:20:57 node-13 haproxy[9272]: proxy neutron has no server available!
<129>Jul 2 23:20:57 node-13 haproxy[9272]: Server mysqld/node-13 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 4ms. 0 active and 2 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
<27>Jul 2 23:21:26 node-13 mysql-wss(p_mysql)[10456]: ERROR: MySQL is not running
<27>Jul 3 01:04:38 node-13 ns_IPaddr2(vip__management)[25441]: ERROR: Device "br-mgmt-hapr" does not exist.
<129>Jul 3 01:04:39 node-13 haproxy[25559]: Server neutron/node-13 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Jul 3 01:04:40 node-13 haproxy[25559]: Server mysqld/node-13 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 7ms. 0 active and 2 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
<27>Jul 3 01:04:40 node-13 ns_IPaddr2(vip__management)[25441]: ERROR: Could not send gratuitous arps
<129>Jul 3 01:04:40 node-13 haproxy[25559]: Backup Server mysqld/node-16 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 7ms. 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Jul 3 01:04:41 node-13 haproxy[25559]: Backup Server mysqld/node-17 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 884ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<128>Jul 3 01:04:41 node-13 haproxy[25559]: proxy mysqld has no server available!

Revision history for this message
Big Switch Networks (fuel-bugs-internal) wrote :
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
Changed in fuel:
milestone: none → 7.0
importance: Undecided → High
status: New → Confirmed
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Stanislav Makar (smakar)
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Please provide more info: Fuel version, deployment settings, and a diagnostic snapshot.

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Big Switch Networks (fuel-bugs-internal) wrote :

The Fuel version is 6.1 GA.
The astute.yaml files for the 3 controller nodes are in the attached syslog.tar.gz.
The syslogs from the three controller nodes are in the attached syslog.tar.gz as well.

Revision history for this message
Stanislav Makar (smakar) wrote :

The syslog and astute.yaml are fine, but they are not enough.
We need at least the puppet logs from those nodes; that is why we asked for a diagnostic snapshot.
The best option would be to have access to such a cluster.

Revision history for this message
Big Switch Networks (fuel-bugs-internal) wrote :
Revision history for this message
Big Switch Networks (fuel-bugs-internal) wrote :

The puppet logs of all 3 nodes are added in the attachment.

Revision history for this message
Pavel Boldin (pboldin) wrote :

Please run the network connectivity test.

Revision history for this message
Big Switch Networks (fuel-bugs-internal) wrote :

We'd like to provide a diagnostic snapshot as well, but how do we do that?

Revision history for this message
Big Switch Networks (fuel-bugs-internal) wrote :

At this point, network connectivity is fine among the three nodes (I did a full-mesh ping among all three nodes, but am only pasting the result from one node). However, crm is complaining about a partition.

root@node-13:~# ifconfig
bond0 Link encap:Ethernet HWaddr 00:0e:1e:8e:97:80
          inet6 addr: fe80::20e:1eff:fe8e:9780/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
          RX packets:831521825 errors:428935533 dropped:283912149 overruns:428935533 frame:0
          TX packets:56149307 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:100303558594 (100.3 GB) TX bytes:14138710825 (14.1 GB)

bond0.2 Link encap:Ethernet HWaddr 00:0e:1e:8e:97:80
          inet6 addr: fe80::20e:1eff:fe8e:9780/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:120458 errors:0 dropped:0 overruns:0 frame:0
          TX packets:120217 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:12045960 (12.0 MB) TX bytes:13730236 (13.7 MB)

bond0.3 Link encap:Ethernet HWaddr 00:0e:1e:8e:97:80
          inet6 addr: fe80::20e:1eff:fe8e:9780/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:546816276 errors:0 dropped:0 overruns:0 frame:0
          TX packets:55534734 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:36741895112 (36.7 GB) TX bytes:13643036112 (13.6 GB)

bond0.4001 Link encap:Ethernet HWaddr 00:0e:1e:8e:97:80
          inet6 addr: fe80::20e:1eff:fe8e:9780/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:8665 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8809 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:3351156 (3.3 MB) TX bytes:1060729 (1.0 MB)

br-aux Link encap:Ethernet HWaddr 00:0e:1e:8e:97:80
          inet6 addr: fe80::20e:1eff:fe8e:9780/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:766442 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:35256440 (35.2 MB) TX bytes:648 (648.0 B)

br-ex Link encap:Ethernet HWaddr 00:0e:1e:8e:97:80
          inet addr:10.9.28.12 Bcast:10.9.29.255 Mask:255.255.254.0
          inet6 addr: fe80::20e:1eff:fe8e:9780/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:35575 errors:0 dropped:0 overruns:0 frame:0
          TX packets:35636 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:5540684 (5.5 MB) TX bytes:3625067 (3.6 MB)

br-floating Link encap:Ethernet HWaddr fa:2c:d6:ba:75:47
          inet6 addr: fe80::f82c:d6ff:feba:7547/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:219 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX byte...

Revision history for this message
Pavel Boldin (pboldin) wrote :

Please be sure you have checked the network connectivity:
1. Using the tools provided by the Fuel UI.
2. On all the interfaces when you do it by hand.

Revision history for this message
Stanislav Makar (smakar) wrote :

According to the info you provided, you do not have problems with networking:
crm status
Last updated: Mon Jul 6 14:57:46 2015
Last change: Mon Jul 6 04:16:17 2015
Stack: corosync
Current DC: node-16.domain.tld (16) - partition with quorum
Version: 1.1.12-561c4cf
3 Nodes configured
34 Resources configured

Online: [ node-13.domain.tld node-16.domain.tld node-17.domain.tld ]

The above means that all is OK and traffic goes via the bond. The Pacemaker cluster uses the management network, which is on bond0, VLAN 3.
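
For reference, the management VLAN interface and the corosync/Pacemaker membership it carries can be double-checked directly on a controller (a minimal sketch using standard tools; bond0.3 is the VLAN interface mentioned above):

root@node-13:~# ip -d link show bond0.3    # shows the VLAN ID and the parent bond
root@node-13:~# corosync-cfgtool -s        # corosync ring status for this node
root@node-13:~# crm_mon -1                 # one-shot Pacemaker cluster status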

One thing I see here is that the connectivity_tests.pp task was not applied the first time,
but then it looks like there was a second run (re-deploy), which succeeded.

I also see that the post-deployment tasks have not been applied to any controller, which means that the deployment failed and that is why we see partitioning.

Now only the logs from the /var/log/docker-logs/ directory on the master node can help us.

About the diagnostic snapshot - the easiest way to generate it is to run the command:
[root@nailgun ~]# fuel snapshot
Generating dump...
Downloading: http://127.0.0.1:8000/dump/fuel-snapshot-2015-07-07_11-18-08.tar.xz Bytes: 135969364
[==============================================================================]
[root@nailgun ~]# ls -la fuel-snapshot-2015-07-07_11-18-08.tar.xz
-rw-r--r-- 1 root root 135969364 Jul 7 11:20 fuel-snapshot-2015-07-07_11-18-08.tar.xz

and upload it to us.

It is also available via the Fuel Web UI on the Support tab: Download Diagnostic Snapshot.

If you are experiencing problems with generating the diagnostic snapshot, please provide the logs from the /var/log/docker-logs/ directory on the master node.

Thanks.

Revision history for this message
Big Switch Networks (fuel-bugs-internal) wrote :

We tried to get the diagnostic snapshot in the GUI, but didn't succeed. The "Generating Logs Snapshot" button got grayed out and never came back. (We have 100 nodes in the setup.)
We also tried to get the docker-logs. However, the logs in the remote directory are so large that even when we included only the remote logs for the controllers, the tarball was 280 MB. We failed to upload it many times.

In addition, we notice the following pattern:
1. If we configure round-robin bonding mode on two nodes via the Fuel GUI and start a ping between the two nodes via their bonds, the ping will randomly fail. We believe this is why the MySQL and RabbitMQ clusters get into trouble. We still need to verify who is dropping the packets: the network or the node. This will take some time.
2. If we configure LACP bonding mode on two nodes via the Fuel GUI, the ping between the two nodes becomes stable, and MySQL and RabbitMQ seem fine (see the bonding-mode sketch below).
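
For reference, the two modes above correspond to the Linux bonding driver's balance-rr (mode 0) and 802.3ad/LACP (mode 4). Fuel writes the node network configuration itself during deployment; the ifupdown-style sketch below is only illustrative of what an LACP bond looks like (the file layout and option values are assumptions, not taken from this deployment):

# illustrative /etc/network/interfaces stanza for an LACP bond
auto bond0
iface bond0 inet manual
    bond-slaves eth0 eth1
    bond-mode 802.3ad               # LACP; round-robin would be balance-rr (mode 0)
    bond-miimon 100                 # link monitoring interval, ms
    bond-lacp-rate fast             # request LACPDUs every second
    bond-xmit-hash-policy layer3+4  # spread flows across slaves by L3/L4 headers

Note that 802.3ad also requires a matching LACP port-channel on the switch side, while balance-rr without corresponding switch-side aggregation is known to cause out-of-order delivery and MAC flapping, which would be consistent with the random ping loss described above.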

Revision history for this message
Big Switch Networks (fuel-bugs-internal) wrote :

This only includes the puppet logs for the controller nodes.

Revision history for this message
Stanislav Makar (smakar) wrote :

I see that docker-logs.tar.gz has no logs for 2015-07-01 - 2015-07-03, which the syslog, puppet log, and astute.yaml cover.
It has logs for 2015-07-06T18:41:07 - 2015-07-08T06:19:38.

What I have found in docker-logs.tar.gz

 grep "Deployment of environment" -r .
./nailgun/receiverd.log:2015-07-06 20:44:21.744 INFO [7f1aef996700] (notification) Notification: topic: done message: Deployment of environment 'T6-Scale' is done. Access the OpenStack dashboard (Horizon) at http://10.9.28.10/
./supervisor/supervisord.log-20150705:2015-07-01 21:28:40.299 INFO [7fc32d7f1700] (notification) Notification: topic: done message: Deployment of environment 'T6-Scale' is done. Access the OpenStack dashboard (Horizon) at http://10.9.28.10/
./supervisor/supervisord.log-20150705:2015-07-02 00:17:09.207 INFO [7fc32d7f1700] (notification) Notification: topic: done message: Deployment of environment 'T6-Scale' is done. Access the OpenStack dashboard (Horizon) at http://10.9.28.10/
./supervisor/supervisord.log-20150705:2015-07-02 05:48:20.953 INFO [7fc32d7f1700] (notification) Notification: topic: done message: Deployment of environment 'T6-Scale' is done. Access the OpenStack dashboard (Horizon) at http://10.9.28.10/
./supervisor/supervisord.log-20150705:2015-07-02 06:11:34.967 INFO [7fc32d7f1700] (notification) Notification: topic: done message: Deployment of environment 'T6-Scale' is done. Access the OpenStack dashboard (Horizon) at http://10.9.28.10/
./supervisor/supervisord.log-20150705:2015-07-02 06:33:47.051 INFO [7fc32d7f1700] (notification) Notification: topic: done message: Deployment of environment 'T6-Scale' is done. Access the OpenStack dashboard (Horizon) at http://10.9.28.10/
./supervisor/supervisord.log-20150705:2015-07-02 15:30:38.239 INFO [7fc32d7f1700] (notification) Notification: topic: done message: Deployment of environment 'T6-Scale' is done. Access the OpenStack dashboard (Horizon) at http://10.9.28.10/
./supervisor/supervisord.log-20150705:2015-07-02 16:33:22.053 INFO [7fc32d7f1700] (notification) Notification: topic: done message: Deployment of environment 'T6-Scale' is done. Access the OpenStack dashboard (Horizon) at http://10.9.28.10/
./supervisor/supervisord.log-20150705:2015-07-02 18:15:27.823 INFO [7fc32d7f1700] (notification) Notification: topic: done message: Deployment of environment 'T6-Scale' is done. Access the OpenStack dashboard (Horizon) at http://10.9.28.10/
./supervisor/supervisord.log-20150705:2015-07-02 18:47:41.069 INFO [7fc32d7f1700] (notification) Notification: topic: done message: Deployment of environment 'T6-Scale' is done. Access the OpenStack dashboard (Horizon) at http://10.9.28.10/
./supervisor/supervisord.log-20150705:2015-07-02 19:31:30.298 INFO [7fc32d7f1700] (notification) Notification: topic: done message: Deployment of environment 'T6-Scale' is done. Access the OpenStack dashboard (Horizon) at http://10.9.28.10/
./supervisor/supervisord.log-20150705:2015-07-03 04:57:37.592 INFO [7fc32d7f1700] (notification) Notification: topic: done message: Deployment of environment 'T6-Scale' is done. Access the OpenStack dashboard (Horizon) at http://10.9.28.10/
./supervisor/supervisord.log-20150705:2015-07-03 05:15:42.186 INFO [7f...


Revision history for this message
Big Switch Networks (fuel-bugs-internal) wrote :

We may have lost the logs for the failed case. After we found out that LACP bonding works, we started to always use an LACP bond to move forward. We'll provide the same set of logs the next time this problem happens.

Revision history for this message
Big Switch Networks (fuel-bugs-internal) wrote :

After upgrading the switches and the switch SDN controller to a newer version, we could not reproduce this problem. It might be a problem on the network side. We need more time to investigate. Thanks a lot for all the help!

Revision history for this message
Roman Prykhodchenko (romcheg) wrote :

Since the problem is not reproducible anymore and it has been more than 3 weeks since it was set to Incomplete, I am setting it to Invalid. Please re-open if the problem occurs again.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

There were no updates and the bug remained Incomplete for a month. Moving to Invalid.

Changed in fuel:
status: Incomplete → Invalid