Bonding error on Dell R220

Bug #1471647 reported by Big Switch Networks
This bug affects 1 person

Affects              Status    Importance  Assigned to       Milestone
Fuel for OpenStack   Invalid   High        Stanislav Makar
  6.1.x              Invalid   High        Stanislav Makar

Bug Description

We are using Dell R220 servers to deploy a 3-controller cluster. We put eth0 and eth1 into an active-active bond0, and put the management, storage, public, and tenant networks on that bond. However, the 3-controller cluster becomes unstable. What we notice in syslog is the following.
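
For reference, the bonding mode a node actually ended up in and the per-slave link state can be checked from the kernel's bonding driver (a minimal sketch; the bond0/eth0 names come from the description above, the node-13 hostname from the logs below):

root@node-13:~# cat /proc/net/bonding/bond0              # bonding mode, slaves, MII/link status
root@node-13:~# cat /sys/class/net/bond0/bonding/mode    # current mode as name and number
root@node-13:~# ethtool eth0                             # negotiated speed/duplex of one slave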

<3>Jul 2 23:19:37 node-13 kernel: [ 27.018459] i8042: No controller found
<3>Jul 2 23:19:37 node-13 kernel: [ 44.170300] bonding: bond0: unable to update mode because interface is up.
<3>Jul 2 23:19:37 node-13 kernel: [ 45.191011] bnx2x 0000:01:00.0 eth0: Warning: Unqualified SFP+ module detected, Port 0 from 3M part number 1410-P17-00-3.00
<3>Jul 2 23:19:37 node-13 kernel: [ 46.059381] bnx2x 0000:01:00.1 eth1: Warning: Unqualified SFP+ module detected, Port 0 from 3M part number 1410-P17-00-3.00
<3>Jul 2 23:19:37 node-13 kernel: [ 47.319893] bnx2x 0000:01:00.1 eth1: Warning: Unqualified SFP+ module detected, Port 0 from 3M part number 1410-P17-00-3.00
<3>Jul 2 23:19:37 node-13 kernel: [ 48.492446] bnx2x 0000:01:00.0 eth0: Warning: Unqualified SFP+ module detected, Port 0 from 3M part number 1410-P17-00-3.00
<11>Jul 2 23:19:55 node-13 openhpid: ERROR: (init.c, 76, OpenHPI is not configured. See openhpi.conf file.)
<11>Jul 2 23:19:55 node-13 openhpid: ERROR: (openhpid.cpp, 270, There was an error initializing OpenHPI)
<27>Jul 2 23:19:57 node-13 ntpdate[3504]: Can't find host 0.pool.ntp.org: Name or service not known (-2)
<27>Jul 2 23:19:57 node-13 ntpdate[3504]: the NTP socket is in use, exiting
<27>Jul 2 23:20:44 node-13 ns_conntrackd(p_conntrackd)[7969]: ERROR: Device "conntrd" does not exist.
<129>Jul 2 23:20:56 node-13 haproxy[9248]: Server horizon/node-13 is DOWN, reason: Layer4 connection problem, info: "General socket error (Network is unreachable)", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Jul 2 23:20:56 node-13 haproxy[9248]: Server horizon/node-16 is DOWN, reason: Layer4 connection problem, info: "General socket error (Network is unreachable)", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Jul 2 23:20:56 node-13 haproxy[9248]: Server horizon/node-17 is DOWN, reason: Layer4 connection problem, info: "General socket error (Network is unreachable)", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<128>Jul 2 23:20:56 node-13 haproxy[9248]: proxy horizon has no server available!
<129>Jul 2 23:20:56 node-13 haproxy[9248]: Server keystone-1/node-13 is DOWN, reason: Layer4 connection problem, info: "General socket error (Network is unreachable)", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Jul 2 23:20:56 node-13 haproxy[9248]: Server keystone-1/node-16 is DOWN, reason: Layer4 connection problem, info: "General socket error (Network is unreachable)", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Jul 2 23:20:56 node-13 haproxy[9248]: Server keystone-1/node-17 is DOWN, reason: Layer4 connection problem, info: "General socket error (Network is unreachable)", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<128>Jul 2 23:20:56 node-13 haproxy[9248]: proxy keystone-1 has no server available!
<129>Jul 2 23:20:57 node-13 haproxy[9272]: Server neutron/node-13 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Jul 2 23:20:57 node-13 haproxy[9272]: Server neutron/node-16 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Jul 2 23:20:57 node-13 haproxy[9272]: Server neutron/node-17 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<128>Jul 2 23:20:57 node-13 haproxy[9272]: proxy neutron has no server available!
<129>Jul 2 23:20:57 node-13 haproxy[9272]: Server mysqld/node-13 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 4ms. 0 active and 2 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
<27>Jul 2 23:21:26 node-13 mysql-wss(p_mysql)[10456]: ERROR: MySQL is not running
<27>Jul 3 01:04:38 node-13 ns_IPaddr2(vip__management)[25441]: ERROR: Device "br-mgmt-hapr" does not exist.
<129>Jul 3 01:04:39 node-13 haproxy[25559]: Server neutron/node-13 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Jul 3 01:04:40 node-13 haproxy[25559]: Server mysqld/node-13 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 7ms. 0 active and 2 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
<27>Jul 3 01:04:40 node-13 ns_IPaddr2(vip__management)[25441]: ERROR: Could not send gratuitous arps
<129>Jul 3 01:04:40 node-13 haproxy[25559]: Backup Server mysqld/node-16 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 7ms. 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Jul 3 01:04:41 node-13 haproxy[25559]: Backup Server mysqld/node-17 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 884ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<128>Jul 3 01:04:41 node-13 haproxy[25559]: proxy mysqld has no server available!

Revision history for this message
Big Switch Networks (fuel-bugs-internal) wrote :
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
Changed in fuel:
milestone: none → 7.0
importance: Undecided → High
status: New → Confirmed
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Stanislav Makar (smakar)
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Please provide more info: Fuel version, deployment settings, and a diagnostic snapshot.

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Big Switch Networks (fuel-bugs-internal) wrote :

The Fuel version is 6.1 GA.
The astute.yaml files for the 3 controller nodes are in the attached syslog.tar.gz.
The syslogs from the three controller nodes are in the attached syslog.tar.gz as well.

Revision history for this message
Stanislav Makar (smakar) wrote :

The syslog and astute.yaml are fine, but they are not enough.
We need at least the puppet logs from those nodes; that is why we asked for a diagnostic snapshot.
The best option would be to have access to such a cluster.

Revision history for this message
Big Switch Networks (fuel-bugs-internal) wrote :
Revision history for this message
Big Switch Networks (fuel-bugs-internal) wrote :

The puppet logs of all 3 nodes are added in the attachment.

Revision history for this message
Pavel Boldin (pboldin) wrote :

Please run the network connectivity test.

Revision history for this message
Big Switch Networks (fuel-bugs-internal) wrote :

We'd like to provide a diagnostic snapshot as well, but how do we do that?

Revision history for this message
Big Switch Networks (fuel-bugs-internal) wrote :

At this point, network connectivity is fine among the three nodes (I did a full-mesh ping among all three nodes, but am only pasting the result from one node). However, crm is complaining about a partition.

root@node-13:~# ifconfig
bond0 Link encap:Ethernet HWaddr 00:0e:1e:8e:97:80
          inet6 addr: fe80::20e:1eff:fe8e:9780/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
          RX packets:831521825 errors:428935533 dropped:283912149 overruns:428935533 frame:0
          TX packets:56149307 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:100303558594 (100.3 GB) TX bytes:14138710825 (14.1 GB)

bond0.2 Link encap:Ethernet HWaddr 00:0e:1e:8e:97:80
          inet6 addr: fe80::20e:1eff:fe8e:9780/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:120458 errors:0 dropped:0 overruns:0 frame:0
          TX packets:120217 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:12045960 (12.0 MB) TX bytes:13730236 (13.7 MB)

bond0.3 Link encap:Ethernet HWaddr 00:0e:1e:8e:97:80
          inet6 addr: fe80::20e:1eff:fe8e:9780/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:546816276 errors:0 dropped:0 overruns:0 frame:0
          TX packets:55534734 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:36741895112 (36.7 GB) TX bytes:13643036112 (13.6 GB)

bond0.4001 Link encap:Ethernet HWaddr 00:0e:1e:8e:97:80
          inet6 addr: fe80::20e:1eff:fe8e:9780/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:8665 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8809 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:3351156 (3.3 MB) TX bytes:1060729 (1.0 MB)

br-aux Link encap:Ethernet HWaddr 00:0e:1e:8e:97:80
          inet6 addr: fe80::20e:1eff:fe8e:9780/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:766442 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:35256440 (35.2 MB) TX bytes:648 (648.0 B)

br-ex Link encap:Ethernet HWaddr 00:0e:1e:8e:97:80
          inet addr:10.9.28.12 Bcast:10.9.29.255 Mask:255.255.254.0
          inet6 addr: fe80::20e:1eff:fe8e:9780/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:35575 errors:0 dropped:0 overruns:0 frame:0
          TX packets:35636 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:5540684 (5.5 MB) TX bytes:3625067 (3.6 MB)

br-floating Link encap:Ethernet HWaddr fa:2c:d6:ba:75:47
          inet6 addr: fe80::f82c:d6ff:feba:7547/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:219 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX byte...

Revision history for this message
Pavel Boldin (pboldin) wrote :

Please be sure you have checked the network connectivity:
1. Using the tools provided by the Fuel UI.
2. On all the interfaces when you do it by hand.

Revision history for this message
Stanislav Makar (smakar) wrote :

According to the info you provided, you do not have problems with networking:
crm status
Last updated: Mon Jul 6 14:57:46 2015
Last change: Mon Jul 6 04:16:17 2015
Stack: corosync
Current DC: node-16.domain.tld (16) - partition with quorum
Version: 1.1.12-561c4cf
3 Nodes configured
34 Resources configured

Online: [ node-13.domain.tld node-16.domain.tld node-17.domain.tld ]

The above means that all is OK and traffic goes via the bond. The Pacemaker cluster uses the management network, which is on bond0, VLAN 3.
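
For reference, the management VLAN interface and the corosync/Pacemaker membership it carries can be double-checked directly on a controller (a minimal sketch using standard tools; bond0.3 is the VLAN interface mentioned above):

root@node-13:~# ip -d link show bond0.3    # shows the VLAN ID and the parent bond
root@node-13:~# corosync-cfgtool -s        # corosync ring status for this node
root@node-13:~# crm_mon -1                 # one-shot Pacemaker cluster status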

One thing I see here is that the connectivity_tests.pp task was not applied the first time,
but then it looks like there was a second run (re-deploy), which succeeded.

I also see that the post-deployment tasks have not been applied to any controller, which means that the deployment failed and that is why we see partitioning.

Now only the logs from the /var/log/docker-logs/ directory on the master node can help us.

About the diagnostic snapshot - the easiest way to generate it is to run the command:
[root@nailgun ~]# fuel snapshot
Generating dump...
Downloading: http://127.0.0.1:8000/dump/fuel-snapshot-2015-07-07_11-18-08.tar.xz Bytes: 135969364
[==============================================================================]
[root@nailgun ~]# ls -la fuel-snapshot-2015-07-07_11-18-08.tar.xz
-rw-r--r-- 1 root root 135969364 Jul 7 11:20 fuel-snapshot-2015-07-07_11-18-08.tar.xz

and upload it to us.

It is also available via the Fuel Web UI on the Support tab: Download Diagnostic Snapshot.

If you are experiencing problems with generating the diagnostic snapshot, please provide the logs from the /var/log/docker-logs/ directory on the master node.

Thanks.

Revision history for this message
Big Switch Networks (fuel-bugs-internal) wrote :

We tried to get the diagnostic snapshot in the GUI, but didn't succeed. The "Generating Logs Snapshot" button got grayed out and never came back. (We have 100 nodes in the setup.)
We also tried to get the docker-logs. However, the logs in the remote directory are so large that even when we included only the remote logs for the controllers, the tarball was 280 MB. We failed to upload it many times.

In addition, we notice the following pattern:
1. If we configure round-robin bonding mode on two nodes via the Fuel GUI and start a ping between the two nodes via their bonds, the ping will randomly fail. We believe this is why the MySQL and RabbitMQ clusters get into trouble. We still need to verify who is dropping the packets: the network or the node. This will take some time.
2. If we configure LACP bonding mode on two nodes via the Fuel GUI, the ping between the two nodes becomes stable, and MySQL and RabbitMQ seem fine (see the bonding-mode sketch below).
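
For reference, the two modes above correspond to the Linux bonding driver's balance-rr (mode 0) and 802.3ad/LACP (mode 4). Fuel writes the node network configuration itself during deployment; the ifupdown-style sketch below is only illustrative of what an LACP bond looks like (the file layout and option values are assumptions, not taken from this deployment):

# illustrative /etc/network/interfaces stanza for an LACP bond
auto bond0
iface bond0 inet manual
    bond-slaves eth0 eth1
    bond-mode 802.3ad               # LACP; round-robin would be balance-rr (mode 0)
    bond-miimon 100                 # link monitoring interval, ms
    bond-lacp-rate fast             # request LACPDUs every second
    bond-xmit-hash-policy layer3+4  # spread flows across slaves by L3/L4 headers

Note that 802.3ad also requires a matching LACP port-channel on the switch side, while balance-rr without corresponding switch-side aggregation is known to cause out-of-order delivery and MAC flapping, which would be consistent with the random ping loss described above.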

Revision history for this message
Big Switch Networks (fuel-bugs-internal) wrote :

This only includes the puppet logs for the controller nodes.

Revision history for this message
Stanislav Makar (smakar) wrote :

I see that docker-logs.tar.gz has no logs for 2015-07-01 - 2015-07-03, which the syslog, puppet log, and astute.yaml cover.
It has logs for 2015-07-06T18:41:07 - 2015-07-08T06:19:38.

What I have found in docker-logs.tar.gz

 grep "Deployment of environment" -r .
./nailgun/receiverd.log:2015-07-06 20:44:21.744 INFO [7f1aef996700] (notification) Notification: topic: done message: Deployment of environment 'T6-Scale' is done. Access the OpenStack dashboard (Horizon) at http://10.9.28.10/
./supervisor/supervisord.log-20150705:2015-07-01 21:28:40.299 INFO [7fc32d7f1700] (notification) Notification: topic: done message: Deployment of environment 'T6-Scale' is done. Access the OpenStack dashboard (Horizon) at http://10.9.28.10/
./supervisor/supervisord.log-20150705:2015-07-02 00:17:09.207 INFO [7fc32d7f1700] (notification) Notification: topic: done message: Deployment of environment 'T6-Scale' is done. Access the OpenStack dashboard (Horizon) at http://10.9.28.10/
./supervisor/supervisord.log-20150705:2015-07-02 05:48:20.953 INFO [7fc32d7f1700] (notification) Notification: topic: done message: Deployment of environment 'T6-Scale' is done. Access the OpenStack dashboard (Horizon) at http://10.9.28.10/
./supervisor/supervisord.log-20150705:2015-07-02 06:11:34.967 INFO [7fc32d7f1700] (notification) Notification: topic: done message: Deployment of environment 'T6-Scale' is done. Access the OpenStack dashboard (Horizon) at http://10.9.28.10/
./supervisor/supervisord.log-20150705:2015-07-02 06:33:47.051 INFO [7fc32d7f1700] (notification) Notification: topic: done message: Deployment of environment 'T6-Scale' is done. Access the OpenStack dashboard (Horizon) at http://10.9.28.10/
./supervisor/supervisord.log-20150705:2015-07-02 15:30:38.239 INFO [7fc32d7f1700] (notification) Notification: topic: done message: Deployment of environment 'T6-Scale' is done. Access the OpenStack dashboard (Horizon) at http://10.9.28.10/
./supervisor/supervisord.log-20150705:2015-07-02 16:33:22.053 INFO [7fc32d7f1700] (notification) Notification: topic: done message: Deployment of environment 'T6-Scale' is done. Access the OpenStack dashboard (Horizon) at http://10.9.28.10/
./supervisor/supervisord.log-20150705:2015-07-02 18:15:27.823 INFO [7fc32d7f1700] (notification) Notification: topic: done message: Deployment of environment 'T6-Scale' is done. Access the OpenStack dashboard (Horizon) at http://10.9.28.10/
./supervisor/supervisord.log-20150705:2015-07-02 18:47:41.069 INFO [7fc32d7f1700] (notification) Notification: topic: done message: Deployment of environment 'T6-Scale' is done. Access the OpenStack dashboard (Horizon) at http://10.9.28.10/
./supervisor/supervisord.log-20150705:2015-07-02 19:31:30.298 INFO [7fc32d7f1700] (notification) Notification: topic: done message: Deployment of environment 'T6-Scale' is done. Access the OpenStack dashboard (Horizon) at http://10.9.28.10/
./supervisor/supervisord.log-20150705:2015-07-03 04:57:37.592 INFO [7fc32d7f1700] (notification) Notification: topic: done message: Deployment of environment 'T6-Scale' is done. Access the OpenStack dashboard (Horizon) at http://10.9.28.10/
./supervisor/supervisord.log-20150705:2015-07-03 05:15:42.186 INFO [7f...


Revision history for this message
Big Switch Networks (fuel-bugs-internal) wrote :

We may have lost the logs for the failed case. After we found out that LACP bonding works, we started to always use an LACP bond to move forward. We'll provide the same set of logs the next time this problem happens.

Revision history for this message
Big Switch Networks (fuel-bugs-internal) wrote :

After upgrading the switches and the switch SDN controller to a newer version, we could not reproduce this problem. It might be a problem on the network side. We need more time to investigate. Thanks a lot for all the help!

Revision history for this message
Roman Prykhodchenko (romcheg) wrote :

Since the problem is not reproducible anymore and it has been more than 3 weeks since it was set to Incomplete, I am setting it to Invalid. Please re-open if the problem occurs again.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

There were no updates and the bug remained Incomplete for a month. Moving to Invalid.

Changed in fuel:
status: Incomplete → Invalid