3.1.3.0-72: BUM Tree corrupted after clean installation
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Juniper Openstack | Status tracked in Trunk | | |
R3.1 | Fix Committed | Critical | Manish Singh |
R3.2 | Fix Committed | Critical | Manish Singh |
R4.0 | Fix Committed | Critical | Manish Singh |
Trunk | Fix Committed | Critical | Manish Singh |
Bug Description
Issues with a QFX missing from the BUM tree were reported before. Manish investigated and traced them to a corrupted pointer in the tor-agent core collected from the problematic setup. At the time we suspected an issue with the ISSU upgrade (the ISSU procedure had been used to upgrade Contrail from 2.21.2 to 3.1.3.0-72).
The new occurrence is reported on a clean installation of the 3.1.3.0-72 build. Here is JTAC's analysis:
VNIs being tested:
1. 4338
2. 3559
IP addresses are as follows:
(TSN) openc-34 172.23.10.201
(TSN) openc-35 172.23.10.202
(QFX6) 172.23.11.48
(QFX23) 172.23.11.49
QFX6 is being served by contrail-
QFX23 is being served by contrail-
Please see below:
root@openc-34:~# contrail-status | grep QFX
contrail-
contrail-
contrail-
contrail-
root@openc-34:~#
root@openc-35:~# contrail-status | grep QFX
contrail-
contrail-
contrail-
contrail-
Test Case 1: (VNI 4338)
============
(30:06:23:00:03:42) => [QFX6 ae7.2834] => [openc-34] => [openc-35] => [QFX23 ae1.3834] => (30:23:06:00:03:42)
Result: On openc-34, QFX6 is missing && openc-35 is present.
On openc-35, QFX23 is missing && openc-34 is present.
<<< BUM traffic broken completely >>>
Test Case 2: (VNI 3559)
============
(30:06:23:00:00:37) => [QFX6 ae7.2055] => [openc-34] => [openc-35] => [QFX23 ae1.3055] => (30:23:06:00:00:37)
Result: On openc-34, QFX6 is missing && openc-35 present.
On openc-35, QFX23 is present && openc-34 is also present.
<<< BUM Traffic one way is blocked which is openc-35 ==> openc-34 >>>
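The two test cases show an asymmetric failure: in VNI 4338 both TSNs have an incomplete view of the replication tree, while in VNI 3559 only openc-34's view is incomplete. This can be reasoned about by modeling each node's view of the BUM tree as its set of replication peers. The sketch below is purely illustrative (the function and data-structure names are assumptions, not the tor-agent's actual code):

```python
# Illustrative model only: a node's "view" of the BUM tree is the set of
# peers it replicates broadcast traffic to. A direction src -> dst is
# broken when dst is missing from src's view.

def broken_directions(views, expected_peers):
    """Return (src, missing_peer) pairs where the peer is absent from src's view."""
    broken = []
    for node, peers in views.items():
        for peer in expected_peers[node]:
            if peer not in peers:
                broken.append((node, peer))
    return broken

# Expected membership, mirroring the test topology above.
expected = {
    "openc-34": ["QFX6", "openc-35"],
    "openc-35": ["QFX23", "openc-34"],
}

# Test Case 1 (VNI 4338): both QFX entries missing -> both directions broken.
views_4338 = {
    "openc-34": {"openc-35"},           # QFX6 missing
    "openc-35": {"openc-34"},           # QFX23 missing
}
print(broken_directions(views_4338, expected))
# [('openc-34', 'QFX6'), ('openc-35', 'QFX23')]

# Test Case 2 (VNI 3559): only openc-34's view is incomplete.
views_3559 = {
    "openc-34": {"openc-35"},           # QFX6 missing
    "openc-35": {"QFX23", "openc-34"},  # complete
}
print(broken_directions(views_3559, expected))
# [('openc-34', 'QFX6')]
```

This matches the observed symptoms: a fully broken VNI when both views are corrupt, and one-way blockage when only one view is.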
All cores can be found here:
/home/ssandeep/
Greetings,
Sandeep.
information type: | Proprietary → Public |
tags: | added: bms vrouter |
Changed in juniperopenstack: | |
assignee: | nobody → Manish Singh (manishs) |
importance: | Undecided → Critical |
milestone: | none → r3.1.3.0 |
tags: | added: nttc |
description: | updated |
Hi Manish,
Their test procedure is as follows:
The environment is installed clean (3 Control Nodes, 4 TSNs, 128 ToR-Agents, 4 Compute Nodes, 1 OpenStack Node) as shown below.
After that, they go ahead provisioning LIFs and pass traffic.
=======================================================
A1-1 Setup Procedure
cd /opt/contrail/utils/
fab install_pkg_all:/tmp/contrail-install-packages_3.1.3.0-73~mitaka_all.deb
fab upgrade_kernel_all
fab install_contrail
fab setup_all
A1-2
They encountered an issue with the TSNs being unstable after reboot (due to bond settings, which you worked on with Mehul and resolved).
A1-3
Execute "service supervisor-vrouter restart" on the 4 TSN nodes.
A1-5 contrail-*-agent*.conf
Modify the following params:
(TSN & Compute)
/etc/contrail/
headless_mode = true
/etc/contrail/supervisord_vrouter.conf
environment=TBB_THREAD_COUNT=8
(TSN)
/etc/modprobe.d/vrouter.conf
options vrouter vr_mpls_labels=256000 vr_nexthops=521000 vr_vrfs=65536 vr_bridge_entries=1000000
(Compute)
/etc/modprobe.d/vrouter.conf
options vrouter vr_mpls_labels=11520 vr_flow_entries=2097152
/etc/contrail/contrail-vrouter-agent.conf
flow_collection = True
[DEFAULT]
flow_cache_timeout = 60
disable_
[FLOWS]
max_vm_flows = 45
remove virbr0
virsh net-destroy default
virsh net-autostart default --disable
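The TSN and Compute hosts above differ only in their vrouter module parameters. A small sketch that renders an /etc/modprobe.d options line from a parameter dict (the helper name is hypothetical; the parameter names and values are the ones reconstructed from the section above):

```python
def modprobe_line(module, params):
    """Render an /etc/modprobe.d 'options' line for a kernel module."""
    opts = " ".join(f"{k}={v}" for k, v in params.items())
    return f"options {module} {opts}"

# TSN tuning values from the procedure above.
tsn = {"vr_mpls_labels": 256000, "vr_nexthops": 521000,
       "vr_vrfs": 65536, "vr_bridge_entries": 1000000}
# Compute tuning values from the procedure above.
compute = {"vr_mpls_labels": 11520, "vr_flow_entries": 2097152}

print(modprobe_line("vrouter", tsn))
# options vrouter vr_mpls_labels=256000 vr_nexthops=521000 vr_vrfs=65536 vr_bridge_entries=1000000
print(modprobe_line("vrouter", compute))
# options vrouter vr_mpls_labels=11520 vr_flow_entries=2097152
```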
A1-6
All Contrail servers were rebooted (executed "shutdown -r now").
=======
They believe this issue is due to the POSTs they issue (around 40 POSTs per second across 10 sessions). The POST messages mostly create virtual networks, virtual machines, etc.
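Spread evenly, 40 POSTs/s across 10 sessions comes to 4 POSTs/s per session, i.e. one every 250 ms. A minimal pacing sketch of that arithmetic (only the rate and session count are from the report; the helper names are assumptions):

```python
def per_session_interval(total_rate_hz, sessions):
    """Seconds between POSTs for one session, assuming an even split."""
    return sessions / total_rate_hz

def send_schedule(total_rate_hz, sessions, count):
    """First `count` send offsets (seconds) for a single session."""
    step = per_session_interval(total_rate_hz, sessions)
    return [round(i * step, 6) for i in range(count)]

# 40 POSTs/s over 10 sessions -> one POST every 0.25 s per session.
print(per_session_interval(40, 10))  # 0.25
print(send_schedule(40, 10, 4))      # [0.0, 0.25, 0.5, 0.75]
```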
The logs below should help:
ssh root@10.219.48.123
password:Jtaclab123
[root@LocalStorage coreCollectedMay26]# pwd
/home/ssandeep/2017-0424-0113/coreCollectedMay26
[root@LocalStorage coreCollectedMay26]# ls -lrt
total 382604
-rw-rw-r--. 1 1001 1001 1181918 May 23 08:05 20170523-pt008.log
-rw-rw-r--. 1 1001 1001 1181918 May 23 08:05 20170523-pt009.log
-rw-rw-r--. 1 1001 1001 3549981 May 23 08:13 20170523-pt002.log
-rw-rw-r--. 1 1001 1001 4729948 May 23 08:18 20170523-pt004.log
-rw-rw-r--. 1 1001 1001 4805984 May 23 08:19 20170523-pt001.log
-rw-rw-r--. 1 1001 1001 4805982 May 23 08:20 20170523-pt003.log
-rw-rw-r--. 1 1001 1001 7824423 May 23 08:34 20170523-pt007.log
-rw-rw-r--. 1 1001 1001 11948881 May 23 08:37 20170523-pt006.log
-rw-rw-r--. 1 1001 1001 11848002 May 23 08:40 20170523-pt011.log
-rw-rw-r--. 1 1001 1001 11850306 May 23 08:40 20170523-pt010.log
-rw-r--r--. 1 root root 264286785 May 26 21:07 20170526_JN-323_tor-agent-21-core.zip
-rw-r--r--. 1 root root 63744000 May 26 21:13 20170523_JN-323_post.tar
All the *-pt*.log files record the POSTs they were issuing when this issue occurred. They might give you some hint as to what is causing the problem.
The zip file 20170526_JN-323_tor-agent-21-core.zip has the tor-agent and TSN cores we collected on openc-36. I will unicast you my notes, which have the VNIs for your reference.
Please let me know when the binary is ready.
Greetings,
Sandeep.