Control node assertion in RibOutUpdates::PeerDequeue in scaled setup

Bug #1386460 reported by Nischal Sheth
This bug affects 1 person
Affects              Status         Importance  Assigned to     Milestone
Juniper Openstack (status tracked in Trunk)
  R1.1               Fix Committed  High        Nischal Sheth
  R2.0               Fix Released   High        Nischal Sheth
  Trunk              Fix Released   High        Nischal Sheth

Bug Description

Release 1.10 build 44.

Happened in Harshad's scale setup, which has 1000 vRouters and 3 control nodes (CNs).
The problem seems to occur multiple times while the setup is initializing.

Backtrace:

(gdb) bt
#0 0x00007f92609a0425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f92609a3b8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f92609990ee in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007f9260999192 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x0000000000616723 in RibOutUpdates::PeerDequeue (this=0x7f9208004e10, queue_id=1, peer=<optimized out>, mready=..., blocked=0x7f925e89e9b0) at controller/src/bgp/bgp_ribout_updates.cc:251
#5 0x00000000006717ef in SchedulingGroup::UpdatePeerQueue (this=0x7f91d400ec60, peer=0x7f92200081e0, ps=0x7f91d401e2e0, queue_id=1) at controller/src/bgp/scheduling_group.cc:1069
#6 0x0000000000671ab3 in SchedulingGroup::UpdatePeer (this=0x7f91d400ec60, peer=0x7f92200081e0) at controller/src/bgp/scheduling_group.cc:1110
#7 0x00000000006761fd in SchedulingGroup::Worker::Run (this=0x7f925005d290) at controller/src/bgp/scheduling_group.cc:437
#8 0x00000000009fccc0 in TaskImpl::execute (this=0x7f9250067f40) at controller/src/base/task.cc:224
#9 0x00007f9261c02ece in ?? () from /usr/lib/libtbb_debug.so.2
#10 0x00007f9261bf9e0b in ?? () from /usr/lib/libtbb_debug.so.2
#11 0x00007f9261bf86f2 in ?? () from /usr/lib/libtbb_debug.so.2
#12 0x00007f9261bf33ce in ?? () from /usr/lib/libtbb_debug.so.2
#13 0x00007f9261bf3270 in ?? () from /usr/lib/libtbb_debug.so.2
#14 0x00007f926174ae9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#15 0x00007f9260a5dccd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#16 0x0000000000000000 in ?? ()

information type: Proprietary → Public
Nischal Sheth (nsheth)
Changed in juniperopenstack:
status: New → In Progress
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote: A change has been merged

Reviewed: https://review.opencontrail.org/5576
Committed: http://github.org/Juniper/contrail-controller/commit/e12740b76962930457bc55a948fc9af5de994a1a
Submitter: Zuul
Branch: R1.10

commit e12740b76962930457bc55a948fc9af5de994a1a
Author: Nischal Sheth <email address hidden>
Date: Thu Dec 11 13:58:44 2014 -0800

Fix corner case in SchedulingGroup::UpdatePeerQueue logic

An assertion fails if a peer gets blocked when dequeueing updates from
multiple RibOuts via SchedulingGroup::UpdatePeer.

Problem happens in the following situation:

- Peer was previously blocked and now has updates to send for 2 RibOuts.
- Updates for both RibOuts are for the same queue i.e. QBULK or QUPDATE.
- Peer shares a marker for the first RibOut with another peer, or the
peer's marker gets merged with the marker for another peer when sending
updates for the first RibOut (via RibOutUpdates::PeerDequeue).
- There are still more updates to be sent for the first RibOut i.e. the
processing in RibOutUpdates::PeerDequeue keeps going.
- Original peer gets send blocked, but we manage to dequeue all updates
for the first RibOut to the other peer with which the original peer's
marker got merged.
- RibOutUpdates::PeerDequeue returns true because of the previous point.

At this point, we continue and try to dequeue updates for the 2nd RibOut
because RibOutUpdates::PeerDequeue returned success. We hit an assertion
in RibOutUpdates::PeerDequeue when called for the 2nd RibOut because the
original peer is not in the send ready set anymore.

Fix is to stop processing RibOuts for the peer if it's send blocked when
RibOutUpdates::PeerDequeue returns. This ensures that we don't hit the
assertion since we don't try to process the 2nd RibOut. Updates for the
2nd RibOut will be sent to the other peer when its WorkPeer item gets
processed.

Change-Id: Ib1ef218ad9eecb1ca489b3045bdc3419e75caa21
Closes-Bug: 1386460

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote:

Reviewed: https://review.opencontrail.org/5574
Committed: http://github.org/Juniper/contrail-controller/commit/a3490d6cc1c1186f9b38d8213555670875ccebbc
Submitter: Zuul
Branch: master

commit a3490d6cc1c1186f9b38d8213555670875ccebbc
Author: Nischal Sheth <email address hidden>
Date: Thu Dec 11 13:58:44 2014 -0800

Fix corner case in SchedulingGroup::UpdatePeerQueue logic

An assertion fails if a peer gets blocked when dequeueing updates from
multiple RibOuts via SchedulingGroup::UpdatePeer.

Problem happens in the following situation:

- Peer was previously blocked and now has updates to send for 2 RibOuts.
- Updates for both RibOuts are for the same queue i.e. QBULK or QUPDATE.
- Peer shares a marker for the first RibOut with another peer, or the
peer's marker gets merged with the marker for another peer when sending
updates for the first RibOut (via RibOutUpdates::PeerDequeue).
- There are still more updates to be sent for the first RibOut i.e. the
processing in RibOutUpdates::PeerDequeue keeps going.
- Original peer gets send blocked, but we manage to dequeue all updates
for the first RibOut to the other peer with which the original peer's
marker got merged.
- RibOutUpdates::PeerDequeue returns true because of the previous point.

At this point, we continue and try to dequeue updates for the 2nd RibOut
because RibOutUpdates::PeerDequeue returned success. We hit an assertion
in RibOutUpdates::PeerDequeue when called for the 2nd RibOut because the
original peer is not in the send ready set anymore.

Fix is to stop processing RibOuts for the peer if it's send blocked when
RibOutUpdates::PeerDequeue returns. This ensures that we don't hit the
assertion since we don't try to process the 2nd RibOut. Updates for the
2nd RibOut will be sent to the other peer when its WorkPeer item gets
processed.

Change-Id: Ib1ef218ad9eecb1ca489b3045bdc3419e75caa21
Closes-Bug: 1386460

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote:

Reviewed: https://review.opencontrail.org/5575
Committed: http://github.org/Juniper/contrail-controller/commit/79965052c2309953b9a290739c078d35fc827e34
Submitter: Zuul
Branch: R2.0

commit 79965052c2309953b9a290739c078d35fc827e34
Author: Nischal Sheth <email address hidden>
Date: Thu Dec 11 13:58:44 2014 -0800

Fix corner case in SchedulingGroup::UpdatePeerQueue logic

An assertion fails if a peer gets blocked when dequeueing updates from
multiple RibOuts via SchedulingGroup::UpdatePeer.

Problem happens in the following situation:

- Peer was previously blocked and now has updates to send for 2 RibOuts.
- Updates for both RibOuts are for the same queue i.e. QBULK or QUPDATE.
- Peer shares a marker for the first RibOut with another peer, or the
peer's marker gets merged with the marker for another peer when sending
updates for the first RibOut (via RibOutUpdates::PeerDequeue).
- There are still more updates to be sent for the first RibOut i.e. the
processing in RibOutUpdates::PeerDequeue keeps going.
- Original peer gets send blocked, but we manage to dequeue all updates
for the first RibOut to the other peer with which the original peer's
marker got merged.
- RibOutUpdates::PeerDequeue returns true because of the previous point.

At this point, we continue and try to dequeue updates for the 2nd RibOut
because RibOutUpdates::PeerDequeue returned success. We hit an assertion
in RibOutUpdates::PeerDequeue when called for the 2nd RibOut because the
original peer is not in the send ready set anymore.

Fix is to stop processing RibOuts for the peer if it's send blocked when
RibOutUpdates::PeerDequeue returns. This ensures that we don't hit the
assertion since we don't try to process the 2nd RibOut. Updates for the
2nd RibOut will be sent to the other peer when its WorkPeer item gets
processed.

Change-Id: Ib1ef218ad9eecb1ca489b3045bdc3419e75caa21
Closes-Bug: 1386460
