Possible scale issues with neutron-fwaas requesting all tenants with firewalls after RPC failures

Bug #1618244 reported by Sridar Kandaswamy
This bug affects 1 person
Affects: neutron
Status: In Progress
Importance: Low
Assigned to: Bertrand Lallau
Milestone: (none)

Bug Description

Information from zzelle, in conversation with njohnston:

An overload is caused first by some neutron-servers crashing, and secondly by every l3-agent trying to perform a "full" process_services_sync. Even after we restarted every crashed neutron-server and purged the neutron queues, the neutron-server RPC workers were still overloaded because of the full syncs.

The deployment has about 60 L3 agents, with one router per agent.

Key question: why, during a full sync, does an l3-agent request ALL tenants with firewalls instead of requesting only its own tenants with firewalls?

https://github.com/openstack/neutron-fwaas/blob/master/neutron_fwaas/services/firewall/agents/l3reference/firewall_l3_agent.py#L224
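
A minimal sketch of the v1 full-sync flow described above. The RPC method names (get_tenants_with_firewalls, get_firewalls_for_tenant) come from firewall_l3_agent.py; the stub plugin proxy and tenant counts below are hypothetical, for illustration only.

    class StubFirewallPluginApi:
        """Stand-in for the agent's RPC proxy to neutron-server."""

        def get_tenants_with_firewalls(self, context):
            # The real call returns EVERY tenant that owns a firewall,
            # regardless of which agent is asking.
            return ["tenant-%d" % i for i in range(1000)]

        def get_firewalls_for_tenant(self, context, tenant_id):
            # One extra RPC round-trip per tenant.
            return []


    def process_services_sync(plugin_rpc, context):
        """Simplified full sync, as run by EVERY recovering L3 agent."""
        rpc_calls = 1  # the get_tenants_with_firewalls call itself
        for tenant_id in plugin_rpc.get_tenants_with_firewalls(context):
            # Even tenants with no routers on this agent trigger a call.
            plugin_rpc.get_firewalls_for_tenant(context, tenant_id)
            rpc_calls += 1
        return rpc_calls


    # With ~60 agents recovering at once, each issuing
    # 1 + len(all firewall tenants) RPC calls, an already-busy
    # neutron-server is flooded.
    print(process_services_sync(StubFirewallPluginApi(), context=None))  # 1001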

tags: added: fwaas
Revision history for this message
Sridar Kandaswamy (skandasw) wrote :

zzelle, thanks for bringing this up. If I understand correctly, you are asking for a change along the lines of checking which routers are part of the specific L3 agent and getting the firewalls corresponding to just those tenants; that should bring down the scale compared to getting all the tenants with firewalls.

One thing I also want to understand is the messaging right after a recovery from an RPC failure: each agent will still check whether it has firewalls associated, so we will have all these agents asking the plugin, and the messaging may not quite come down. Perhaps we can continue this conversation in triage to see what is the best way to move forward, and we can address that.
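
A minimal sketch of the agent-side idea raised in this comment: derive the tenant set from the routers this agent actually hosts, then fetch firewalls only for those tenants. The router mapping and RPC stub are hypothetical, for illustration.

    class StubFirewallPluginApi:
        def get_firewalls_for_tenant(self, context, tenant_id):
            return []  # the real call returns this tenant's firewall list


    def sync_hosted_tenants(plugin_rpc, context, router_info):
        # router_info: router_id -> tenant_id for routers on THIS agent
        # (the real agent keeps comparable state in self.router_info).
        for tenant_id in set(router_info.values()):
            plugin_rpc.get_firewalls_for_tenant(context, tenant_id)


    # Per-agent messaging drops to one call per hosted tenant, but after
    # a mass recovery every agent still issues these calls, so the plugin
    # still sees a burst -- which is the concern raised above.
    sync_hosted_tenants(StubFirewallPluginApi(), None, {"router-1": "tenant-2"})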

Changed in neutron:
assignee: nobody → Bertrand Lallau (bertrand-lallau)
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron-fwaas (master)

Fix proposed to branch: master
Review: https://review.openstack.org/424551

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/428034

Changed in neutron:
assignee: Bertrand Lallau (bertrand-lallau) → Reedip (reedip-banerjee)
Changed in neutron:
assignee: Reedip (reedip-banerjee) → Bertrand Lallau (bertrand-lallau)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron-fwaas (master)

Reviewed: https://review.openstack.org/424551
Committed: https://git.openstack.org/cgit/openstack/neutron-fwaas/commit/?id=bd8c8af160ba428f0b5526d232b479c46bf6b321
Submitter: Jenkins
Branch: master

commit bd8c8af160ba428f0b5526d232b479c46bf6b321
Author: Bertrand Lallau <email address hidden>
Date: Tue Jan 24 09:41:13 2017 +0100

    Fix scale issue in case of services_sync_needed v1

    Currently, if an RPC timeout occurs during an AMQP call, a full
    firewall sync is performed by the L3 agent.

    The full sync in process_services_sync works as follows in v1:

    1. a get_tenants_with_firewalls RPC call is sent
    2. the neutron server responds with ALL tenants
    3. for each tenant (even if the tenant is not scheduled on this L3 agent):
       => firewall rules are retrieved via a get_firewalls_for_tenant RPC call

    This process is really inefficient: it floods a neutron server
    process that is already fully busy (cf. the RPC timeout on the
    previous call).

    This patch makes get_tenants_with_firewalls return only the
    tenants with firewalls that are scheduled on this agent.

    The same approach can be applied to the firewall v2 code; that
    will be done in a follow-up patch.

    Change-Id: I75a047e5ab8bec8893971ea2430c68bfb7027512
    Partial-Bug: #1618244
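
A hedged sketch of the idea this patch describes: make the server-side get_tenants_with_firewalls handler return only tenants whose routers are scheduled on the requesting agent. The set-based helpers below stand in for the plugin's real DB queries and are hypothetical.

    def get_tenants_with_firewalls(all_fw_tenants, tenants_on_host):
        """all_fw_tenants:  every tenant owning a firewall (the old
                            behavior returned this whole set to every agent).
        tenants_on_host:    tenants owning a router hosted by the caller."""
        return sorted(all_fw_tenants & tenants_on_host)


    # An agent hosting a single router now sees only its own tenant
    # instead of iterating over every firewall tenant in the cloud.
    all_fw_tenants = {"t1", "t2", "t3", "t4"}
    print(get_tenants_with_firewalls(all_fw_tenants, {"t2"}))  # ['t2']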

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron-fwaas (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/441869

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron-fwaas (stable/ocata)

Reviewed: https://review.openstack.org/441869
Committed: https://git.openstack.org/cgit/openstack/neutron-fwaas/commit/?id=7100c00894922d3c401c1021eb98e76775e7d594
Submitter: Jenkins
Branch: stable/ocata

commit 7100c00894922d3c401c1021eb98e76775e7d594
Author: Bertrand Lallau <email address hidden>
Date: Tue Jan 24 09:41:13 2017 +0100

    Fix scale issue in case of services_sync_needed v1

    Currently, if an RPC timeout occurs during an AMQP call, a full
    firewall sync is performed by the L3 agent.

    The full sync in process_services_sync works as follows in v1:

    1. a get_tenants_with_firewalls RPC call is sent
    2. the neutron server responds with ALL tenants
    3. for each tenant (even if the tenant is not scheduled on this L3 agent):
       => firewall rules are retrieved via a get_firewalls_for_tenant RPC call

    This process is really inefficient: it floods a neutron server
    process that is already fully busy (cf. the RPC timeout on the
    previous call).

    This patch makes get_tenants_with_firewalls return only the
    tenants with firewalls that are scheduled on this agent.

    The same approach can be applied to the firewall v2 code; that
    will be done in a follow-up patch.

    Partial-Bug: #1618244
    (cherry picked from commit 1bb5d3a293a4d1ad31fd103969d9bc863f37dc09)

    Change-Id: I1a51ce3dc338bd52e6de7251573460d5aa4e4dd1

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron-fwaas (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/428034
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Changed in neutron:
importance: Undecided → Low