After rabbitmq failover neutron-server may get into state when it can't declare an exchange and consumes ~100% cpu

Bug #1493732 reported by Eugene Nikanorov
This bug affects 1 person
Affects             Status        Importance  Assigned to       Milestone
Mirantis OpenStack  Fix Released  Critical    Oleg Bondarev
7.0.x               Fix Released  Critical    Eugene Nikanorov
8.0.x               Fix Released  Critical    Oleg Bondarev

Bug Description

Fuel build #287, scale lab 10

Under certain conditions (high load, rabbitmq failover), neutron-server may get into a state where it tries to declare an exchange in the message queue broker and fails.
That leads to a kind of busy-spin, which leaves agents unable to retrieve information from the server via RPC.

Agents also can't report their state to the server, so some agents appear to be down from time to time.

User impact: in such an environment basic cloud features are likely to work unreliably or not at all (for example: spawning VMs, external network access, a VM's ability to receive a fixed IP address).

Changed in fuel:
status: New → Confirmed
no longer affects: mos
Revision history for this message
Alexander Ignatov (aignatov) wrote :

This is a regression introduced while synchronising the upstream Kilo puppet manifests to 7.0 (https://review.openstack.org/#/c/189678/), which set the parameter database_max_pool_size to 10. Even in 5.0 we set this parameter to 50: https://review.openstack.org/#/c/100859/4/deployment/puppet/openstack/manifests/controller.pp

Changed in fuel:
milestone: none → 7.0
importance: Undecided → Critical
assignee: nobody → Alexander Ignatov (aignatov)
Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

The following message with a trace can be seen throughout the server logs:

2015-09-09 08:30:17.830 35809 ERROR oslo_messaging.rpc.dispatcher [req-daa604ad-6e55-4aa7-855b-b28f939e5147 ] Exception during message handling: QueuePool limit of size 10 overflow 20 reached, connection timed out, timeout 10

Also, the issue leads to 100% cpu consumption by neutron-server.
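
The figures in the QueuePool error above follow SQLAlchemy's pool semantics: at most pool size plus overflow connections (10 + 20 here) can be checked out at once, and a further request waits up to the timeout before failing. A minimal stdlib sketch of that behaviour, assuming nothing about neutron's actual configuration; the class `BoundedPool` and the exception `PoolTimeout` are hypothetical and are not SQLAlchemy's implementation:

```python
import threading


class PoolTimeout(Exception):
    """Raised when no connection slot becomes free within the timeout."""


class BoundedPool:
    """Toy model of a pool capped at size + overflow checked-out connections."""

    def __init__(self, size, overflow, timeout):
        self.size = size
        self.overflow = overflow
        self.timeout = timeout
        # One semaphore slot per connection that may be checked out at once.
        self._slots = threading.BoundedSemaphore(size + overflow)

    def acquire(self):
        # Wait up to `timeout` seconds for a free slot, then give up.
        if not self._slots.acquire(timeout=self.timeout):
            raise PoolTimeout(
                "limit of size %d overflow %d reached, timeout %s"
                % (self.size, self.overflow, self.timeout))

    def release(self):
        self._slots.release()


# Short timeout so the demonstration finishes quickly.
pool = BoundedPool(size=10, overflow=20, timeout=0.1)
for _ in range(30):        # 10 + 20 checkouts succeed
    pool.acquire()
try:
    pool.acquire()         # the 31st waits for the timeout, then fails
    exhausted = False
except PoolTimeout:
    exhausted = True
```

In this model a larger limit only postpones the error if the holders never call `release()`, which is one reason a pool overflow message can be a symptom rather than the cause.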

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/221650

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/7.0)

Fix proposed to branch: stable/7.0
Review: https://review.openstack.org/221653

Revision history for this message
Eugene Nikanorov (enikanorov) wrote : Re: Amount of neutron-server RPC workers is not enough to handle scale lab with DVR

Further analysis shows that the problem is neither in the number of workers nor in the size of the sql connection pool.

Under certain conditions (rabbitmq failover) an rpc worker fails to declare an exchange, which makes it retry sending hundreds of times. The worker then consumes nearly 100% of cpu and drops out of the pool of workers that are able to serve rpc requests. The number of such rpc workers grows quite fast, and the remaining "alive" workers quickly exhaust their connection pool, producing the messages about connection pool overflow.
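
A retry loop with no delay and no attempt limit behaves exactly like the busy-spin described above. The sketch below is illustrative only and is not oslo.messaging code: `call_with_backoff`, `declare_exchange` and `DeclareFailed` are hypothetical names. It shows one common way to keep a failing worker from pinning a cpu, by capping the attempts and sleeping between them:

```python
import time


class DeclareFailed(Exception):
    """Hypothetical stand-in for the broker rejecting an exchange declare."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.01,
                      max_delay=1.0, sleep=time.sleep):
    """Run operation(), retrying on DeclareFailed with exponential backoff.

    Re-raises the last error once max_attempts is exhausted, so the
    worker can return to serving other requests instead of spinning.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except DeclareFailed:
            if attempt == max_attempts:
                raise
            sleep(delay)                       # yield the cpu between tries
            delay = min(delay * 2, max_delay)  # back off, up to a ceiling


attempts = []


def declare_exchange():
    # Fails twice, then succeeds, imitating a broker that is recovering.
    attempts.append(1)
    if len(attempts) < 3:
        raise DeclareFailed("exchange not available yet")
    return "declared"


# Record the requested delays instead of really sleeping.
delays = []
result = call_with_backoff(declare_exchange, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the sketch testable without waiting; a real worker would use the default `time.sleep` or its framework's equivalent.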

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :
summary: - Amount of neutron-server RPC workers is not enough to handle scale lab
- with DVR
+ After rabbitmq failover neutron-server may get into state when it can't
+ declare an exchange and consumes ~100% cpu
description: updated
Changed in mos:
assignee: nobody → MOS Oslo (mos-oslo)
status: New → Confirmed
importance: Undecided → Critical
Changed in mos:
milestone: none → 7.0
no longer affects: fuel
no longer affects: fuel/7.0.x
no longer affects: fuel/8.0.x
tags: added: scale
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The issue with many messages in log
NotFound: Exchange.declare: (404) NOT_FOUND - no exchange 'reply_8e21a5c3e01842298b8e100a2f52d2e2' in vhost '/'

is filed separately there - https://bugs.launchpad.net/mos/+bug/1494416

The initial issue with l3 agents is not connected with it, as it started much earlier.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Thanks Dmitry for confirming that this issue is still with Neutron team and we have a related issue which we can work on separately from this Critical issue.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to openstack/neutron (openstack-ci/fuel-7.0/2015.1.0)

Related fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/11513

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote :

Related fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/11515

Revision history for this message
Oleg Bondarev (obondarev) wrote :

update on bugs:

https://bugs.launchpad.net/mos/7.0.x/+bug/1493732 (current)
https://bugs.launchpad.net/mos/+bug/1494416

Questions:

1. Are they related?
2. Is it correct that they reproduce only when all of the agents are restarted at once?
3. If yes, what are the possible real-life scenarios for this situation to occur except for our rally tests?

Quoting Eugene's email here:

"In fact we have a bunch of interrelated issues.
At the point when the first bug was filed, we didn't have exact explanation of what brought neutron down.
The flow was identified later.

Answering the questions:
1) They are related in that they both add to the problem; they complement each other, making the whole issue more frequent and cloud recovery more painful.
2) Not exactly. While a mass restart indeed kills the cloud, initially it got there during "light" rally tests with moderate load and a moderate amount of resources in the cloud overall.
So the problem here is that a certain load, or a burst of load, triggers the disaster; it doesn't have to be too much or too long.
It could be an instant spawn of 50-100 VMs, for instance.

What makes the issue critical is that once the cloud gets into such a nearly-dead state, it can only be recovered with an agent restart, which, if done with incorrect timing, will trigger the issue again.

The technical essence of the problem is that neutron's current "self-healing mechanisms" may put load on neutron itself that the current code (due to some bugs and architectural issues) can't handle, keeping it constantly unhealthy, like in a loop.
It also affects rabbitmq, which is restarted by our HA scripts due to high load and monitoring timeouts.

Regarding blocking the release... I think DVR feature is widely awaited, and people will start performing failover tests.
AFAIK we don't do many failover tests, especially on scale lab. I'm not saying that we should block the release, most fixes that we've done are fairly simple and I'd say that they're safe (at least 2 out of 3, the last one requires some additional testing)

I think if we are to release MOS right now, with those fixes we would be in much much better shape than before."

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to openstack/neutron (openstack-ci/fuel-7.0/2015.1.0)

Reviewed: https://review.fuel-infra.org/11515
Submitter: mos-infra-ci <>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: dc884791dba2717a8b02da97f4c380ef3ecbda6d
Author: Oleg Bondarev <email address hidden>
Date: Tue Sep 15 11:56:02 2015

Do not update ACTIVE ports back to BUILD status

Status update (ACTIVE-BUILD-ACTIVE) may trigger a bunch of
unneeded RPC communications between neutron server and l3
dvr agents which may overload server fatally.
Updated ports will be put in PENDING_BUILD status right after
db update to distinguish real port update and cases when agents
are just restarted and syncing with server.

Related-Bug: #1493732
Related-Bug: #1494416
Change-Id: Ia65b901cb4829d00e829d0b2afbb246860bf0fe5

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/neutron (openstack-ci/fuel-7.0/2015.1.0)

Reviewed: https://review.fuel-infra.org/11513
Submitter: mos-infra-ci <>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: 46867deb289e88391f1fadeef010b69a535f444a
Author: Oleg Bondarev <email address hidden>
Date: Thu Sep 17 15:05:44 2015

L3 agent: skip routers notifications if fullsync is true

In case l3 agent is about to fullsync there is no point in processing
routers_updated notifications separately.
This should decrease the (unneeded) load on neutron server at high
scale.

Closes-Bug: #1493732
Related-Bug: #1494416
Change-Id: Ic20b767f34903e9bf14f4616632af3b8698dcebb
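
The idea in this commit message can be sketched as follows. This is a heavily simplified, hypothetical model (the class `L3AgentSketch` and its methods are illustrative names, not the neutron agent's code): while a full sync is pending, individual router notifications are dropped, because the full sync will fetch every router anyway.

```python
class L3AgentSketch:
    """Hypothetical, heavily simplified agent; not the neutron class."""

    def __init__(self):
        self.fullsync = True      # a full resync with the server is pending
        self.pending = set()      # router ids queued for individual update

    def routers_updated(self, router_ids):
        # A pending full sync will fetch every router anyway, so queuing
        # individual updates would only add redundant server requests.
        if self.fullsync:
            return
        self.pending.update(router_ids)

    def full_sync(self, all_router_ids):
        # Stand-in for fetching and processing all routers in one pass.
        processed = set(all_router_ids)
        self.pending.clear()
        self.fullsync = False
        return processed


agent = L3AgentSketch()
agent.routers_updated(["r1", "r2"])   # ignored: full sync is pending
skipped = set(agent.pending)
synced = agent.full_sync(["r1", "r2", "r3"])
agent.routers_updated(["r2"])         # queued: the agent is in sync now
```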

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to openstack/neutron (openstack-ci/fuel-7.0/2015.1.0)

Reviewed: https://review.fuel-infra.org/11532
Submitter: mos-infra-ci <>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: 4469346be75d65f5c9e249966e6f1a771a4936f6
Author: Ilya Shakhat <email address hidden>
Date: Thu Sep 17 15:28:46 2015

Do not specify host for l2population topics

When creating topics oslo.messaging automatically creates
topic with hostname suffix (e.g. topic.hostname), there's
no need to do this explicitly.

Related-Bug: #1493732
Closes-Bug: #1495513

Upstream: https://review.openstack.org/223088
Upstream-Bug: #1495508

Change-Id: Iaedddf83517a6a90b7bfa281b7e32e9013f7a78c

Revision history for this message
Sergey Shevorakov (sshevorakov) wrote :

Added to 7.0 MU1, since it needs QA verification.

tags: added: 70mu1-confirmed
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on puppet-modules/puppet-neutron (mos-8.0)

Change abandoned by Sergey Kolekonov <email address hidden> on branch: mos-8.0
Review: https://review.fuel-infra.org/11661

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to openstack/neutron (openstack-ci/fuel-8.0/liberty)

Related fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Ilya Shakhat <email address hidden>
Review: https://review.fuel-infra.org/13309

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/neutron (openstack-ci/fuel-8.0/liberty)

Fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/13320

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to openstack/neutron (openstack-ci/fuel-8.0/liberty)

Related fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/13323

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/neutron (openstack-ci/fuel-8.0/liberty)

Change abandoned by Eugene Nikanorov <email address hidden> on branch: openstack-ci/fuel-8.0/liberty
Review: https://review.fuel-infra.org/13309
Reason: Got into liberty, commit #a8d0586fdebfd28e407e2d30f72c92e3711d0a1e

tags: removed: 70mu1-confirmed
Revision history for this message
Alexander Ignatov (aignatov) wrote :

@Oleg, the status of this bug is unclear. What steps are needed to solve this issue in 8.0?

Revision history for this message
Oleg Bondarev (obondarev) wrote :

This bug is effectively saying "Neutron with DVR is not scalable enough", so there is no single fix for it.
A number of fixes have landed in 7.0 (see bug comments above), with more to go into 7.0 updates; the same applies to 8.0.
The decision to close the bug in 8.0 should be made after scale tests.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to openstack/neutron (openstack-ci/fuel-7.0/2015.1.0)

Related fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/14140

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote :

Related fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/14147

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote :

Related fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/14153

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote :

Related fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/14154

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/neutron (openstack-ci/fuel-8.0/liberty)

Change abandoned by Oleg Bondarev <email address hidden> on branch: openstack-ci/fuel-8.0/liberty
Review: https://review.fuel-infra.org/13320
Reason: It was abandoned in upstream, latest scale tests showed that it is not required given other dvr scale fixes

Revision history for this message
Mikhail Chernik (mchernik) wrote :

Reproduced on MOS 7.0 GA (build 301)
fuel-version: http://paste.openstack.org/show/481223/
neutron-all.log: http://paste.openstack.org/show/481224/

Revision history for this message
Oleg Bondarev (obondarev) wrote :

Fixes are merged to 8.0; we need to test on the scale lab to be confident and close the bug for 8.0.

Revision history for this message
Oleg Bondarev (obondarev) wrote :

After discussion in team it was decided to put bug in Fix Committed as fixes were merged: https://review.fuel-infra.org/#/c/13613/ https://review.fuel-infra.org/#/c/13612/ https://review.fuel-infra.org/#/c/13772/ https://review.fuel-infra.org/#/c/13611/

These fixes resolved the bug for 7.0, which was tested on 200 nodes.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to openstack/neutron (openstack-ci/fuel-7.0/2015.1.0)

Reviewed: https://review.fuel-infra.org/14140
Submitter: Denis V. Meltsaykin <email address hidden>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: 868e711fa281180c645248d525e3b51495e96fb1
Author: Oleg Bondarev <email address hidden>
Date: Tue Nov 24 11:55:08 2015

DVR: notify specific agent when creating floating ip

Currently when floating ip is created, a lot of useless action
is happening: floating ip router is scheduled, all l3 agents where
router is scheduled are notified about router update, all agents
request full router info from server. All this becomes a big
performance problem at scale with lots of compute nodes.
In fact on (associated) Floating IP creation we really need
to notify specific l3 agent on compute node where associated
VM port is located and do not need to schedule router and
 bother other agents where router is scheduled. This should
 significantly decrease unneeded load on neutron server at scale.

Partial-Bug: #1512635
Related-Bug: #1500823
Related-Bug: #1493732
Partial-Bug: #1486828

Change-Id: I0cbe8c51c3714e6cbdc48ca37135b783f8014905
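
To illustrate the difference between the broad and the targeted notification described in this commit message, here is a small sketch. The function `agents_to_notify`, its parameters and the host names are hypothetical, and the 200-host figure is only an example size:

```python
def agents_to_notify(router_agents, port_host=None):
    """Return the hosts whose l3 agent should get the floating IP update.

    router_agents: hosts where the router is scheduled (hypothetical input).
    port_host: host of the VM port the floating IP is associated with.
    """
    if port_host is not None:
        return [port_host]          # targeted: a single agent
    return sorted(router_agents)    # broad: every agent hosting the router


hosts = ["compute-%d" % i for i in range(1, 201)]
broad = agents_to_notify(hosts)
targeted = agents_to_notify(hosts, port_host="compute-42")
```

Since each notified agent then requests router info from the server, the broad variant costs on the order of one request per host for a single floating IP, while the targeted variant costs one.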

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote :

Reviewed: https://review.fuel-infra.org/14147
Submitter: Denis V. Meltsaykin <email address hidden>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: 00198bf9818f115d5acff9e6a5bef8080004c606
Author: Oleg Bondarev <email address hidden>
Date: Mon Dec 28 14:49:43 2015

DVR: only notify needed agents on new VM port creation

When a new VM which should be serviced by a DVR router appears
on compute host, this router is scheduled to that host and
notification is sent. Before the patch it was a broad notification
while really we only need to notify agent on target host.
This should decrease the load on neutron server at scale.

Closes-Bug: #1514762
Related-Bug: #1500823
Related-Bug: #1493732
Closes-Bug: #1486795

Conflicts:
 neutron/db/l3_dvrscheduler_db.py
 neutron/tests/functional/services/l3_router/test_l3_dvr_router_plugin.py
 neutron/tests/unit/scheduler/test_l3_agent_scheduler.py

Change-Id: Id48b6f6a71530c4f6092d2a07b2db1a5cd300c05

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote :

Reviewed: https://review.fuel-infra.org/14154
Submitter: Denis V. Meltsaykin <email address hidden>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: dbcc3d300c57a4919080f8c58f9061088d90f31f
Author: Oleg Bondarev <email address hidden>
Date: Mon Dec 28 20:49:27 2015

DVR: notify specific agent when deleting floating ip

In DVR case we only need to notify the l3 agent on compute node
where associated fixed port is located.

Closes-Bug: #1512635
Related-Bug: #1500823
Related-Bug: #1493732
Closes-Bug: #1486828

Conflicts:

 neutron/tests/functional/services/l3_router/test_l3_dvr_router_plugin.py

Change-Id: I644238ca295c4eb6df75a99a8ef6143a801b27cb

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote :

Reviewed: https://review.fuel-infra.org/14153
Submitter: Denis V. Meltsaykin <email address hidden>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: 0723acb923d9c9939943ead91ea5830130299ddd
Author: Oleg Bondarev <email address hidden>
Date: Mon Dec 28 20:49:18 2015

DVR: Notify specific agent when update floatingip

The L3 agent was determined when update floatingip.
So notify the specific agent rather than notify all agents.
This will save some RPC resources. This is only for DVR routers.
Legacy and HA routers notify only the relevant agents.
This reproposes commit 52e91f48f2327b47f126893f9cb12f153380a9a6
which was reverted by commit a2f7e0343a147a30a637af4e1cb9a866f557e87d
because of Ironic gate failures.
Now the patch preserves original behavior for legacy routers and
should not break Ironic tests.

Partial-Bug: #1512635
Related-Bug: #1500823
Related-Bug: #1493732
Partial-Bug: #1486828

Conflicts:
 neutron/tests/functional/services/l3_router/test_l3_dvr_router_plugin.py

Change-Id: I4ef7a69ad033b979ea0e29620a4febfe5e0c30dd

tags: added: neutron
tags: added: area-neutron
removed: neutron
Revision history for this message
Ivan Lozgachev (ilozgachev) wrote :

Verified on ENV-13 build 518

tags: added: 8.0 release-notes-done
Revision history for this message
Sergii Rizvan (srizvan) wrote :

We have set the status of the bug to Fix Released, because all changes necessary to fix the bug were merged within the 7.0-MU2 milestone.

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Correct, I will make sure that release notes will be updated accordingly

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/neutron (openstack-ci/fuel-8.0/liberty)

Change abandoned by Oleg Bondarev <email address hidden> on branch: openstack-ci/fuel-8.0/liberty
Review: https://review.fuel-infra.org/13323
Reason: Upstream patch was abandoned
