Restarting Neutron floods Nova with segment aggregates calls
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| neutron | Confirmed | High | Miguel Lavalle | |
Bug Description
* High level description:
Whenever we restart neutron-server, we see a huge number of requests (hundreds of thousands, overwhelming our control plane over the course of ~8 hours) going from neutron-server to nova-api. These requests are related to the segment aggregates: more specifically, for each of our hypervisors and for each segment, we see a GET to [URL] to fetch the aggregate id for the segment and a POST to [URL] to (try to) add the hypervisor to the aggregate, which fails because the host is already in it. These calls originate from this "_add_host_
To be more exact, we see more than one such log per (host id, segment id) pair: it appears several times, frequently at the beginning and then more and more rarely, until it stops appearing altogether. We determined that this is because each of the RPC worker processes engages in the same procedure independently. This procedure is the following (a simplified sketch is given after the list):
- When a process starts, the "reported_hosts" variable here https:/ is initialized as an empty set
- Each time a neutron agent on a hypervisor (in our case the openvswitch_agent) does its periodic state report, it sends a message over RabbitMQ
- This message gets picked at random by one of the neutron-server RPC worker processes. We enter the method "_update_
- Then, the next time the agent for this same hypervisor sends a state report message, there are two possibilities:
- case 1: the message is picked by the same RPC worker as before. We arrive at the line "if host in reported_hosts and not start_flag: return", but this time the hypervisor is already in the reported_hosts set, so we return and no request is sent
- case 2: the message is picked by a different RPC worker. We then end up in the same scenario as for the previous message.
At the beginning, just after neutron-server has been restarted, case 1 is highly improbable (say we have 50 RPC workers across our whole control plane: for a given hypervisor you have a 1/50 chance of hitting the same RPC worker, then a 2/50 chance on the next message, etc.). Conversely, case 2 is highly probable for the same reason (nearly all RPC workers have a "reported_hosts" variable that does not yet contain this hypervisor, since it is initially empty). This is why we see an enormous number of requests at the beginning and only a trickle at the end.
- Ultimately, this hypervisor is present in the "reported_hosts" variable of every RPC worker and we stop getting calls. This is the stable state of the system, and the reason why the calls only start again when we restart neutron-server.
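To make the mechanism concrete, here is a minimal sketch of the per-worker logic described above. Only "reported_hosts" and "start_flag" correspond to real names in the Neutron code; the nova_client helpers and everything else are illustrative, not the actual implementation:

```python
# Minimal sketch of the per-worker behaviour described above; only
# reported_hosts and start_flag are real names, the rest is illustrative.

reported_hosts = set()  # module-level, so each RPC worker process has its own copy


def handle_agent_state_report(host, start_flag, segment_ids, nova_client):
    # Case 1: this worker has already seen this host and the agent did not
    # just start -> nothing is sent to Nova.
    if host in reported_hosts and not start_flag:
        return
    reported_hosts.add(host)
    # Case 2: first report seen *by this worker* -> one GET + one POST per
    # segment, even though every other worker will eventually do the same.
    for segment_id in segment_ids:
        aggregate_id = nova_client.get_aggregate_for_segment(segment_id)  # GET
        nova_client.add_host_to_aggregate(aggregate_id, host)             # POST
```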
Included as an attachment is an excerpt of logs produced by a debug line that we added to our code in our test environment, which shows the "reported_hosts" set growing in parallel in each RPC worker process.
This whole procedure happens in parallel for all hosts. The large number of times the same series of calls is repeated for the same hypervisor is due to the fact that the "reported_hosts" set is a plain Python variable, purely local to each RPC worker process. In our case we have 4 neutron-server instances that each run 9 regular RPC workers and 9 state-report RPC workers, hence 4×18=72 copies of the variable.
We calculated that a rolling restart of our neutron-server instances (say, for a deployment update) will ultimately generate 300 hypervisors × 100 networks × 72 processes × 2 (one GET then one POST) = 4 320 000 calls to Nova.
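For reference, the back-of-the-envelope calculation behind those numbers:

```python
# Back-of-the-envelope check of the figures quoted above.
servers = 4                               # neutron-server instances
workers_per_server = 9 + 9                # regular + state-report RPC workers
processes = servers * workers_per_server  # copies of "reported_hosts"
hypervisors = 300
networks = 100
calls = hypervisors * networks * processes * 2  # one GET then one POST each
print(processes, calls)  # 72 4320000
```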
* Workaround
Included is a minimal patch that we are considering applying to our internal branch; it disables the whole "re-register to Nova at each restart" logic while still keeping the possibility of triggering this re-registration by restarting the neutron openvswitch agent on the hypervisors (thanks to the start_flag). The sketch below illustrates the idea.
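Roughly, the idea behind the patch is the following (this is not the literal diff; the helper names are the same illustrative ones as in the earlier sketch):

```python
# Rough illustration of the workaround (not the literal patch): the
# per-restart re-registration is dropped, only start_flag triggers it.

def handle_agent_state_report(host, start_flag, segment_ids, nova_client):
    if not start_flag:
        # Restarting neutron-server no longer causes a re-registration storm;
        # restarting the agent on the hypervisor (start_flag=True) still does.
        return
    for segment_id in segment_ids:
        aggregate_id = nova_client.get_aggregate_for_segment(segment_id)  # GET
        nova_client.add_host_to_aggregate(aggregate_id, host)             # POST
```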
* Pre-conditions: what is the initial state of your system? Please consider enumerating resources available in the system, if useful in diagnosing the problem. Who are you? A regular tenant or a super-user? Are you describing service-to-service interaction?
We're operating a fairly large deployment. Our biggest region, where this issue is the most impactful, has 300 hypervisors across 2 AZs, 100 provider routed networks (we don't do SDN), and 4 baremetal control plane nodes with 24 hyperthreads and 250GB RAM each, on which all OpenStack services are deployed (notably Nova, Neutron, Glance, RabbitMQ and HAProxy) except MySQL Galera, which runs alone on another 4 baremetal nodes.
* Step-by-step reproduction steps:
$ docker restart neutron_server
* Expected output: nothing
* Actual output: Hundreds of thousands of calls are emitted from neutron-server to nova-api, crippling our control plane for hours.
* Version:
** OpenStack version: Ussuri
** Linux distro, kernel: CentOS 7 for the host, CentOS 8 for the Kolla containers
** DevStack or other _deployment_ mechanism: Kolla-ansible 10.2
* Perceived severity: the call flood basically takes our control plane down under the load for the first 4 hours and severely degrades its performance for the next 4 hours, every time we restart neutron-server. As a result we are severely constrained in our ability to restart neutron-server, which should be a non-event.
Changed in neutron:
status: New → Confirmed
importance: Undecided → High
Changed in neutron:
assignee: nobody → Miguel Lavalle (minsel)
When the Neutron server is starting, the in-memory variable 'reported_hosts' is an empty set. Even when the segments reported by the agents have already been initialized, the algorithm executes the process of reporting host/segment mappings to Nova, creating an aggregate for each segment and adding hosts to it.
We can note a few points regarding the current process:
- It only reports mappings that are added, but does not report mappings that get removed (hosts or segments deleted)
- It reports mappings that already exist, which leads to a terrible flood in Nova for large deployments using segments.
Even if mappings and aggregates are persistent, it seems reasonable to execute such a rebuilding process during a restart of Neutron.
To fix the issue, one suggestion is to initialize an in-memory data structure, shared by the workers, with the host/segment mappings. When Neutron starts, the first worker that acquires the lock would retrieve the mappings from the database. The process could then compare the mappings received from agents against it and report changes only when needed.
If there is no change in a mapping, we can expect the related aggregate to also be in sync.
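Something like the following single-process sketch of the compare-before-report idea (all helper names are illustrative; in Neutron the structure and the lock would have to be shared across the RPC worker processes, which is the hard part):

```python
# Single-process sketch of the suggestion: load known mappings once, then only
# call Nova when an agent report actually changes a host/segment mapping.
import threading

_init_lock = threading.Lock()       # would need to be cross-process in reality
_known_mappings = None              # host -> set of segment ids


def _ensure_mappings_loaded(db):
    global _known_mappings
    with _init_lock:
        if _known_mappings is None:
            _known_mappings = db.load_host_segment_mappings()


def handle_agent_state_report(host, segment_ids, db, nova_client):
    _ensure_mappings_loaded(db)
    current = set(segment_ids)
    previous = _known_mappings.get(host, set())
    if current == previous:
        return  # nothing changed, assume the Nova aggregate is also in sync
    for segment_id in current - previous:   # report only the new mappings
        aggregate_id = nova_client.get_aggregate_for_segment(segment_id)
        nova_client.add_host_to_aggregate(aggregate_id, host)
    _known_mappings[host] = current
```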
Miguel, does that sound reasonable to you? I see you are assigned to the ticket; would you share your thinking with us?