[RFE] Move to an agent-based design
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Networking ML2 Generic Switch | Confirmed | Undecided | Unassigned | |
## Bug Description
## Current design
Currently, the entire NGS code is dynamically loaded as an ML2 mechanism driver inside a Neutron process. This means that all SSH operations (via netmiko) are performed in the context of the Neutron process.
When Neutron is deployed with several threads or processes, we end up with several independent instances of NGS. An optional locking mechanism based on etcd coordinates these NGS instances. An additional batching mechanism, also based on etcd, allows each NGS instance to manage a queue of requests for a switch and to send several commands over a single SSH connection for improved performance.
https:/
https:/
## Proposed agent-based design
We would like to move all SSH operations to an NGS agent, running as a single dedicated process that would receive commands from Neutron through RPC calls over RabbitMQ.
It would bring the following benefits:
1) better performance: this is our main motivation.
a. the agent would run one eventlet thread per switch, so that it can parallelize configuration across different switches
b. each eventlet thread would be able to efficiently batch commands over a single SSH connection by continuously pulling commands from an in-memory queue
c. each eventlet thread could keep its SSH connection open for some time after the last command, so that future commands are faster
d. we benchmarked our agent-based prototype and it performs better than the current etcd-based batching mechanism, because we can batch more commands together.
2) simpler code and operation: no distributed locking is necessary, etcd would no longer be required
3) uwsgi compatibility: the current batching mechanism does not work when running Neutron with uwsgi, because uwsgi does not support asynchronous eventlet tasks that run outside of an API call. The agent design would overcome this limitation because all asynchronous tasks would be executed outside of Neutron itself.
4) security: secrets such as SSH passwords would only be accessible from the agent process, whereas currently they are accessible from the main Neutron processes
5) flexibility: the NGS agent could run on a different host than Neutron, which may be required for security (e.g. access-list filters that only allow SSH connection from a "bastion") or performance (move the agent closer to the physical switches for lower latency)
## Design details
Only a small part of the NGS code would remain in the ML2 mechanism driver: it would simply listen for network/subnet/port creation or deletion events, perform basic checking, and then call the agent through the RPC bus and wait for the answer (success/fail) or timeout.
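The driver-side logic described above can be sketched as follows. This is a minimal illustration only: all class, method, and parameter names here are hypothetical, and a real implementation would use an oslo.messaging RPC client over RabbitMQ rather than the stand-in object shown.

```python
class AgentTimeout(Exception):
    """Raised when the agent does not answer within the RPC timeout."""


class NGSMechanismDriver:
    """Sketch of the slimmed-down ML2 mechanism driver.

    It performs only basic checking, then delegates all switch
    configuration to the agent through an RPC client (hypothetical
    stand-in for an oslo.messaging RPCClient).
    """

    def __init__(self, rpc_client, timeout=60):
        self.rpc = rpc_client      # stand-in for the RPC bus client
        self.timeout = timeout     # how long to wait for the agent's answer

    def create_port_postcommit(self, context):
        # Basic checking stays in the driver...
        port = context["port"]
        if not port.get("binding:profile"):
            raise ValueError("missing binding profile")
        # ...while all SSH work is delegated to the agent; we wait for
        # its answer (success/fail) or a timeout.
        result = self.rpc.call("plug_port", port, timeout=self.timeout)
        if result != "success":
            raise AgentTimeout(f"agent failed to plug port {port['id']}")
        return result
```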
At startup, the agent spawns an eventlet greenthread + queue for each physical switch present in the configuration. Each greenthread loops indefinitely, waiting for commands to be pushed into its queue. When commands are received on the queue, the greenthread opens an SSH connection to the switch, performs as many commands in a row as possible (batching), then possibly waits a bit for new commands to arrive, and finally saves the configuration on the switch and closes the SSH connection.
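A per-switch worker loop of this kind could be sketched as below. This is an assumption-laden sketch, not NGS code: it uses standard threads and a plain callable (`send_batch`) in place of the eventlet greenthread and the netmiko SSH session, and the idle-linger duration is an invented parameter.

```python
import queue
import threading


def switch_worker(cmd_queue, send_batch, idle_timeout=1.0):
    """Per-switch worker loop (sketch).

    Blocks until a command arrives, then keeps draining the queue to
    build a batch, lingering up to idle_timeout for stragglers.  The
    send_batch callable stands in for: open SSH, run the batch, save
    the configuration, close the connection.  A None command is a
    shutdown sentinel.
    """
    while True:
        cmd = cmd_queue.get()            # block until the first command
        if cmd is None:
            return
        batch = [cmd]
        while True:
            try:
                # Keep batching while commands keep arriving.
                nxt = cmd_queue.get(timeout=idle_timeout)
            except queue.Empty:
                break                    # queue went idle: flush the batch
            if nxt is None:
                send_batch(batch)
                return
            batch.append(nxt)
        send_batch(batch)
```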
When the main agent thread receives a request, it looks up which switch(es) should be configured, puts the request in the right queue(s), and waits for the eventlet greenthread(s) to finish processing the request(s). For a port creation/deletion, only a single switch needs to be configured. For a network creation/deletion, several switches need configuration in parallel, and the agent performs a "join" to wait for all parallel requests to complete.
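The fan-out and "join" step could look like the sketch below, which pairs each queued request with a completion event. The function name and event-based signalling are assumptions for illustration; real workers would set the event after pushing the batched commands over SSH, and an eventlet implementation would use green primitives rather than `threading`.

```python
import threading


def dispatch_and_join(switch_queues, switches, command, timeout=30.0):
    """Put the command in each target switch's queue together with a
    completion event, then "join": wait until every per-switch worker
    has signalled completion, or raise on timeout.  Sketch only.
    """
    pending = []
    for name in switches:
        done = threading.Event()
        switch_queues[name].put((command, done))
        pending.append((name, done))
    # Join: block until all parallel requests complete.
    for name, done in pending:
        if not done.wait(timeout):
            raise TimeoutError(f"switch {name} did not finish in time")
```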
## HA support
Some people want to run NGS with HA. The design should allow several agents to run in parallel and to fail over if an agent crashes. There are a few problems to solve:
- good batching performance: all requests for a given switch should be routed to the same agent to ensure it can batch them. It has been suggested that a hash-ring could help.
- concurrent requests for the same switch: several agents might try to configure the same switch concurrently. It works on some models, but other models cannot handle this. Possible solutions: rely on retries, or use the distributed etcd locks to limit concurrency.
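To illustrate the hash-ring suggestion above, here is a minimal consistent-hash sketch that maps each switch to a single agent, so all requests for a switch land on the same agent (preserving batching) while limiting re-mapping when an agent fails. This is not a proposed implementation; a real deployment would likely reuse an existing hash-ring library such as the one provided by tooz.

```python
import bisect
import hashlib


def _hash(key):
    # Stable hash; MD5 is fine here since this is placement, not security.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    """Minimal consistent-hash ring (sketch).

    Each agent is placed on the ring at several virtual points
    (replicas) to even out the distribution; a switch is served by the
    first agent clockwise from its own hash.
    """

    def __init__(self, agents, replicas=64):
        self.ring = sorted(
            (_hash(f"{agent}-{i}"), agent)
            for agent in agents
            for i in range(replicas)
        )
        self.keys = [h for h, _ in self.ring]

    def agent_for(self, switch):
        idx = bisect.bisect(self.keys, _hash(switch)) % len(self.ring)
        return self.ring[idx][1]
```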
## No backwards compatibility
It will be difficult to keep compatibility: we plan to cleanly separate the code of the ML2 mechanism driver from the new code of the agent. Using the agent would become mandatory, which means that NGS users will have to start an additional ngs-agent process after upgrading.
This point is open for discussion.
Changed in networking-generic-switch: status: New → Confirmed
A previous RFE attempt to improve performance was https://bugs.launchpad.net/neutron/+bug/1976270 in Neutron, but modifying the interface of the Neutron ML2 mechanism drivers was not acceptable.