SDN journal threads can thrash in a multiprocess environment
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Mellanox backend integration with Neutron (networking-mlnx) | In Progress | Undecided | Mark Goddard | | |
Bug Description
In a typical production environment, neutron server runs on multiple hosts, with multiple processes on each host. Each process has an SDN journal thread to process the journal entries. This means there can be many instances of the journal thread, each processing the same table of data, at an interval (default 10 seconds).
If the request for a row fails during processing, or the row is skipped because of unmet dependencies, it is immediately moved back to the pending state. In this state, another journal thread can pick it up and move it to processing again straight away. Given that there are (N hosts * M processes) journal threads, each checking every 10 seconds, a row with invalid dependencies or a failing request might move between pending and processing many times per second. This thrashing is unnecessary, and can lead to many log messages such as this:
DELETE Port f952fed6-
Also, this can place a heavy load on the database. The contention on these rows can lead to logs such as this being generated by Galera:
BF-BF X lock conflict,mode: 1027 supremum: 0
conflicts states: my 0 locked 0
RECORD LOCKS space id 1235 page no 111 n bits 80 index `GEN_CLUST_INDEX` of table `neutron`
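To put a rough number on the thrashing described above, here is a back-of-the-envelope sketch. It assumes the journal threads wake at evenly spread times and that a stuck row is picked up on every wake-up; the function name and the 3 hosts by 8 processes example are made up for illustration and are not taken from a real deployment.

```python
def pickups_per_second(hosts: int, processes_per_host: int,
                       interval_s: float = 10.0) -> float:
    """Estimate how often a stuck row is moved pending -> processing.

    Each of the hosts * processes_per_host journal threads wakes once
    per interval, so the aggregate wake-up rate is threads / interval.
    """
    threads = hosts * processes_per_host
    return threads / interval_s


# Hypothetical deployment: 3 hosts, 8 neutron server processes each.
rate = pickups_per_second(3, 8)
print(f"{rate:.1f} pickups per second")
```

With these assumed numbers, a single failing row would change state a couple of times per second, even though each individual thread only polls every 10 seconds.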
I suggest that some sort of rate limiting be applied in the 'get_oldest_
For some reason I don't see such thrashing behaviour with the maintenance thread - only one process per host maintains the journal (based on the logs). This is odd given the journal and maintenance threads are started at the same time, although they do use different mechanisms (loopingcall vs python threads).
A retry interval would also make the retry mechanism more sane, ensuring that it is possible to retry over a long enough period of time that an error can be determined to be permanent rather than transient.
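A minimal in-memory sketch of what such a retry interval could look like follows. It is not the networking-mlnx implementation: the `JournalRow` class, the `last_retried` field, and the `pick_next_row` and `mark_failed` helpers are all hypothetical names, and a real fix would apply the same filter in the database query that selects the next pending row.

```python
from dataclasses import dataclass
from typing import List, Optional

PENDING = "pending"
PROCESSING = "processing"


@dataclass
class JournalRow:
    seqnum: int
    state: str = PENDING
    # Hypothetical column: time of the last failed attempt, None if never tried.
    last_retried: Optional[float] = None
    retry_count: int = 0


def pick_next_row(rows: List[JournalRow], now: float,
                  retry_interval: float) -> Optional[JournalRow]:
    """Return the oldest pending row whose retry interval has elapsed."""
    eligible = [
        r for r in rows
        if r.state == PENDING
        and (r.last_retried is None or now - r.last_retried >= retry_interval)
    ]
    if not eligible:
        return None
    row = min(eligible, key=lambda r: r.seqnum)
    row.state = PROCESSING
    return row


def mark_failed(row: JournalRow, now: float) -> None:
    """Move a row back to pending and record when the attempt failed."""
    row.state = PENDING
    row.last_retried = now
    row.retry_count += 1
```

Under this scheme a failed row stays invisible to every journal thread until the interval has passed, regardless of how many threads are polling. The period over which an error is retried also becomes predictable: roughly the retry interval multiplied by the maximum retry count.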