From a (possibly naive) outsider's viewpoint, it seems that the architecture is over-complex. I guess this is because you have an enormous amount of data, and a need to do regular heavy lifting whenever source code repositories are updated, all for a file format that is rather transaction- and query-unfriendly.
i.e. a combination of:
- hosting a lot of projects
- needing to regularly hook into SCMs
- database-unfriendly file format
It seems to me (again, I could be naive - forgive me) that storing stuff as POs is a bad idea, and you should just put it in a regular relational database and convert it to PO on demand. That would allow easy querying, efficient indexing, etc., without having to try and keep all this stuff in sync. I would imagine PO exports are relatively rare (computationally speaking), so the conversion would not be too costly. Maybe PO files change more often than that, or maybe there is a good reason to stick to a native PO model, or maybe you're heavily invested here, but I just wanted to give my outsider's perspective.
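To make the suggestion concrete, here is a rough sketch of what on-demand conversion could look like. The table and column names are entirely invented for illustration - I have no idea what your actual schema would be:

```python
# Hypothetical sketch: translation units live in a relational store,
# and PO text is only serialized when somebody actually asks for it.
# Table/column names ("units", "msgid", "msgstr") are invented.
import sqlite3


def export_po(conn, project, lang):
    """Serialize stored translation units back into PO text on demand."""
    rows = conn.execute(
        "SELECT msgid, msgstr FROM units WHERE project = ? AND lang = ?",
        (project, lang),
    )
    lines = []
    for msgid, msgstr in rows:
        # Real PO escaping is more involved; this only handles quotes.
        lines.append('msgid "%s"' % msgid.replace('"', '\\"'))
        lines.append('msgstr "%s"' % msgstr.replace('"', '\\"'))
        lines.append("")
    return "\n".join(lines)
```

The point being that queries and indexing come for free from the database, and the PO file becomes a cheap, derived artifact rather than the source of truth.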
My other suggestion is that, instead of a background process running over huge amounts of data, you could do a foreground process triggered when a logged-in user views the index of POs. Not every time, and not necessarily in real time, but maybe trigger it at that point, if it hasn't happened already since the last POT upload.
That way, you can replace the need to do routine heavy lifting across your entire architecture with the need to keep PO listings reasonably fresh for (I guess) the minority of users actually using them (I am just guessing here that there are a lot of POs in your architecture that rarely get touched).
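The lazy-refresh idea above could be sketched like this. All the names here (`PoIndex`, `refresh_stats`, the timestamp attributes) are made up to illustrate the pattern, not anything from your code:

```python
import time


class PoIndex:
    """Refresh stats lazily on view instead of via a background sweep.

    Attribute and method names are invented for illustration.
    """

    def __init__(self):
        self.last_pot_upload = 0.0  # set whenever a new POT is uploaded
        self.last_refresh = 0.0     # set whenever stats were recomputed

    def view_index(self):
        # Only recompute if a POT arrived since the last refresh,
        # so untouched projects never pay the cost.
        if self.last_refresh < self.last_pot_upload:
            self.refresh_stats()
            self.last_refresh = time.time()
        return "index page"

    def refresh_stats(self):
        pass  # the expensive recomputation would go here
```

So a project nobody looks at never gets recomputed, and a project that is viewed ten times between uploads only pays once.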
Last thing: I wonder if your background task(s) could be smarter. Maybe you are reparsing whole PO files when a checksum check would first tell you whether they've actually changed? There may be other ways to avoid full reparses too (file modification times, file sizes).
Just my 3.149 cents. Again, I know nothing about your code - I am just coming in as an outsider.