Comment 2 for bug 589521

James Westby (james-w) wrote : Re: [Bug 589521] Re: nagios monitoring of package imports needed

On Wed, 26 Jan 2011 18:09:59 -0000, John A Meinel <email address hidden> wrote:
> For a quick summary:
> I think we can migrate to another machine with minimal fuss.
> We'll still need direct login to the new machine for the foreseeable
> future because most maintenance tasks (restarting a failing import)
> require manual intervention.

It would be great to make that clicky-clicky in the web UI, but I don't
think that's a trivial task.

> 1) package-import is currently monitored manually. Which prior to this
> week basically meant whenever James Westby got around to checking
> on it. (Or someone complained sufficiently about a failure.)
> It would be nice to get some level of nagios warning/critical so
> that we don't have to manually poll the service.

There was some discussion at the sprint about not overloading the LOSAs
with this, and perhaps notifying "us" rather than them when something
was wrong, but that would seem to be in conflict with it being a fully
LOSA-managed service.

> 2) Jubany is a powerful server which is meant to be assigned to another task.
> a) We shouldn't need this much hardware. It really depends on the QoS
> we want to provide after major updates. Most of the time there
> aren't huge numbers of packages getting .deb updates. Except when
> we open up a new release series, etc. Also notable here are when
> we fix a major bug and suddenly 600 packages need to be
> re-scanned.

The new series case could be optimised. Currently it does it the dumb
way, and we are just careful to stop the importer while we do the
shuffle of branches on codehosting to get the optimum disk usage.

> b) Load on the system can probably be easily tuned by how many
> parallel imports we run. On Jubany it is 8. This relates to how
> many CPUs, how much peak memory, etc.

It's trivial to tune this at run time.

IIRC 8 was the limit because any higher was adversely affecting
codehosting at times. That may have been due to the bug that John fixed
where we weren't reusing SSH connections correctly.

It would be good to know what the bottleneck actually is: disk I/O, or
network communication with codehosting.
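To illustrate what "trivial to tune at run time" could look like, here is a small sketch (not the importer's actual code; the config file name and class are made up) of a dispatcher that re-reads its concurrency limit before starting each job, so editing a file changes the number of parallel imports without a restart:

```python
# Hypothetical sketch of a runtime-tunable concurrency cap like the
# "8 parallel imports" on Jubany. The limit is re-read from a config
# file before each dispatch, so it can be changed while running.
# "max_parallel.conf" and ImportDispatcher are illustrative names.
import threading

CONF = "max_parallel.conf"  # assumed: a file containing one integer

def read_limit(default=8):
    try:
        with open(CONF) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return default

class ImportDispatcher:
    def __init__(self):
        self._active = 0
        self._cond = threading.Condition()

    def run(self, job):
        with self._cond:
            # Re-check the limit each time, so edits take effect
            # for the next job without restarting the service.
            while self._active >= read_limit():
                self._cond.wait(timeout=1.0)
            self._active += 1
        try:
            job()
        finally:
            with self._cond:
                self._active -= 1
                self._cond.notify()
```

With a scheme like this, lowering the number in the file would let in-flight imports finish while new ones queue up behind the smaller cap.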

> d) The system doesn't scale to multiple machines particularly well.
> It currently uses an sqlite database for tracking its state. We
> could probably migrate it to a postgres db, etc, and then have a
> clearer way to scale it horizontally. (Ideally you could run it as
> a cloud-ish service, and then on a new release just fire up 20
> instances to churn through the queue.)

There's another RT open for postgres. Aside from that, the importer
currently uses file locks to ensure only one process is active per
package at a time. It may be possible to avoid that, or to use
something else for locking, but nothing it does would prevent it from
scaling in this manner.
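The per-package file locking described above is the general pattern of an exclusive, non-blocking lock on one lock file per package. A minimal sketch, assuming Unix `flock` semantics (the lock directory and function names are illustrative, not the importer's layout):

```python
# Hedged sketch of per-package file locking: one lock file per
# package, taken with an exclusive non-blocking flock, so a second
# process working on the same package backs off immediately.
# LOCK_DIR and the function names are illustrative assumptions.
import errno
import fcntl
import os

LOCK_DIR = "/tmp/import-locks"  # assumed location

def try_lock_package(package):
    """Return an open file holding the lock, or None if already held."""
    os.makedirs(LOCK_DIR, exist_ok=True)
    f = open(os.path.join(LOCK_DIR, package + ".lock"), "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as e:
        f.close()
        if e.errno in (errno.EACCES, errno.EAGAIN):
            return None  # another process is importing this package
        raise
    return f

def unlock_package(f):
    fcntl.flock(f, fcntl.LOCK_UN)
    f.close()
```

Note that locks like these only work within a single machine (NFS aside), which is exactly why the locking would need rethinking before the importer could spread across several hosts.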

> 3) Restricting access to the machine, so that LOSAs are the only ones
> with direct access.
> a) The way the process works today, this is not feasible.
> At least partially because of the sqlite state db. We would need
> to expose a fair amount of functionality in order to do regular
> maintenance.

Or ask the LOSAs to do it on your behalf, I guess? I imagine that
would become tiring for everyone very rapidly.

> 4) If we are doing major rework of the import system, we might consider
> trying to rephrase the problem in terms of the vcs-import system.
> Which already has hardware and process infrastructure to handle some
> similar issues. (I need to import #N jobs, fire them off, make sure
> they are running, report when they are failing, etc.)

I started drafting some notes for this:

Perhaps some of this has already been solved with e.g. the bzr-git
caches, and so it would be easier to just add a new vcs-imports job.