The original RT #39614 related to this was about nagios integration, but grew into moving the package importer under LOSA control. So I'm posting a bit of 'why this can and cannot work today' to this bug, though there are probably several separate bugs that should be split out of my post.
We talked about this a fair amount at the recent sprint. I wasn't aware of this RT, though I was aware of the effort to get the importer under LOSA control.
For a quick summary:
I think we can migrate to another machine with minimal fuss.
We'll still need direct login to the new machine for the foreseeable
future because most maintenance tasks (restarting a failing import)
require manual intervention.
I would like to see at least a little nagios integration, so that we
can move polling the state of the import from being manually done to
being automated.
At the moment, there are a few aspects to this that I think are relevant.
1) package-import is currently monitored manually, which prior to this
week basically meant whenever James Westby got around to checking
on it (or someone complained loudly enough about a failure).
It would be nice to get some level of nagios warning/critical so
that we don't have to manually poll the service.
Since the imports aren't perfect yet, we can't just alert on "there is
a failing import", but we could alert on "we normally have around 500
failed imports, and now we have 1000". That would help catch the "can
no longer reach archive.debian.org through Canonical's firewall" class
of failure. (A rough sketch of such a check follows after this point.)
As we improve the UDD workflow, eventually this sort of
infrastructure either becomes critical, or becomes obsolete. (People
start depending on the branches to exist, but they may also start
creating the branches directly, rather than having the importer
doing the work.)
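To make that concrete, here is a rough sketch of what such a nagios check could look like. The db path, the table name ('failures') and the thresholds are all assumptions on my part, not the real schema:

#!/usr/bin/env python
# Rough sketch of a nagios check, assuming the failure count can be read
# out of the importer's sqlite status db.  The db path, table name and
# thresholds are guesses and would need to match the real schema.
import sqlite3
import sys

DB_PATH = '/srv/package-import.canonical.com/new/status.db'  # hypothetical
WARN_THRESHOLD = 700     # "normally ~500 failed imports"
CRIT_THRESHOLD = 1000    # "now we have 1000" -> probably a systemic failure

def count_failures(db_path):
    conn = sqlite3.connect(db_path)
    try:
        # hypothetical table name; the real schema may differ
        return conn.execute("SELECT COUNT(*) FROM failures").fetchone()[0]
    finally:
        conn.close()

def main():
    try:
        failures = count_failures(DB_PATH)
    except sqlite3.Error:
        print("UNKNOWN: could not read importer state db")
        return 3
    if failures >= CRIT_THRESHOLD:
        print("CRITICAL: %d failed imports" % failures)
        return 2
    if failures >= WARN_THRESHOLD:
        print("WARNING: %d failed imports" % failures)
        return 1
    print("OK: %d failed imports" % failures)
    return 0

if __name__ == '__main__':
    sys.exit(main())

nagios would then run this like any other check, and the thresholds could be tightened as the baseline failure count comes down.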
2) Jubany is a powerful server which is meant to be assigned to another task.
a) We shouldn't need this much hardware. It really depends on the QoS
we want to provide after major updates. Most of the time there
aren't huge numbers of packages getting .deb updates, except when
we open up a new release series, etc. Also notable is the case where
we fix a major bug and suddenly 600 packages need to be re-scanned.
b) Load on the system can probably be tuned easily by how many
parallel imports we run; on Jubany it is 8. That number drives how
many CPUs and how much peak memory we need, etc. (See the sketch
after this point.)
c) The code isn't particularly optimized for low load per import yet.
It depends on whether it is better to tweak that, or just spend $ on
more hardware.
d) The system doesn't scale to multiple machines particularly well.
It currently uses an sqlite database for tracking its state. We
could probably migrate it to a postgres db, etc, and then have a
clearer way to scale it horizontally. (Ideally you could run it as
a cloud-ish service, and then on a new release just fire up 20
instances to churn through the queue.)
e) Anyway, there are no real blockers *today* to hosting the service on
a new machine, as long as the state gets copied over correctly.
(Just copying the /srv/package-import.canonical.com/new directory
is probably sufficient.)
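On b), a minimal sketch of what "parallelism as a single tunable" could look like. import_package() here is a stand-in for whatever the real per-package entry point is, and 'import-package' is a hypothetical script name, not the actual driver:

# Minimal sketch: the concurrency is one knob, so a smaller machine just
# runs with a lower number.
import multiprocessing
import subprocess

MAX_CONCURRENT_IMPORTS = 8   # what Jubany runs today; lower on a smaller box

def import_package(package_name):
    # Hypothetical: shell out to a per-package import script and report
    # whether it exited successfully.
    return (package_name,
            subprocess.call(['import-package', package_name]) == 0)

def run_queue(package_names):
    # Run the queued packages with at most MAX_CONCURRENT_IMPORTS in flight.
    pool = multiprocessing.Pool(processes=MAX_CONCURRENT_IMPORTS)
    try:
        return pool.map(import_package, package_names)
    finally:
        pool.close()
        pool.join()

If the state ever moved from sqlite to postgres as in d), each machine could run the same sort of loop against a shared job table, which is where the "fire up 20 instances" idea would come from.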
3) Restricting access to the machine, so that LOSAs are the only ones with direct access.
a) The way the process works today, this is not feasible, at least
partially because of the sqlite state db. We would need to expose a
fair amount of functionality to make regular maintenance possible.
b) For example, a package can get a temporary failure (connection
reset, etc). If this failure hasn't been seen before, it gets
marked as failing immediately, and needs manual intervention to
get going again.
It would be possible to add automatic retry for all failures.
James was concerned about data corruption, but has stated that he
hasn't seen any failures that would corrupt anything if they were
retried. Stuff that would cause inconsistency at least
consistently fails. (A sketch of what a retry filter could look
like follows after this point.)
c) On the other hand, there are still some race conditions, which
means that a package can get wedged if someone is adding new data
to the packaging branch at the same time as the importer is trying
to push to it. Resolving this is still *very* manual, as it
involves figuring out what actually happened, then resetting state
accordingly.
Some of this could be a button click for "ignore local state". But
the importer actively processes multiple branches at a time, so it
is possible for the Launchpad branches to end up in a weird state
(the upstream, debian and ubuntu branches could all have tags
pointing at different revisions, all claiming to be
'upstream-release-1.2'). A small check for that case is also
sketched after this point.
If we really want to get this machine hidden behind the iron
curtain, then as we encounter issues, we can slowly add more
external openings for us to fix things.
However, it is going to be a while before we have enough of them
that we aren't pestering a LOSA to do something specific for each
of the 500-ish failures we have today.
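On b), the automatic retry doesn't need to be clever. Something along these lines would do, with the patterns, the retry limit, and the requeue/mark_failed hooks all being placeholders for whatever the status db actually exposes:

# Sketch of retrying transient failures automatically.  The patterns, the
# retry limit, and the requeue/mark_failed callables are placeholders.
import re

TRANSIENT_PATTERNS = [
    r'Connection reset by peer',
    r'Connection timed out',
    r'Temporary failure in name resolution',
]
MAX_RETRIES = 3

def is_transient(failure_message):
    # A failure is treated as transient if its message matches any of the
    # known network-hiccup patterns.
    return any(re.search(p, failure_message) for p in TRANSIENT_PATTERNS)

def handle_failure(package, failure_message, retry_count, requeue, mark_failed):
    # Transient failures go back on the queue (up to MAX_RETRIES); anything
    # else is marked as failing and waits for a human, as today.
    if is_transient(failure_message) and retry_count < MAX_RETRIES:
        requeue(package, retry_count + 1)
    else:
        mark_failed(package, failure_message)

Failures that "consistently fail" would exhaust the retries and land exactly where they land today, so the corruption concern shouldn't get any worse.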
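And on c), a first step towards a button click would be something that at least detects the inconsistent-tags case automatically. A small sketch, kept independent of how the tags are actually read (bzr API, a dump of the branches, whatever):

# Flags tags that point at different revisions on different branches
# (e.g. upstream vs debian vs ubuntu all claiming 'upstream-release-1.2').
def divergent_tags(branch_tags):
    # branch_tags: {branch_name: {tag_name: revision_id}}
    by_tag = {}
    for branch_name, tags in branch_tags.items():
        for tag_name, revid in tags.items():
            by_tag.setdefault(tag_name, {})[branch_name] = revid
    # Keep only tags that appear with more than one distinct revision.
    return dict((tag, revs) for tag, revs in by_tag.items()
                if len(set(revs.values())) > 1)

# Example: flags 'upstream-release-1.2' because the debian branch disagrees.
print(divergent_tags({
    'upstream': {'upstream-release-1.2': 'rev-a'},
    'debian':   {'upstream-release-1.2': 'rev-b'},
    'ubuntu':   {'upstream-release-1.2': 'rev-a'},
}))

Fixing the divergence would still be the manual part, but a report of which packages are in that state would already cut down the digging.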
4) If we are doing a major rework of the import system, we might consider
trying to rephrase the problem in terms of the vcs-import system,
which already has hardware and process infrastructure to handle
similar issues (I need to run N import jobs, fire them off, make sure
they are running, report when they fail, etc.).