The mirror prober should check a few files from each mirror in paralel instead of a lot of files from a single mirror

Bug #54791 reported by Guilherme Salgado
16
Affects Status Importance Assigned to Milestone
Launchpad itself
Fix Released
Medium
Guilherme Salgado

Bug Description

The prober checks a lot of files from a single mirror in paralel, which in some cases might make the mirror think it's being DOSed, causing it to stop answering the prober requests. This is speccially a problem when we check for architectures not mirrored in a given mirror, as the prober will be checking for a lot of unexistent files, making the mirror think it's being DOSed.

To properly solve this we need to make the prober check only a few files from each mirror at a time.

Changed in launchpad:
assignee: nobody → salgado
importance: Untriaged → Medium
status: Unconfirmed → Confirmed
Revision history for this message
Guilherme Salgado (salgado) wrote :

Fixing this would also allow us to probe some ftp-only mirrors (like https://launchpad.net/distros/ubuntu/+mirror/uk-mirror-service-cd) that limit the number of connections for each IP address.

Revision history for this message
Guilherme Salgado (salgado) wrote :

Andrew,

Would it be possible to randomize the deferreds after they're added to
the DeferredSemaphore? This may not be very reliable, but since we'll be probing 100 or more mirrors in paralel, it's very unlikely that we'd have problems with this solution.

If you don't like it or it's not possible, can you see an easy solution to this problem?

Revision history for this message
Andrew Bennetts (spiv) wrote :

I don't like randomisation. It's not 100% reliable, and it makes writing tests awkward.

Currently, we just have a single semaphore, limiting overall concurrent connections.

What we need instead is a semaphore per host, limiting concurrent connections per host. We probably also want to preserve overall limiting as well.

The code could look something like this:

  host_semaphores = {}
  for url in urls:
      sem = host_semaphores.setdefault(url.host, DeferredSemaphore(2))
      deferred = sem.run(prober.probe)
      # from here down is unchanged
      deferred.addErrback(callbacks.logMissingURL, url)
      deferredList.append(deferred)
      # etc ...

If we want to preserve the overall concurrency, we'd probably need to extend it to be something like:

  host_locks = {}
  for url in urls:
      multi_lock = MultiLock([overall_semaphore, DeferredSemaphore(2)])
      multi_lock = host_locks.setdefault(url.host, multi_lock)
      deferred = multi_lock.run(prober.probe)
      # continues as before...

(Although perhaps it would be clearer if I didn't use setdefault...)

Here "overall_semaphore" is the existing "semaphore" in cronscripts/distributionmirror-prober.py. MultiLock is this helper class:

class MultiLock(twisted.internet.defer._ConcurrencyPrimitive):
    """A lock that acquires multiple underlying locks before it is acquired."""

    def __init__(self, locks):
        twisted.internet.defer._ConcurrencyPrimitive.__init__(self)
        self.locks = locks

    def acquire(self):
        return gatherResults([lock.acquire() for lock in self.locks])

    def release(self):
        for lock in self.locks:
            lock.release()

(I also sketched out the following test while writing this helper TDD style:

    def testMultiLock(self):
        sem1 = defer.DeferredSemaphore(1)
        sem2 = defer.DeferredSemaphore(1)
        d1 = defer.Deferred()

        # keep sem1 busy
        sem1.run(lambda: d1)

        self.multi_acquired = 0
        self.multi_acquired2 = 0
        multi_sem = MultiLock([sem1, sem2])
        multi_sem.run(lambda: setattr(self, 'multi_acquired', 1))
        multi_sem.run(lambda: setattr(self, 'multi_acquired2', 1))
        self.assertFalse(self.multi_acquired)
        self.assertFalse(self.multi_acquired2)

        # release sem1
        d1.callback(None)

        # multi_sem will now have been able to acquire both semaphores, and so
        # it will have run its task.
        self.assertTrue(self.multi_acquired)
        self.assertTrue(self.multi_acquired2)

        # keey multi_sem busy
        d1 = defer.Deferred()
        multi_sem.run(lambda: d1)
        self.multi_acquired = 0
        multi_sem.run(lambda: setattr(self, 'multi_acquired', 1))
        self.assertFalse(self.multi_acquired)
        # unblock multi_sem
        d1.callback(None)
        self.assertTrue(self.multi_acquired)

)

Revision history for this message
Guilherme Salgado (salgado) wrote :

Today we have around 120 release mirrors and the same number of archive mirrors and I don't expect we'll reach a prohibitively high number of mirrors in the near future. Also, we don't probe archive and release mirrors at the same time, so I think there's no reason for keeping the overall limit. What do you think?

Changed in launchpad:
status: Confirmed → In Progress
Revision history for this message
Andrew Bennetts (spiv) wrote : Re: [Bug 54791] Re: The mirror prober should check a few files from each mirror in paralel instead of a lot of files from a single mirror

I'm happy with your judgement here, you know the requirements of the problem
(and of the admins) better than I do. (I'm also happy that the simpler code
will be sufficient). Just be sure to make it clear in the code (i.e. add
appropriate comments) what the behaviour of the code is, and what its
limitations are. That way it'll be easy to re-evaluate the solution if the
situation changes.

In short: DO IT! :)

Changed in launchpad:
status: In Progress → Fix Committed
Changed in launchpad:
status: Fix Committed → Fix Released
Revision history for this message
G G Papazoglou (grp) wrote :

Same problem here, ours is mirrors.med.uoc.gr

Perhaps the 'temporary removal from the mirror list' feature should be disabled until the problem is fixed, since it is not the mirror's responsibility.

Revision history for this message
Guilherme Salgado (salgado) wrote :

This problem has been solved already. You can see that by the bug's status (Fix Released).

The mirror you mentioned has been removed because it didn't contain the ISOs for the 6.06.2 release (at the time it was checked), as we see in the logfile (http://launchpadlibrarian.net/11461464/mirrors.med.uoc.gr-release-probe-logfile.txt). This is normal because it usually takes some time for a mirror to sync with the main repository once new ISOs are released and your mirror should be added back to the list once it is in sync.

Revision history for this message
Dmitry Sherman (admin-interhost) wrote :

we have the same problem, i need the prober's ip addresses to add it to the whitelist.

Curtis Hovey (sinzui)
tags: added: tech-debt
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.