buildd-manager doesn't give us a good way of determining it's in a failed state

Bug #451351 reported by Tom Haddon on 2009-10-14

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Launchpad itself	Triaged	High	Unassigned	

Bug Description

After a DB restart this morning we were seeing the following pattern in the logs:

2009-10-14 09:00:40+0100 [-] Starting scanning cycle.
2009-10-14 09:00:40+0100 [-] Slave Scan Process Initiated.
2009-10-14 09:00:40+0100 [-] Buildd Master has been initialised
2009-10-14 09:00:40+0100 [-] Setting Builders.
2009-10-14 09:00:40+0100 [-] Slave Scan Process Initiated.
2009-10-14 09:00:40+0100 [-] Buildd Master has been initialised
2009-10-14 09:00:40+0100 [-] Setting Builders.
2009-10-14 09:00:40+0100 [-] Slave Scan Process Initiated.
2009-10-14 09:00:40+0100 [-] Buildd Master has been initialised
2009-10-14 09:00:40+0100 [-] Setting Builders.
2009-10-14 09:00:40+0100 [-] Scanning failed with: Already disconnected
2009-10-14 09:00:40+0100 [-] Finishing scanning cycle.
2009-10-14 09:00:40+0100 [-] Scanning cycle finished.

However, the process was responding to nagios checks fine. As a result, we were only able to tell something was wrong based on user feedback.

Tom Haddon (mthaddon) on 2009-10-14

Changed in soyuz:
importance:	Undecided → High

Julian Edwards (julian-edwards) on 2009-10-14

Changed in soyuz:
status:	New → Triaged
tags:	added: soyuz-build

Julian Edwards (julian-edwards) on 2009-10-21

tags:	added: tech-debt

Julian Edwards (julian-edwards) on 2009-12-14

tags:	added: buildd-manager

Julian Edwards (julian-edwards) wrote on 2009-12-15:

The internal log reporting for scan failures is currently very opaque. It should include a stack trace with the error shown. That is easy to do with something like this:

=== modified file 'lib/lp/buildmaster/manager.py'
--- lib/lp/buildmaster/manager.py 2009-07-26 14:19:49 +0000
+++ lib/lp/buildmaster/manager.py 2009-12-14 20:46:44 +0000
@@ -238,6 +238,7 @@
         """Deal with scanning failures."""
         self.logger.info(
             'Scanning failed with: %s' % error.getErrorMessage())
+        error.printTraceback()
         self.finishCycle()

Tom Haddon (mthaddon) on 2010-05-28

tags:	added: canonical-losa-lp

Julian Edwards (julian-edwards) wrote on 2010-10-21:

The ideal check is one where we look at the queue for a particular architecture and, if it has outstanding builds but the builders for that arch have been idle for > N seconds, raise a Nagios error.

We should be able to make the relevant data available on the API.
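
A rough sketch of what such a check could look like, assuming the queue depth and per-builder idle times are already available from somewhere (for example the API mentioned above). The `ArchSnapshot` type, `check_arch` function and the 300 second threshold are made up for illustration and are not existing Launchpad code; only the exit codes follow the usual Nagios plugin convention.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Standard Nagios plugin exit codes.
NAGIOS_OK = 0
NAGIOS_CRITICAL = 2
NAGIOS_UNKNOWN = 3


@dataclass
class ArchSnapshot:
    """Hypothetical per-architecture data; not an existing Launchpad type."""
    arch: str
    pending_builds: int
    # Seconds each builder has been idle; 0.0 means it is currently building.
    builder_idle_seconds: List[float]


def check_arch(snapshot: ArchSnapshot, max_idle_seconds: float) -> Tuple[int, str]:
    """Return (nagios_status, message) for one architecture."""
    if not snapshot.builder_idle_seconds:
        # Nothing to judge idleness by, so do not claim OK or CRITICAL.
        return NAGIOS_UNKNOWN, "%s: no builders reported" % snapshot.arch
    if snapshot.pending_builds == 0:
        return NAGIOS_OK, "%s: queue is empty" % snapshot.arch
    # Only alert when every builder has been idle longer than the threshold,
    # i.e. even the most recently active one.
    shortest_idle = min(snapshot.builder_idle_seconds)
    if shortest_idle > max_idle_seconds:
        return NAGIOS_CRITICAL, (
            "%s: %d builds pending but all builders idle for more than %d seconds"
            % (snapshot.arch, snapshot.pending_builds, max_idle_seconds))
    return NAGIOS_OK, "%s: %d builds pending, builders active" % (
        snapshot.arch, snapshot.pending_builds)
```

A Nagios wrapper would call `check_arch` once per architecture, print the message and exit with the highest status returned. Requiring all builders to be idle is one reading of the condition above; a stricter check could alert as soon as any builder sits idle while the queue is non-empty.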

Robert Collins (lifeless) wrote on 2012-01-03:

Also on the error side - OOPS FTW. :)
