monitoring oops rates is hard

Bug #1018574 reported by Robert Collins
This bug affects 2 people
Affects: python-oops-tools
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

We'd like to be able to alert when an oops spike happens, ideally per source (U1, ISD, LP, etc.), as that would let us find out before everything comes crashing down around our ears. For example, the recent DoS attacks and poor APIs would have been picked up.

One way to do this would be to publish a stream of oops metadata (e.g. project, time, samples in the last minute) that can be consumed by e.g. Esper or custom code. The stream might be AMQP-based, or simply written to stdout. Polling is possible but scales poorly. We probably don't want the spike-analysis code to process the full body of each oops, so a separate network of consumers is likely sensible.
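To make the metadata idea concrete, here is a rough sketch of reducing a full oops report to a small record and writing it as one JSON line to stdout. The function names (`oops_metadata`, `emit`), the input keys (`reporter`, `time`) and the output field names are assumptions for illustration only, not an existing python-oops-tools API.

```python
import json
import sys
from datetime import datetime, timezone


def oops_metadata(report, samples_last_minute):
    """Reduce a full oops report (a dict) to a small metadata record.

    The 'reporter' and 'time' keys are assumed here; the large fields
    (traceback, request variables, ...) are deliberately left out.
    """
    return {
        "project": report.get("reporter", "unknown"),
        "time": report["time"].isoformat(),
        "samples_last_minute": samples_last_minute,
    }


def emit(record, stream=sys.stdout):
    """Write one record per line so consumers can read the stream incrementally."""
    stream.write(json.dumps(record, sort_keys=True) + "\n")
    stream.flush()


if __name__ == "__main__":
    # Hypothetical report, for demonstration only.
    report = {
        "reporter": "LP",
        "time": datetime(2012, 6, 27, 12, 0, tzinfo=timezone.utc),
        "traceback": "...",
    }
    emit(oops_metadata(report, samples_last_minute=3))
```

The same record could equally be published as an AMQP message body; stdout is used here only because it needs no broker.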

Implementation sketch:

amqp2disk -> stream of oops metadata messages -> aggregator which tracks smoothed rate over time and alerts on spikes.
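The aggregator stage could look something like the following sketch. It assumes per-source counts arrive once per fixed interval (e.g. one minute) and uses an exponentially weighted moving average as the "smoothed rate"; that choice, the class name `SpikeDetector`, and all thresholds are illustrative assumptions rather than anything specified in this report.

```python
from collections import defaultdict


class SpikeDetector:
    """Track a smoothed per-source oops rate and flag sudden spikes."""

    def __init__(self, alpha=0.2, factor=3.0, min_count=10, warmup=5):
        self.alpha = alpha          # weight given to the newest interval
        self.factor = factor        # spike if count > factor * smoothed rate
        self.min_count = min_count  # ignore spikes on very small counts
        self.warmup = warmup        # intervals to observe before alerting
        self._rate = {}
        self._seen = defaultdict(int)

    def observe(self, source, count):
        """Feed one interval's count for a source; return True on a spike."""
        baseline = self._rate.get(source)
        spike = (
            baseline is not None
            and self._seen[source] >= self.warmup
            and count >= self.min_count
            and count > self.factor * baseline
        )
        # Update the smoothed rate after the comparison, so a spike is
        # judged against the history that preceded it.
        if baseline is None:
            self._rate[source] = float(count)
        else:
            self._rate[source] = self.alpha * count + (1 - self.alpha) * baseline
        self._seen[source] += 1
        return spike


if __name__ == "__main__":
    detector = SpikeDetector()
    for count in [5, 6, 5, 4, 5, 6, 40]:
        if detector.observe("LP", count):
            print("ALERT: oops spike for LP, count=%d" % count)
```

Keeping separate state per source means a spike in one source (say LP) is not masked by quiet traffic from another. A real deployment would also need to feed in zero counts for silent intervals so the baseline decays.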
