monitoring oops rates is hard
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
python-oops-tools | Triaged | High | Unassigned |
Bug Description
We'd like to be able to alert on an oops spike (ideally per source: U1/ISD/LP etc.), as that would let us find out before everything comes crashing down around our ears: e.g. the recent DoS attacks and poor APIs would have been picked up.
One way to do this would be to have a stream of oops metadata (e.g. project, time, samples since last minute) that can be consumed by e.g. Esper or custom code. This might be AMQP-based, or stdout. Polling is possible but scales poorly. We probably don't want the spike-analysis code examining the full body of each oops, so a separate network of consumers is likely sensible.
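As a rough illustration of what such a metadata message might look like (the field names and OOPS id here are invented for the example, not an existing schema), a small JSON payload would keep the stream cheap to consume:

```python
import json
import time

# Illustrative only: field names and values are assumptions, not a
# defined python-oops-tools schema.
oops_metadata = {
    "project": "launchpad",      # source, e.g. U1/ISD/LP
    "oops_id": "OOPS-example1",  # hypothetical identifier
    "timestamp": time.time(),    # when the oops occurred
}

# Serialise for publishing on e.g. an AMQP exchange or stdout.
message = json.dumps(oops_metadata)
```

Consumers only need to parse this small message to count oopses per project per minute, rather than fetching each full report.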
E.g. an implementation sketch:
amqp2disk -> stream of oops metadata messages -> aggregator that tracks a smoothed rate over time and alerts on spikes.
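The aggregator step above could be sketched as follows. This is a minimal illustration, not an existing python-oops-tools component: it keeps an exponentially smoothed per-project oops rate and flags a spike when one minute's count exceeds a multiple of the smoothed baseline. The class name, parameters, and thresholds are all assumptions for the example.

```python
class SpikeDetector:
    """Hypothetical per-project spike detector using an EWMA baseline."""

    def __init__(self, alpha=0.3, threshold=3.0, warmup=5):
        self.alpha = alpha          # EWMA smoothing factor
        self.threshold = threshold  # spike = count > threshold * baseline
        self.warmup = warmup        # minutes observed before alerting
        self.baseline = {}          # project -> smoothed oopses/minute
        self.seen = {}              # project -> minutes observed so far

    def observe(self, project, count):
        """Feed one minute's oops count for a project; return True on spike."""
        seen = self.seen.get(project, 0)
        baseline = self.baseline.get(project, float(count))
        # Only alert once we have enough history to trust the baseline.
        spike = seen >= self.warmup and count > self.threshold * max(baseline, 1.0)
        # Update the smoothed baseline after the spike check, so a spike
        # does not immediately pollute its own reference level.
        self.baseline[project] = (1 - self.alpha) * baseline + self.alpha * count
        self.seen[project] = seen + 1
        return spike
```

A consumer of the metadata stream would call `observe()` once per project per minute; how the alert is delivered (nagios, email, etc.) is left open here, as in the sketch above.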