oopses do not gather environmental data(load, thread-cpu-time, ...)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Launchpad itself | Triaged | High | Unassigned |
OOPS model | Triaged | High | Unassigned |
python-oops-tools | Triaged | High | Unassigned |
Bug Description
When timeouts occur, they can be caused by a) inefficient code or b) external influences.
We should gather enough data that we don't spend time debugging the wrong things.
Specifically we should gather:
- system load average
- number of CPU cores (to normalise the load average)
- process memory & physical memory (to guesstimate whether we're hitting swap)
- *process* time since the request started. As each request is in a separate thread, the OS's system accounting can tell us whether 5 seconds of wall clock time was 5 seconds of CPU time, or 1 second of CPU time.
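A minimal sketch of how the first three items might be collected, assuming Python 3 on Linux. The function name `gather_environment` and the dictionary keys are hypothetical and are not part of any existing OOPS API; how the result would be attached to an OOPS report is left open.

```python
import os
import resource


def gather_environment():
    """Return a snapshot of host and process state (hypothetical helper)."""
    # 1-, 5- and 15-minute system load averages.
    load1, load5, load15 = os.getloadavg()
    # cpu_count() may return None when the count cannot be determined.
    cores = os.cpu_count() or 1
    # Peak resident set size of this process (kilobytes on Linux).
    usage = resource.getrusage(resource.RUSAGE_SELF)
    # Total physical memory, from the page size and page count (Linux).
    physical_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return {
        "load_1min": load1,
        "load_5min": load5,
        "load_15min": load15,
        "cpu_cores": cores,
        # Load normalised by core count; values above 1.0 suggest contention.
        "load_per_core": load1 / cores,
        "process_max_rss_kb": usage.ru_maxrss,
        "physical_memory_bytes": physical_bytes,
    }
```

Comparing `process_max_rss_kb` against `physical_memory_bytes` only gives a rough hint about swapping, in the spirit of the "guesstimate" above.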
The canonical.
We are hitting many questions we cannot answer today as a result of not knowing these things.
Alternatively:
#RUSAGE_THREAD = 1 on my linux system - we'd want a C extension to get the right constant
resource.
should give us what we need.
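As an illustration of the per-thread accounting idea, here is a hedged sketch assuming Python 3 on Linux, where the standard `resource` module exposes `RUSAGE_THREAD` directly, so no C extension is needed for the constant. The helper names `thread_cpu_seconds`, `start_request` and `finish_request` are hypothetical.

```python
import resource
import time

# On Linux, Python 3 exposes resource.RUSAGE_THREAD. The fallback value 1 is
# the constant quoted in this report and is an assumption valid only on Linux.
RUSAGE_THREAD = getattr(resource, "RUSAGE_THREAD", 1)


def thread_cpu_seconds():
    """CPU time (user + system) consumed so far by the calling thread."""
    usage = resource.getrusage(RUSAGE_THREAD)
    return usage.ru_utime + usage.ru_stime


def start_request():
    """Record wall-clock and thread CPU time at the start of a request."""
    return time.monotonic(), thread_cpu_seconds()


def finish_request(started):
    """Return elapsed wall-clock and CPU seconds since start_request()."""
    wall_start, cpu_start = started
    return {
        "wall_seconds": time.monotonic() - wall_start,
        "cpu_seconds": thread_cpu_seconds() - cpu_start,
    }
```

Both calls must be made from the thread that serves the request. A large gap between `wall_seconds` and `cpu_seconds` would point at external influences (I/O waits, lock contention, a loaded host) rather than inefficient code.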
Changed in launchpad: | |
importance: | Undecided → High |
Changed in launchpad: | |
importance: | Undecided → High |
status: | New → Triaged |
summary: |
- oops report should record information about the running process
+ oops report should record information about the running environment |
description: | updated |
description: | updated |
tags: |
added: oops-infrastructure removed: infrastructure oops-tools |
description: | updated |
summary: |
- oops report should record information about the running environment
+ oopses do not gather environmental data(load, thread-cpu-time, ...) |
Changed in python-oops: | |
status: | New → Triaged |
importance: | Undecided → High |
affects: | oops-tools → python-oops-tools |
AIUI Francis' team is in the best position to actually store this information, and he has already put work into capturing this data in data structures that we can output in the OOPS dump.