high load during startup, goes away when controllers get restarted
Bug #1727973 reported by Junien Fridrick
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Fix Released | High | Tim Penhey |
2.2 | Fix Released | High | Tim Penhey |
Bug Description
Hi,
Multiple times over the last few months, we've experienced issues when the juju controllers start up.
The symptom is that following a full restart of the controllers (HA with 3), there is very high load on the mongodb primary (we're talking a load average of ~180 here), juju status is extremely slow, etc.
This load appears to remain high until we restart all the controllers and/or switch the mongodb primary.
We're going to track occurrences of this behaviour here until it is resolved.
We're currently running juju 2.2.4.
Thanks
tags: added: canonical-is
Changed in juju:
  milestone: none → 2.3-beta3
  importance: Undecided → High
  status: New → Triaged
Changed in juju:
  assignee: nobody → John A Meinel (jameinel)
Changed in juju:
  assignee: John A Meinel (jameinel) → nobody
  milestone: 2.3-beta3 → none
Changed in juju:
  status: In Progress → Fix Committed
Changed in juju:
  status: Fix Committed → Fix Released
Occurrence of 2017-10-26:
Startup (start of high load): around 12:30 UTC
Restart (end of high load): around 14:00 UTC
Following a full mgopurge to fix bug 1727679, the controllers were restarted. The load stayed at ~180 for an hour, after which I tried kill -STOP / kill -CONT on the jujud processes, which just made them restart, in turn triggering a mongodb primary failover. The load went back to normal after this failover.
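For reference, instead of kill -STOP / kill -CONT on jujud, the same primary failover can be forced by asking the current primary to step down. This is only a minimal sketch using pymongo against the controller's mongodb; the host, credentials and step-down window below are placeholders, not values from this environment:

```python
# Minimal sketch: force a mongodb primary step-down rather than
# bouncing jujud. Host, credentials and the 60s step-down window
# are placeholders / assumptions.
from pymongo import MongoClient
from pymongo.errors import AutoReconnect, ConnectionFailure

client = MongoClient("mongodb://<controller-ip>:37017/?replicaSet=juju",
                     username="admin", password="<password>",
                     authSource="admin", ssl=True)

# Show which member is currently primary.
status = client.admin.command("replSetGetStatus")
for member in status["members"]:
    print(member["name"], member["stateStr"])

# Ask the primary to step down for 60 seconds; the connection is
# usually dropped when it does, hence the broad exception handling.
try:
    client.admin.command("replSetStepDown", 60)
except (AutoReconnect, ConnectionFailure):
    pass
```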
Observations:
* very high activity in txns.log (as expected)
* no Juju internal metrics were captured, sadly. We need to understand why (a timeout when scraping them? endpoint unavailable? something to investigate if the problem reappears — see the monitoring sketch after this list)
* during the "high load" period:
- the number of open mongodb cursors stayed very high, at ~500. We need to understand why.
- mongodb repl_apply_batches was at 0, which again we need to understand.
- disk throughput was fairly low
- very high write lock acquire time (~1.5 s) for about 10 minutes, followed by another 5-minute spike at ~600 ms 10 minutes later.
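If this reappears, here is a rough sketch of the kind of polling I'd want running to capture the cursor, lock and repl-apply numbers straight from serverStatus. The connection details and exact metric paths are assumptions and will need adapting to the controller's MongoDB version / storage engine:

```python
# Rough sketch: periodically sample serverStatus for the metrics that
# looked suspicious (open cursors, write lock acquire time, repl apply
# batches). Connection details and metric paths are assumptions.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://<controller-ip>:37017/?replicaSet=juju",
                     username="admin", password="<password>",
                     authSource="admin", ssl=True)

while True:
    s = client.admin.command("serverStatus")
    open_cursors = s.get("metrics", {}).get("cursor", {}).get("open", {}).get("total")
    # timeAcquiringMicros is cumulative, so successive samples need to be diffed.
    global_lock = s.get("locks", {}).get("Global", {}).get("timeAcquiringMicros", {})
    repl_batches = s.get("metrics", {}).get("repl", {}).get("apply", {}).get("batches", {}).get("num")
    print(time.strftime("%H:%M:%S"),
          "open cursors:", open_cursors,
          "write lock acquire (us):", global_lock.get("W"),
          "repl apply batches:", repl_batches)
    time.sleep(30)
```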