high load during startup, goes away when controllers get restarted

Bug #1727973 reported by Junien Fridrick
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
High
Tim Penhey
2.2
Fix Released
High
Tim Penhey

Bug Description

Hi,

Multiple times over the last few months, we've experienced issues with juju controllers startup.

The symptoms are that following a full restart of the controllers (HA with 3), there is a very high load on the mongodb primary (we're talking ~180 here). juju status is super slow, etc.

This load appears to remain high until we restart all the controllers and/or switch the mongodb primary.

We're going to track occurrences of this behaviour here until it is resolved.

We're currently running juju 2.2.4.

Thanks

Tags: canonical-is
Junien Fridrick (axino)
tags: added: canonical-is
Revision history for this message
Junien Fridrick (axino) wrote :

Occurrence of 2017-10-26 :

Startup (start of high load) : around 12:30 UTC
Restart (end of high load) : around 14:00 UTC

Following a full mgopurge to fix bug 1727679, the controllers are restarted. The load stays at ~180 for 1h, after what I try to kill -STOP / kill -CONT the jujuds, which just makes them restart, in turn making a mongodb primary failover. The load gets back to normal after this failover.

Observations :

* very high activity in txns.log (as expected)
* no Juju internal metric, sadly. We need to understand why (timeout when trying to fetch ? unavailable ? something to investigate if the problem re-appears)
* during the "high load" time :
 - mongodb opened cursors stayed up super high, at 500. We need to understand why.
 - mongodb repl_apply_batches was at 0. Once again something that we need to understand.
 - disk throughput kind of low
 - very high write lock acquire time (~1.5s) for about 10 min. Then another 5min "spike" at ~600 ms 10 min later.

Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.3-beta3
importance: Undecided → High
status: New → Triaged
Tim Penhey (thumper)
Changed in juju:
assignee: nobody → John A Meinel (jameinel)
Tim Penhey (thumper)
Changed in juju:
assignee: John A Meinel (jameinel) → nobody
milestone: 2.3-beta3 → none
Revision history for this message
Tim Penhey (thumper) wrote :

This one is actually really hard to reproduce. Can I request that next time an upgrade is needed we can get a Juju dev to watch while it is happening?

Ideally we'd want to get mongotop running during the upgrade to see what is happening there.

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1727973] Re: high load during startup, goes away when controllers get restarted

I do believe that this one showed unusually high activity in txns.log
during the load time. And that you could see a significant difference in
static load on the system once the controller was restarted one more time.
(Something like a static 500 active cursors dropping down to 10-20 that
only spiked up to around 100 or so.)

On Thu, Nov 9, 2017 at 8:01 AM, Tim Penhey <email address hidden> wrote:

> This one is actually really hard to reproduce. Can I request that next
> time an upgrade is needed we can get a Juju dev to watch while it is
> happening?
>
> Ideally we'd want to get mongotop running during the upgrade to see what
> is happening there.
>
> --
> You received this bug notification because you are subscribed to juju.
> Matching subscriptions: juju bugs
> https://bugs.launchpad.net/bugs/1727973
>
> Title:
> high load during startup, goes away when controllers get restarted
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/juju/+bug/1727973/+subscriptions
>

Revision history for this message
Tim Penhey (thumper) wrote :

I have a feeling that this is closely related to bug 1733708, in that we very inefficiently have watchers looking at the txns.log collection iterating transactions for every model in each HA controller API server.

Revision history for this message
Tim Penhey (thumper) wrote :

How many models are in the controller?

Revision history for this message
Paul Collins (pjdc) wrote :

176

Revision history for this message
Tim Penhey (thumper) wrote :

Yep, this will be the same problem as the bug referenced above.

Changed in juju:
status: Triaged → In Progress
assignee: nobody → Tim Penhey (thumper)
milestone: none → 2.3.2
Revision history for this message
Paul Gear (paulgear) wrote :

I believe this to still be happening on 2.2.8 when controllers are restarted after an mgopurge. I'll continue to monitor on our main controller cluster.

Revision history for this message
Paul Gear (paulgear) wrote :

I've just performed a full stop, mgopurge, and start on our 3-node 2.2.8 controller and was unable to replicate the issue.

Tim Penhey (thumper)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.