processModelRemovals stuck in a loop

Bug #1748781 reported by John A Meinel
Affects: Canonical Juju
Status: Incomplete
Importance: Low
Assigned to: John A Meinel

Bug Description

I was testing out the patch to fix mongo SASL restarting by using 'tcpkill' to break mongo connections, to see whether the patch actually drops the CPU load of Juju.
While doing that (with the patch), I saw that it still pegged 1 CPU at 100%.

I used introspection to try and figure out what was going on, and it seems to say that we're spending a *lot* of CPU in a select loop as part of processModelRemovals (which has changed in 2.4).

Showing top 10 nodes out of 68
      flat flat% sum% cum cum%
    35.18s 27.71% 27.71% 79.31s 62.47% runtime.selectgo /snap/go/1404/src/runtime/select.go
    15.19s 11.96% 39.67% 15.19s 11.96% runtime.lock /snap/go/1404/src/runtime/lock_futex.go
    13.89s 10.94% 50.61% 13.89s 10.94% runtime.unlock /snap/go/1404/src/runtime/lock_futex.go
    11.35s 8.94% 59.55% 13.17s 10.37% runtime.selectrecv /snap/go/1404/src/runtime/select.go
     7.20s 5.67% 65.23% 7.28s 5.73% sync.(*Mutex).Unlock /snap/go/1404/src/sync/mutex.go
     6.23s 4.91% 70.13% 6.26s 4.93% sync.(*Mutex).Lock /snap/go/1404/src/sync/mutex.go
     5.25s 4.14% 74.27% 19.14s 15.08% runtime.selunlock /snap/go/1404/src/runtime/select.go
     4.62s 3.64% 77.91% 19.78s 15.58% runtime.sellock /snap/go/1404/src/runtime/select.go
     3.55s 2.80% 80.70% 119s 93.73% github.com/juju/juju/apiserver.(*Server).processModelRemovals /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go
     2.85s 2.24% 82.95% 2.85s 2.24% runtime.selectgo /snap/go/1404/src/runtime/stubs.go

vs the same profile sorted by cumulative time:

      flat flat% sum% cum cum%
     3.55s 2.80% 2.80% 119s 93.73% github.com/juju/juju/apiserver.(*Server).processModelRemovals /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go
         0 0% 2.80% 119s 93.73% github.com/juju/juju/apiserver.(*Server).run.func5 /home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go
    35.18s 27.71% 30.51% 79.31s 62.47% runtime.selectgo /snap/go/1404/src/runtime/select.go
     4.62s 3.64% 34.14% 19.78s 15.58% runtime.sellock /snap/go/1404/src/runtime/select.go
     5.25s 4.14% 38.28% 19.14s 15.08% runtime.selunlock /snap/go/1404/src/runtime/select.go
     1.47s 1.16% 39.44% 16.55s 13.04% gopkg.in/tomb%2ev1.(*Tomb).Dying /home/jameinel/dev/go/src/gopkg.in/tomb.v1/tomb.go
    15.19s 11.96% 51.40% 15.19s 11.96% runtime.lock /snap/go/1404/src/runtime/lock_futex.go
     1.69s 1.33% 52.73% 15.08s 11.88% gopkg.in/tomb%2ev1.(*Tomb).init /home/jameinel/dev/go/src/gopkg.in/tomb.v1/tomb.go
    13.89s 10.94% 63.67% 13.89s 10.94% runtime.unlock /snap/go/1404/src/runtime/lock_futex.go
    11.35s 8.94% 72.61% 13.17s 10.37% runtime.selectrecv /snap/go/1404/src/runtime/select.go

And I'll also include the SVG.

I don't know what the issue is. Is something waking the loop up over and over again? The actual code is quite small, so I'm not sure what is going on.
I'll try to add some debugging and see if I can trigger it again.

It might just be that I killed exactly the right underlying connection, and that caused it to go into a tailspin.
I wonder if something is causing the Changes() channel to just return the full models list over and over again (but everything is still alive, so the loop just spins).
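
To make the discussion concrete, the loop in question is roughly of the shape sketched below. This is a simplified, self-contained sketch with made-up names (modelWatcher, processRemovals, and so on), not the actual apiserver code; the point is only that the body is a bare select over a dying channel and a watcher's Changes() channel, so anything that keeps the Changes() case permanently ready turns it into a busy loop.

// Sketch only: hypothetical names, not the real apiserver types.
package main

import (
    "fmt"
    "time"
)

type modelWatcher struct {
    changes chan []string // batches of model UUIDs
}

func (w *modelWatcher) Changes() <-chan []string { return w.changes }

type server struct {
    dying chan struct{} // stand-in for the tomb's Dying() channel
}

func (srv *server) processRemovals(w *modelWatcher) error {
    for {
        select {
        case <-srv.dying:
            return nil
        case modelUUIDs := <-w.Changes():
            // If this case fires continuously -- because the channel keeps
            // delivering the full model list, or has been closed -- the loop
            // never blocks, and all the time shows up in runtime.selectgo.
            for _, uuid := range modelUUIDs {
                fmt.Println("would check and maybe remove model", uuid)
            }
        }
    }
}

func main() {
    w := &modelWatcher{changes: make(chan []string, 1)}
    srv := &server{dying: make(chan struct{})}
    w.changes <- []string{"model-a", "model-b"}
    time.AfterFunc(50*time.Millisecond, func() { close(srv.dying) }) // shut the loop down
    _ = srv.processRemovals(w)
}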

Still investigating.

Revision history for this message
John A Meinel (jameinel) wrote :

This is the SVG of CPU time spent.

Revision history for this message
John A Meinel (jameinel) wrote :

I wasn't able to reproduce this; I think I had to get lucky with which connection was killed, and even then I'm not sure what was going on.

Changed in juju:
status: Triaged → Incomplete
Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1748781] Re: processModelRemovals stuck in a loop

So, looking through the code, one thing comes to mind. If the channel behind
Changes() were closed, then I think the loop would wake up but only see an
empty list, so it would issue no Get() requests, never hit an error, and
just loop calling Changes() over and over.
Theoretically, whatever closed Changes() should also be killing the tomb, so
the Dying side should trigger and shut the loop down. But if we missed that,
there is nothing in this loop that would notice an error.
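
To make that concrete, here is a small, self-contained Go program (illustrative only, not Juju code) showing why a receive from a closed channel never blocks, and how checking the ok flag lets a loop bail out with an error instead of spinning:

package main

import "fmt"

func main() {
    changes := make(chan []string)
    close(changes) // simulate the watcher's underlying channel being closed

    dying := make(chan struct{}) // stand-in for tomb.Dying(); never closed here

    // Unchecked receive: a closed channel is always ready, so this select
    // never blocks. Left to run forever this is a pegged CPU in a tight
    // select loop; here we stop after a million iterations.
    iterations := 0
    for iterations < 1000000 {
        select {
        case <-dying:
            return
        case models := <-changes:
            _ = models // nil slice every time: no Get() calls, no error ever seen
            iterations++
        }
    }
    fmt.Println("spun", iterations, "times without ever blocking")

    // Checked receive: the ok flag exposes the closed channel, so the loop
    // can return an error instead of spinning.
    if models, ok := <-changes; !ok {
        fmt.Println("changes channel closed; the real loop should return an error here")
    } else {
        fmt.Println("got models:", models)
    }
}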

My main concern is whether this is actually a general design failure, meaning
we need to change our loops from

select {
...
case val := <-Changes():
    ...
}

to

case val, ok := <-Changes():
    if !ok {
        return errors.New("changes channel closed")
    }

John
=:->

On Feb 11, 2018 23:30, "John A Meinel" <email address hidden> wrote:

> This is the SVG of cpu time spent.
>
>
> ** Attachment added: "profile002.svg"
> https://bugs.launchpad.net/juju/+bug/1748781/+attachment/5053182/+files/profile002.svg

Revision history for this message
John A Meinel (jameinel) wrote :

I was able to reproduce this today with Juju 2.3.4 while following the steps from here:
https://bugs.launchpad.net/juju/+bug/1747368/comments/2

(Essentially, it seems that having a connection to mongo die causes this loop to get stuck.)

Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: High → Low
tags: added: expirebugs-bot