processModelRemovals stuck in a loop
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Canonical Juju | Incomplete | Low | John A Meinel | |
Bug Description
I was testing out the patch to fix mongo SASL restarting by using 'tcpkill' to break mongo connections, to see whether the patch actually drops Juju's CPU load.
While doing that (with the patch), I saw that it still pegged 1 CPU at 100%.
I used introspection to try to figure out what was going on, and it seems that we're spending a *lot* of CPU in a select loop as part of processModelRemovals:
```
Showing top 10 nodes out of 68
      flat  flat%   sum%        cum   cum%
    35.18s 27.71% 27.71%     79.31s 62.47%  runtime.selectgo /snap/go/
    15.19s 11.96% 39.67%     15.19s 11.96%  runtime.lock /snap/go/
    13.89s 10.94% 50.61%     13.89s 10.94%  runtime.unlock /snap/go/
    11.35s  8.94% 59.55%     13.17s 10.37%  runtime.selectrecv /snap/go/
     7.20s  5.67% 65.23%      7.28s  5.73%  sync.(*
     6.23s  4.91% 70.13%      6.26s  4.93%  sync.(*Mutex).Lock /snap/go/
     5.25s  4.14% 74.27%     19.14s 15.08%  runtime.selunlock /snap/go/
     4.62s  3.64% 77.91%     19.78s 15.58%  runtime.sellock /snap/go/
     3.55s  2.80% 80.70%       119s 93.73%  github.
     2.85s  2.24% 82.95%      2.85s  2.24%  runtime.selectgo /snap/go/
```
versus the same profile sorted by cumulative time:
```
      flat  flat%   sum%        cum   cum%
     3.55s  2.80%  2.80%       119s 93.73%  github.
         0     0%  2.80%       119s 93.73%  github.
    35.18s 27.71% 30.51%     79.31s 62.47%  runtime.selectgo /snap/go/
     4.62s  3.64% 34.14%     19.78s 15.58%  runtime.sellock /snap/go/
     5.25s  4.14% 38.28%     19.14s 15.08%  runtime.selunlock /snap/go/
     1.47s  1.16% 39.44%     16.55s 13.04%  gopkg.in/
    15.19s 11.96% 51.40%     15.19s 11.96%  runtime.lock /snap/go/
     1.69s  1.33% 52.73%     15.08s 11.88%  gopkg.in/
    13.89s 10.94% 63.67%     13.89s 10.94%  runtime.unlock /snap/go/
    11.35s  8.94% 72.61%     13.17s 10.37%  runtime.selectrecv /snap/go/
```
And I'll also include the SVG.
I don't know what the issue is. Is something waking up the loop over and over again? The actual code is quite small, so I'm not sure what is going on.
I'll try to add some debugging and see if I can trigger it again.
It might just be that I killed exactly the underlying connection, and that causes it to go into a tailspin.
I wonder if something is causing the Changes() channel to return the full model list over and over again (only everything is still alive, so it just spins).
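To illustrate that hypothesis, here is a minimal sketch (not the actual Juju worker code; all names are made up) of how a select-based watcher loop can end up burning CPU inside runtime.selectgo: if the Changes() channel keeps re-delivering the same model list, or is closed when the underlying connection is killed, the select returns immediately on every iteration instead of blocking.

```go
// Hypothetical sketch of a spinning watcher loop; not Juju's processModelRemovals.
package main

import (
	"fmt"
	"time"
)

func watchLoop(done <-chan struct{}, changes <-chan []string, process func([]string)) {
	for {
		select {
		case <-done:
			return
		case models, ok := <-changes:
			if !ok {
				// Without this check a closed channel makes this case fire on
				// every iteration with a nil slice, and the loop spins at 100% CPU.
				return
			}
			process(models)
		}
	}
}

func main() {
	done := make(chan struct{})
	changes := make(chan []string, 1)

	go watchLoop(done, changes, func(models []string) {
		fmt.Println("processing", models)
	})

	// Simulate the suspected behaviour: the watcher re-sends the full,
	// unchanged model list over and over, so the loop never blocks.
	go func() {
		for {
			changes <- []string{"model-a", "model-b"}
		}
	}()

	time.Sleep(100 * time.Millisecond)
	close(done)
}
```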
Still investigating.
This is the SVG of CPU time spent.