2.1: memory leak

Bug #1649719 reported by Jason Hobbs
This bug affects 7 people
Affects                  Status        Importance  Assigned to  Milestone
Canonical Juju           Fix Released  Critical    Tim Penhey
2.1                      Fix Released  Critical    Tim Penhey
Ubuntu on IBM z Systems  Fix Released  Undecided   Unassigned

Bug Description

After running 2.1-beta2 in OIL for about 30 hours, mongodb memory usage has grown to 11GB and jujud to 2.6GB (RSS). They both started off under 1GB and the increase has been fairly steady through that time.

At this rate we will run out of memory on the controller node within the next day or two, will start seeing failed juju actions, and will have to reboot.

Jason Hobbs (jason-hobbs) wrote :
Tim Penhey (thumper) wrote :

Hi Jason,

I'd like to get some heap profiles from this machine.

If you juju ssh -m controller 0 (assuming machine 0 is an API server) and install the package "socat", you can then run the following command from a terminal:

  juju-heap-profile

Capture the output of that in a file.

Wait an hour, then run it again. Attach both files please.
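
Putting that together, a minimal sketch of the collection steps (file names are only examples):

  # On a client with access to the controller model:
  juju ssh -m controller 0

  # Then, on the controller machine itself:
  sudo apt-get install -y socat
  juju-heap-profile > heap-profile-1.txt
  # ...wait roughly an hour...
  juju-heap-profile > heap-profile-2.txt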

Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.2.0
Jason Hobbs (jason-hobbs) wrote :

Here are the two profiles requested - about 3 hours apart.

Jason Hobbs (jason-hobbs) wrote :

Any updates? We're still seeing this in beta3...

Changed in juju:
milestone: 2.2.0 → 2.1.0
milestone: 2.1.0 → 2.1-rc1
importance: High → Critical
Changed in juju:
importance: Critical → High
milestone: 2.1-rc1 → 2.2.0
milestone: 2.2.0 → 2.1.0
Nicholas Skaggs (nskaggs) wrote :

Per conversation with Jason, this cannot be confirmed as a 2.1 specific regression. Therefore, I don't consider it a blocker for 2.1-rc1.

Larry Michel (lmic) wrote :

This is recreated with 2.1 beta3. I will attach more logs later.

Larry Michel (lmic) wrote :

I am collecting the juju-heap-profile info on the controller. I will attach logs later once data collection is completed.

Larry Michel (lmic) wrote :

I had a script to collect the juju-heap-profile every hour, but it died for some reason, so the profiles in the attached logs are 7 hours apart. The script to collect the meminfo did continue to run, so that shows the memory every hour up to that 7th hour.
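
For reference, a rough sketch of the kind of hourly collection loop used (paths and file names are illustrative, not the exact script):

  mkdir -p /home/ubuntu/profiles
  i=1
  while true; do
      juju-heap-profile > /home/ubuntu/profiles/heap-profile-$i.txt
      cat /proc/meminfo > /home/ubuntu/profiles/meminfo-$i.txt
      i=$((i + 1))
      sleep 3600
  done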

Tim Penhey (thumper) wrote :

I'm looking at Jason's heap profile captures now.

Looks like there is a leak related to state.Watcher.sync. Looking into that now.

Tim Penhey (thumper) wrote :

It appears that something, somewhere is getting notified of a bunch of change events, and is storing the Change values. This means that the values aren't getting garbage collected.

Investigations are continuing.

Larry Michel (lmic) wrote :

This is the juju-heap-profile and meminfo output with Juju 2.1 beta4, taken ~2.5 hours apart.

Tim Penhey (thumper) wrote :

As discussed, can we also grab the juju-goroutines when we get the heap-profile? Cheers.
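
If the collection is scripted like the hourly loop above, that is just one extra line per pass (path illustrative):

  juju-goroutines > /home/ubuntu/profiles/goroutines-$i.txt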

Tim Penhey (thumper) wrote :

I have another hypothesis, and have built a binary for the OIL lab to test. Please get in touch.

Ryan Beisner (1chb1n)
tags: added: s390x single-lpar uosci
Changed in ubuntu-z-systems:
status: New → Confirmed
Larry Michel (lmic) wrote :

@thumper,

Please see the attached logs (4 readings of juju-heap-profile and juju-goroutines taken at 1+ hour intervals). This is from running with the binary that you provided, which was installed on the controller.

Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: Confirmed → Triaged
Larry Michel (lmic) wrote :

Additional logs on that same node later in the day.

Tim Penhey (thumper)
Changed in juju:
assignee: nobody → Tim Penhey (thumper)
status: Triaged → In Progress
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: Triaged → In Progress
Tim Penhey (thumper) wrote :

Based on those goroutine profiles it looks like we are leaking a state instance somewhere.

Was anything happening on this controller between reading 4 and reading 6? Or was it mostly "stable"?

Tim Penhey (thumper) wrote :

I also need to know how many models there were during these times. I see the number of state instances go up, but that may be due to the nature of the work.

Tim Penhey (thumper) wrote :

Just found a leak in the metrics manager. Will post a new jujud for testing.

Tim Penhey (thumper) wrote :

New jujud is up in the same location as last time.

Anastasia (anastasia-macmood) wrote :
David Britton (dpb) wrote :

At least in our case, we are just running controller + default models. Larry probably has different numbers though.

Larry Michel (lmic) wrote :

@thumper, the sequence of events following the bootstrapping of the controller is to start as many CI pipelines as can be met by the physical hardware and environments available (the environments have the same names as the models). Hence, there's one model deployed per pipeline. For each model, there's a deploy phase, a cloud configuration phase and the various tests. When a pipeline completes, the model is destroyed and the physical resources are returned to an available pool, as is the environment (model name), which then becomes available for re-deployment. Lastly, note that the number of models may vary depending on the size of the bundle and the physical hardware available. In this particular case, the number of available CI environments is not a limitation, as we haven't approached the max of 16. For the previous data, it was deploying ~4-5 concurrent models on average. As per the IRC conversation, I'll include the juju models output.

Larry Michel (lmic) wrote :

@thumper here are logs taken at ~3 hours intervals with the new jujud.

Larry Michel (lmic) wrote :

@thumper our integration environment stalled yesterday afternoon due to an unrelated issue that got fixed this morning, about 4-5 hours ago. So iteration 5 may not be of great value since the builds had stopped, but I'm including it anyway. I've created a separate directory, output2, where I'm collecting today's data. I'll attach iterations 7, 8 and 9 later tonight in a tarball that has output2 only. For the time being, please have a look at iteration 6.

Larry Michel (lmic)
tags: added: cdo-qa-blocker
Larry Michel (lmic) wrote :

I was not able to collect the data overnight, but here's data from this morning, including logsink.log.

Larry Michel (lmic) wrote :

Attaching latest data with 2.1 beta5. The 2nd reading is after 6 hours and the last reading after 21 hours. I'll reboot the controllers shortly and get additional readings at more consistent intervals.

Tim Penhey (thumper) wrote :
Changed in juju:
milestone: 2.1.0 → 2.1-rc1
milestone: 2.1-rc1 → 2.2.0-alpha1
assignee: Tim Penhey (thumper) → nobody
status: In Progress → Triaged
Tim Penhey (thumper)
Changed in juju:
importance: High → Critical
assignee: nobody → Tim Penhey (thumper)
status: Triaged → In Progress
Tim Penhey (thumper) wrote :

Most recent patch: https://github.com/juju/juju/pull/6949

I'm going to work with Chris Lee on a long-running test he has that exhibits the same behaviour, to see if we can identify the state leaks.

Larry Michel (lmic) wrote :

Latest logs attached. Intervals of 6, 8 and 10 hours.

Tim Penhey (thumper) wrote :

Unfortunately the long running test we have doesn't exhibit the same behaviour as the OIL CI lab.

When 2.1-rc1 comes out, can I get the output from the following two commands added to the data collected on the controller machine-0?

  juju-statepool-report
  juju-statetracker-report

Hopefully this will shed some light on the leaking state objects, which is where I think the memory is going.
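
For each collection pass, something along these lines would do (the interval and file names are only illustrative):

  for i in 1 2 3; do
      juju-heap-profile        > heap-profile-$i.txt
      juju-goroutines          > goroutines-$i.txt
      juju-statepool-report    > statepool-$i.txt
      juju-statetracker-report > statetracker-$i.txt
      sleep 3600
  done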

Anastasia (anastasia-macmood) wrote :

I am re-targeting this to 2.1.0 milestone as we want to get a result from a run with 2.1-rc1.

Larry Michel (lmic) wrote :

@thumper, here are the requested logs for RC1 at 1 hour intervals (first 3).

Larry Michel (lmic) wrote : Re: Custom 2.1-rc2 for OIL CI

Please see the attached logs, Tim. I think it's the logsink.log and machine.log that are most important, and they're included. For the remaining logs, please note that I was expecting 5 runs but the script died after the 2nd run. So runs 1 and 2 are one hour apart, and the last run is some 3 hours later. The juju models output did get collected every hour though.
 logs.tar.gz
<https://drive.google.com/a/canonical.com/file/d/0BxsyAMSSDhWcZEJmUlZSekNOZ3M/view?usp=drive_web>

On Mon, Feb 13, 2017 at 7:25 PM, Larry Michel wrote:

> Hi Tim,
>
> I've installed the new jujud:
>
> root@juju-controller-vm:/var/lib/juju/tools/2.1-rc1-xenial-amd64# cp
> /home/ubuntu/jujud.rc1 jujud
> root@juju-controller-vm:/var/lib/juju/tools/2.1-rc1-xenial-amd64# ls -l
> total 241260
> -rw-r--r-- 1 root root 219 Feb 13 16:47 downloaded-tools.txt
> -rwxr-xr-x 1 root root 123483848 Feb 14 01:09 jujud
> -rwxrwxr-x 1 lxd uuidd 123556764 Feb 10 12:29 jujud.ORIG
> root@juju-controller-vm:/var/lib/juju/tools/2.1-rc1-xenial-amd64# sum *
> 38043 1 downloaded-tools.txt
> 00334 120590 jujud
> 16681 120661 jujud.ORIG
> root@juju-controller-vm:/var/lib/juju/tools/2.1-rc1-xenial-amd64# sum
> jujud*
> 00334 120590 jujud
> 16681 120661 jujud.ORIG
> root@juju-controller-vm:/var/lib/juju/tools/2.1-rc1-xenial-amd64# sync
> root@juju-controller-vm:/var/lib/juju/tools/2.1-rc1-xenial-amd64# sync
> root@juju-controller-vm:/var/lib/juju/tools/2.1-rc1-xenial-amd64# sudo
> reboot
> Connection to juju-controller-vm.oilstaging closed by remote host.
> Connection to juju-controller-vm.oilstaging closed.
> ubuntu@maas2-integration:~/lmic$ ssh juju-controller-vm.oilstaging
> ssh: connect to host juju-controller-vm.oilstaging port 22: Connection
> refused
> ubuntu@maas2-integration:~/lmic$ ssh juju-controller-vm.oilstaging
> Welcome to Ubuntu 16.04.2 LTS (GNU/Linux 4.4.0-62-generic x86_64)
>
> * Documentation: https://help.ubuntu.com
> * Management: https://landscape.canonical.com
> * Support: https://ubuntu.com/advantage
>
> Get cloud support with Ubuntu Advantage Cloud Guest:
> http://www.ubuntu.com/business/services/cloud
>
> 8 packages can be updated.
> 6 updates are security updates.
>
>
> Last login: Tue Feb 14 01:09:09 2017 from 10.244.192.10
> ubuntu@juju-controller-vm:~$ sum /var/lib/juju/tools/2.1-rc1-xe
> nial-amd64/juju*
> 00334 120590 /var/lib/juju/tools/2.1-rc1-xenial-amd64/jujud
> 16681 120661 /var/lib/juju/tools/2.1-rc1-xenial-amd64/jujud.ORIG
> ubuntu@juju-controller-vm:~$
>
> and this is initial state:
>
> ubuntu@juju-a6a70b-integration-juju2-1:~$ sudo su - jenkins
> jenkins@juju-a6a70b-integration-juju2-1:~$ juju models
> Controller: maas
>
> Model Cloud/Region Status Machines Cores Access Last
> connection
> controller maas available 1 8 admin just now
> default maas available 0 - admin 8 hours ago
>
> I've restarted the OIL integration queue and this was the start of the log:
>
> jenkins@juju-a6a70b-integration-juju2-1:~$ juju debug-log -m controller
> --include-module=juju.apiserver.logsink
> machine-0: 02:56:02 DEBUG juju.apiserver.logsink [0xc8209514a0] new
> logsink for 3c821461:machine-0 from 127....


Larry Michel (lmic) wrote :

Tim, here are the new logs with juju models --uuid. Note that the
jujumodels files are indexed starting at 1 and the controller logs start at
index 3. Timestamps should align.

On Tue, Feb 14, 2017 at 1:01 AM, Larry Michel <email address hidden>
wrote:

> Please see the attached logs Tim. I think it's the logsink.log and
> machine.log that are most important and they're included. For the remaining
> logs, please note that I was expecting 5 runs but script died after 2nd
> run. So, run 1 and 2 are one hour apart, and last run is some 3 hours
> later. The juju models did run every hour though. ​
> logs.tar.gz
> <https://drive.google.com/a/canonical.com/file/d/0BxsyAMSSDhWcMU83YjdURVJ3MUU/view?usp=drive_web>
> ​
>
>

Tim Penhey (thumper) wrote : Re: 2.1-beta2: memory leak

Thanks for this.

I have located the source of the leak, and I think it is an upstream library error.

We have considered changing our websocket library for a while, and this may well be a good reason. However we won't be changing the library this close to the 2.1.0 release.

I'll be looking at changing it in the 2.2 branch. It seems that the 2.2 release will be late March.

The leak at this stage appears to happen approximately 50% of the time we destroy a model, and the leak is relatively small.

Tim Penhey (thumper) wrote :

I'm going to mark the 2.1 series as fix committed as I don't think we'll be getting any more fixes in for 2.1 at this stage. But I'll leave the 2.2 series open and in progress.

Jason Hobbs (jason-hobbs) wrote :

I'm marking this back to 'Confirmed' for the 2.1 release because it has not been fixed. It may be a little improved, but we still have to reboot our controller every 4 days due to memory exhaustion, which is about where we've always been. I understand you may not be able to address it in this release, but it's inaccurate to call it fixed.

David Britton (dpb)
summary: - 2.1-beta2: memory leak
+ 2.1: memory leak
Tim Penhey (thumper) wrote :

I'm pretty sure I know what this is now. And it appears to be the same underlying issue in bug 1660087. The logsink endpoints are not noticing when the agents go away. The current websocket library (x/net/websocket) does not give access to the underlying ping/pong frames of the websocket protocol, whereas the gorilla/websocket library does.

We need some form of alive detection added to the logsink endpoint.
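
In the meantime, a rough way to see the symptom on a controller (only a heuristic, assuming the default API port 17070) is to watch the count of established API connections; a number that keeps climbing well past the number of agents suggests dead logsink connections are piling up:

  ss -tn state established '( sport = :17070 )' | tail -n +2 | wc -l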

Paul Gear (paulgear)
tags: added: canonical-is
Uros Jovanovic (uros-jovanovic) wrote :

Jason, do you still see the same amount of memory leaking in final 2.1 as stated in the bug: "After running 2.1-beta2 in OIL for about 30 hours, mongodb memory usage has grown to 11GB and jujud to 2.6GB (RSS). They both started off under 1GB and the increase has been fairly steady through that time."

Or is it at least a bit better now? What kind of memory usage are you seeing now after a 30h run?

Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1649719] Re: 2.1: memory leak

I think that it is a little better - I have a controller that's been up 40
hours, mongod is at 7.2GB and jujud is at about 1GB.

Jason


Anastasia (anastasia-macmood) wrote :

We've done everything we could for 2.1. I am marking this as Fix Released as 2.1 GA, aka 2.1.0, has been released. The remaining minor memory leak cannot be addressed in a point release, and as such no related work will be done for 2.1.x.

Further improvements to the memory footprint are being addressed in the next release, 2.2, and are related to the websocket library change as per comment #38.

Uros Jovanovic (uros-jovanovic) wrote :

Is the mongo growth correlated to the size of the logs stored over this time?

Tim Penhey (thumper)
Changed in juju:
milestone: 2.2-alpha1 → 2.2-alpha2
Chris Gregan (cgregan)
tags: added: gm-blocker
Chris Gregan (cgregan)
tags: added: rc-blocker
removed: gm-blocker
Tim Penhey (thumper) wrote :

Found and fixed the last (that we know of) underlying leak:
  https://github.com/juju/juju/pull/7134

Tim Penhey (thumper)
Changed in juju:
status: In Progress → Fix Committed
Alvaro Uria (aluria)
tags: added: canonical-bootstack
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: In Progress → Fix Released
Felipe Reyes (freyes)
tags: added: sts
Changed in juju:
status: Fix Committed → Fix Released