Failed worker can result in a large number of goroutines and open socket connections, and jujud is eventually killed by the OOM killer

Bug #1496750 reported by Andrew McDermott
This bug affects 3 people
Affects     Status         Importance  Assigned to       Milestone
juju-core   Fix Released   High        Andrew McDermott
1.24        Fix Released   High        Andrew Wilkins
1.25        Fix Released   High        Andrew McDermott

Bug Description

If the updatetools worker fails, jujud consumes a lot of memory and holds a large number of open file descriptors; the memory is taken up by the many goroutines that never exit.

To reproduce:

1) Bootstrap on EC2 using 1.26 - though this probably applies to earlier versions too.

My local build was based on:

  $ jujud version
  1.26-alpha1-trusty-amd64

  $ git checkout master
  Switched to branch 'master'
  Your branch is up-to-date with 'origin/master'.
  aim@spicy:~/juju
  $ git rev-parse HEAD
  3cd87e6264428ae00a44fcead47a9ff2bbd1ef34

On the bootstrap machine run:

  $ netstat -an | grep CLOSE_WAIT
  $ netstat -an | wc -l

and notice a) all the connections in CLOSE_WAIT and b) that the count goes up every 5-10 seconds.

Looking through /var/log/juju/machine-0.log I see the following message repeated every 5-10 seconds:

  2015-09-17 00:45:37 ERROR juju.worker runner.go:223 exited "toolsversionchecker": cannot update tools information: cannot get latest version: canot find available tools: no matching tools available

The following shell snippet counts the connections to s3-1-w.amazonaws.com, which is a CNAME for juju-dist.s3.amazonaws.com; these connections come from requests triggered by the toolsversionchecker worker.

  $ while :; do a=$(pgrep jujud); b=$(date +%T); n=$(sudo lsof -p $a |grep s3 | wc -l); echo "$b: pid=$a count=$n"; sleep 2; done

I left this running overnight; the count rose to over 6000 and memory usage exceeded 30% of the available memory (as reported by htop).

Sending SIGQUIT to the juju process and counting the number of active goroutines yields:

  $ grep goroutine ~/machine-0.log |wc -l
  4401

which accounts for the large memory usage.
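A plain grep for "goroutine" counts every line containing that word, so it can overcount slightly if the word also appears in ordinary log messages or stack frames. A stricter count matches only the stack header lines of the form "goroutine N [state]:". The sketch below is one way to do that; it assumes the SIGQUIT dump was written to the log unmodified, so each header starts at the beginning of a line, and the name countGoroutines is hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"regexp"
)

// goroutineHeader matches the first line of each stack in a Go runtime
// dump, for example "goroutine 42 [IO wait]:". The (?m) flag makes ^
// match at the start of every line.
var goroutineHeader = regexp.MustCompile(`(?m)^goroutine \d+ \[`)

// countGoroutines returns the number of goroutine stack headers in dump.
func countGoroutines(dump string) int {
	return len(goroutineHeader.FindAllStringIndex(dump, -1))
}

func main() {
	// Read the whole log from stdin; fine for a sketch, though a very
	// large log would be better processed line by line.
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(countGoroutines(string(data)))
}
```

Assuming the program is built as a binary named countgoroutines, it would be run as `countgoroutines < ~/machine-0.log`.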

Andrew McDermott (frobware) wrote :

I have a proposed fix that I am currently testing at:

  https://github.com/frobware/juju/tree/fix-http-response-memory-leak

Changed in juju-core:
importance: Undecided → High
status: New → Triaged
milestone: none → 1.26-alpha1
assignee: nobody → Andrew McDermott (frobware)
status: Triaged → In Progress
Changed in juju-core:
status: In Progress → Fix Committed
Andrew McDermott (frobware) wrote :
tags: added: cloud-installer landscape
JuanJo Ciarlante (jjo)
tags: added: canonical-bootstack
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released