Failed worker can result in large number of goroutines and open socket connections and eventually gets picked on by the OOM killer
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
juju-core | Fix Released | High | Andrew McDermott |
1.24 | Fix Released | High | Andrew Wilkins |
1.25 | Fix Released | High | Andrew McDermott |
Bug Description
If the updatetools worker fails, jujud consumes a lot of memory and holds a large number of open file descriptors; the memory is taken up by the many goroutines that never exit.
To reproduce:
1) Bootstrap on EC2 using 1.26 - though this probably applies to earlier versions too.
My local build was based on:
$ jujud version
1.26-
$ git checkout master
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.
aim@spicy:~/juju
$ git rev-parse HEAD
3cd87e6264428
On the bootstrap machine run:
$ netstat -an | grep CLOSE_WAIT
$ netstat -an | wc -l
and notice a) all the connections in CLOSE_WAIT and b) that the count goes up every 5-10 seconds.
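To make the growth easier to see than with a raw line count, the connections can be grouped by TCP state. The following is a minimal sketch, not part of the original report: the function name `count_tcp_states` is hypothetical, and it assumes the Linux net-tools `netstat` layout in which the state is the sixth column.

```shell
# Summarise TCP connections by state from `netstat -ant`-style input on stdin.
# Assumption: Linux net-tools output, where the state is the 6th column.
count_tcp_states() {
    awk '$1 ~ /^tcp/ && NF >= 6 { c[$6]++ }
         END { for (s in c) printf "%s %d\n", s, c[s] }' | sort
}

# Example polling loop (the 5 second interval is arbitrary):
#   while :; do date +%T; netstat -ant | count_tcp_states; sleep 5; done
```

A steadily rising CLOSE_WAIT figure in this output would correspond to the behaviour described above.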
Looking through /var/log/
2015-09-17 00:45:37 ERROR juju.worker runner.go:223 exited "toolsversionch
Running the following shell snippet counts the number of connections to s3-1-w.
$ while :; do a=$(pgrep jujud); b=$(date +%T); n=$(sudo lsof -p $a |grep s3 | wc -l); echo "$b: pid=$a count=$n"; sleep 2; done
I left this running overnight; the count rose to over 6000 and memory usage exceeded 30% of the available memory (as reported by htop).
Sending SIGQUIT to the juju process and counting the number of active goroutines yields:
$ grep goroutine ~/machine-0.log |wc -l
4401
which accounts for the large memory usage.
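Note that a plain `grep goroutine` counts every line containing the word, so the figure may be slightly inflated by unrelated log lines. As a rough sketch, the count can be restricted to the `goroutine N [state]:` header lines that the Go runtime prints in a stack dump, and grouped by state to show where the goroutines are parked. The function names below are hypothetical, and the sketch assumes the dump lines appear unprefixed at the start of a line in the log file.

```shell
# Count goroutine header lines ("goroutine 12 [IO wait, 5 minutes]:") only.
count_goroutines() {
    grep -c '^goroutine [0-9][0-9]* \[' "$1"
}

# Group goroutines by state (the text inside the brackets, before any comma),
# most common state first.
goroutine_states() {
    awk 'match($0, /^goroutine [0-9]+ \[[^],]*/) {
             s = substr($0, RSTART, RLENGTH)
             sub(/^goroutine [0-9]+ \[/, "", s)
             c[s]++
         }
         END { for (s in c) printf "%d %s\n", c[s], s }' "$1" | sort -rn
}

# Usage:
#   count_goroutines ~/machine-0.log
#   goroutine_states ~/machine-0.log
```

If most goroutines share a single state such as `IO wait`, that would be consistent with them being blocked on the leaked connections.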
Changed in juju-core:
importance: Undecided → High
status: New → Triaged
milestone: none → 1.26-alpha1
assignee: nobody → Andrew McDermott (frobware)
status: Triaged → In Progress
Changed in juju-core:
status: In Progress → Fix Committed
tags: added: cloud-installer landscape
tags: added: canonical-bootstack
Changed in juju-core:
status: Fix Committed → Fix Released
I have a proposed fix that I am currently testing at:
https://github.com/frobware/juju/tree/fix-http-response-memory-leak