"too many open files" during upgrade
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Canonical Juju |
Triaged
|
Low
|
Unassigned | ||
2.3 |
Won't Fix
|
High
|
Unassigned | ||
2.4 |
Won't Fix
|
High
|
Unassigned |
Bug Description
Hi,
We upgraded an HA controller (3 machines) from 2.2.6 to 2.2.8 today.
The upgrade got apparently blocked because a machine agent was reaching its open file descriptor limit (see logs at the bottom). These logs repeated until we restarted it.
The maxfd limit was 20k, and the opened files were connections to mongodb.
Stats for this controller :
applications: 923
machines: 558
models: 175
units: 2449
Given these numbers, I don't believe we should have 20k TCP connections ever opened to mongodb at once.
Once this machine agent got restarted, the upgrade got unblocked and proceeded without issue.
Thanks
2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms
2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 10ms
2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 20ms
2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 40ms
2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms
2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: TLS handshake error from <IP>:36216: EOF
2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: TLS handshake error from <IP>:50668: EOF
2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms
2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: TLS handshake error from <IP>:51260: EOF
2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms
2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms
2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 10ms
2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms
2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: TLS handshake error from <IP>:33982: EOF
2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms
2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 10ms
2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms
2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 10ms
2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms
2017-12-20 07:24:03 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 10ms
tags: | added: upgrade-juju |
Changed in juju: | |
milestone: | none → 2.4-beta1 |
Changed in juju: | |
milestone: | 2.4-beta1 → none |
Changed in juju: | |
status: | Confirmed → Triaged |
Indeed, if there are 20k file handles open to Mongo something very odd is going on internally. The mgo socket pool that is meant to keep a certain number of connections open is limited to 4000 connections.
I can imagine that there are ways to create >1 pool, but we *shouldn't* inside of a single controller.