agents see "too many open files" errors after many failed API attempts
Bug #1420057 reported by Menno Finlay-Smits
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
juju-core | Fix Released | High | Dave Cheney |
1.22 | Fix Released | Critical | Dave Cheney |
1.23 | Fix Released | Critical | Dave Cheney |
1.24 | Fix Released | Critical | Dave Cheney |
Bug Description
While investigating a customer OpenStack deployment managed by Juju, I noticed that many unit and machine agents were failing due to file handle exhaustion ("too many open files") after many failed connections to the (broken) Juju state servers. These agents weren't able to reconnect until they were manually restarted.
My guess is that a failed API connection attempt leaks at least one file handle (but this is just a guess at this stage). It looks like it took about 2 days of failed connection attempts before file handles were exhausted.
The issue was seen with Juju 1.20.9, but it is likely still present in more recent versions.
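For illustration only, here is a minimal sketch in Go of the kind of leak guessed at above; the names and structure are hypothetical and not taken from the Juju codebase. If a dial helper establishes the TCP connection but does not close it when a later step (such as the TLS handshake) fails, every retry leaves one descriptor behind:

```go
// Hypothetical dial helper; not taken from the Juju codebase.
package apiconn

import (
	"crypto/tls"
	"net"
	"time"
)

// dialAPI connects to a state server address and performs a TLS
// handshake. If the handshake fails, the raw TCP connection must be
// closed; otherwise each failed attempt leaves one file descriptor
// open, which matches the leak pattern described in the bug.
func dialAPI(addr string, cfg *tls.Config) (net.Conn, error) {
	raw, err := net.DialTimeout("tcp", addr, 10*time.Second)
	if err != nil {
		return nil, err // dial itself failed; no fd to clean up
	}
	conn := tls.Client(raw, cfg)
	if err := conn.Handshake(); err != nil {
		raw.Close() // omitting this Close would slowly exhaust fds over days of retries
		return nil, err
	}
	return conn, nil
}
```

In this sketch the whole fix is the single raw.Close() on the handshake-failure path; whether the real leak in 1.20.9 was in the API client dial path is exactly what the comments below question.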
Changed in juju-core: status: New → Triaged
Changed in juju-core: milestone: none → 1.21.3
Changed in juju-core: milestone: 1.21.3 → 1.23
no longer affects: juju-core/1.21
no longer affects: juju-core/1.22
Changed in juju-core: milestone: 1.23 → 1.24-alpha1
Changed in juju-core: milestone: 1.24-alpha1 → 1.24.0
Changed in juju-core: milestone: 1.24.0 → 1.25.0
tags: added: cpce
tags: removed: cpce
Changed in juju-core: assignee: nobody → Cheryl Jennings (cherylj)
Changed in juju-core: assignee: Cheryl Jennings (cherylj) → Dave Cheney (dave-cheney)
Changed in juju-core: status: Triaged → Fix Committed
Changed in juju-core: status: Fix Committed → Fix Released
Took a look at this one this afternoon. I tried to get the API client to leak sockets in a simple standalone main, to no avail. I'm not convinced it was the API client connections that leaked fds here.
It does seem suspicious to me, though, that there were a large number of prior mongo errors leading up to the fd exhaustion.
What was the mongo configuration on this state server like? Was it clustered?
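A minimal sketch of the kind of standalone check described in the comment above: repeatedly dial a broken endpoint and watch the process's open-fd count. The address is hypothetical, and counting descriptors via /proc/self/fd is Linux-specific; a steadily climbing count would point at a leak in the dial path.

```go
// Standalone repro sketch; address and fd-counting method are assumptions.
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"os"
	"time"
)

// openFDs returns the number of file descriptors the process currently
// holds (Linux only), or -1 if the count cannot be read.
func openFDs() int {
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		return -1
	}
	return len(entries)
}

func main() {
	addr := "203.0.113.1:17070" // hypothetical broken state server address
	cfg := &tls.Config{InsecureSkipVerify: true}

	for i := 1; i <= 500; i++ {
		if raw, err := net.DialTimeout("tcp", addr, 2*time.Second); err == nil {
			conn := tls.Client(raw, cfg)
			_ = conn.Handshake()
			conn.Close() // also closes the underlying TCP connection
		}
		if i%50 == 0 {
			// If this number climbs steadily, something in the dial path leaks.
			fmt.Printf("attempt %d: open fds = %d\n", i, openFDs())
		}
	}
}
```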