"too many open files" during upgrade

Bug #1739380 reported by Junien Fridrick
This bug affects 2 people
Affects         Status     Importance  Assigned to  Milestone
Canonical Juju  Triaged    Low         Unassigned
2.3             Won't Fix  High        Unassigned
2.4             Won't Fix  High        Unassigned

Bug Description

Hi,

We upgraded an HA controller (3 machines) from 2.2.6 to 2.2.8 today.

The upgrade apparently got blocked because a machine agent was hitting its open file descriptor limit (see logs at the bottom). These log messages repeated until we restarted the agent.

The max fd limit was 20k, and the open files were connections to mongodb.

Stats for this controller:
applications: 923
machines: 558
models: 175
units: 2449

Given these numbers, I don't believe we should ever have 20k TCP connections open to mongodb at once.

Once this machine agent was restarted, the upgrade unblocked and proceeded without issue.

Thanks

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 10ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 20ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 40ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: TLS handshake error from <IP>:36216: EOF

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: TLS handshake error from <IP>:50668: EOF

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: TLS handshake error from <IP>:51260: EOF

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 10ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: TLS handshake error from <IP>:33982: EOF

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 10ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 10ms

2017-12-20 07:24:02 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 5ms

2017-12-20 07:24:03 WARNING juju.apiserver log.go:168 http: Accept error: accept tcp [::]:17070: accept4: too many open files; retrying in 10ms

Revision history for this message
John A Meinel (jameinel) wrote :

Indeed, if there are 20k file handles open to Mongo, something very odd is going on internally. The mgo socket pool, which is meant to keep a certain number of connections open, is limited to 4000 connections.

I can imagine that there are ways to create more than one pool, but we *shouldn't* be doing that inside a single controller.
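
For reference, a minimal sketch (assumed, not the actual Juju dial code) of how an mgo session's pool limit caps concurrent sockets to MongoDB; independently dialled sessions each carry their own pool, which is one way the cap could effectively be exceeded:

package mongopool

import (
    "time"

    mgo "gopkg.in/mgo.v2"
)

// dialMongo is a hypothetical helper, not the real Juju code. It shows how a
// single session's pool limit bounds concurrent sockets to MongoDB.
func dialMongo(addr string) (*mgo.Session, error) {
    session, err := mgo.DialWithTimeout(addr, 30*time.Second)
    if err != nil {
        return nil, err
    }
    // Cap the number of sockets per server held by this session's pool.
    // Session.Copy()/Clone() share this pool; dialling again creates a
    // separate pool with its own, independent limit.
    session.SetPoolLimit(4000)
    return session, nil
}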

Changed in juju:
importance: Undecided → High
status: New → Triaged
Tim Penhey (thumper)
tags: added: upgrade-juju
Revision history for this message
Tim Penhey (thumper) wrote :

Did a quick search of the state package on the 2.2 branch. The following functions aren't closing the iterator:

state/clouds.go:

func (st *State) Clouds()

state/cloudcredentials.go

func (st *State) CloudCredentials(user names.UserTag, cloudName string)

state/lease/client.go

func (client *client) readEntries // doesn't close iterator if leaseDoc.entry() returns an error

state/prune.go

The deleteInBatches function doesn't close the iterator, nor does either of its call sites.

state/upgrades.go

func updateLegacyLXDCredentialsOps(...)
func UpgradeNoProxyDefaults(st *State) error
func RemoveNilValueApplicationSettings(st *State) error
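
For reference, a minimal sketch of the safe pattern those functions need; cloudDoc, the validation step, and the collection handle are illustrative stand-ins, not the real Juju types:

package stateiter

import (
    "errors"

    mgo "gopkg.in/mgo.v2"
)

// cloudDoc is an illustrative document type, not the real Juju struct.
type cloudDoc struct {
    Name string `bson:"name"`
}

// allClouds sketches the safe pattern: every exit path closes the iterator,
// so its mongo socket goes back to the pool instead of leaking a file handle.
func allClouds(coll *mgo.Collection) ([]cloudDoc, error) {
    iter := coll.Find(nil).Iter()
    defer iter.Close() // closed even on the early error return below

    var docs []cloudDoc
    var doc cloudDoc
    for iter.Next(&doc) {
        if doc.Name == "" { // stand-in for per-document processing that can fail
            return nil, errors.New("cloud document missing name")
        }
        docs = append(docs, doc)
    }
    if err := iter.Err(); err != nil {
        return nil, err
    }
    return docs, nil
}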

Revision history for this message
Tim Penhey (thumper) wrote :

I'm guessing, quite wildly I must add, that during the upgrade process we get the credentials for every model for potential model upgrades.

However, all of these are worrying and will cause problems.

The pruner not closing its iterator will also leak file handles over time.

Revision history for this message
Andrew Wilkins (axwalk) wrote :
Tim Penhey (thumper)
Changed in juju:
milestone: none → 2.4-beta1
Revision history for this message
Tim Penhey (thumper) wrote :

Unfortunately this also happened with 2.2.9, so it wasn't the iterators; it's something else.

Revision history for this message
John A Meinel (jameinel) wrote :

For the file handle issues, I dug through the log file from bug #1742604.

The controller was happily started here:
2018-01-11 02:25:53 INFO juju.cmd supercommand.go:63 running jujud [2.2.9 gc go1.8]

Things generally seem happy, with model workers coming up and connecting correctly.

Up until we get to here:

2018-01-11 02:26:13 INFO juju.apiserver.common password.go:105 setting password for "machine-1"
2018-01-11 02:26:19 INFO juju.apiserver.connection request_notifier.go:80 agent login: machine-1 for dd562d7d-512e-4241-83a1-3a64cd84439b
2018-01-11 02:26:20 WARNING juju.mongo open.go:165 TLS handshake failed: read tcp 10.25.2.111:46598->10.25.2.109:37017: read: connection reset by peer
2018-01-11 02:26:20 WARNING juju.mongo open.go:165 TLS handshake failed: read tcp 10.25.2.111:46592->10.25.2.109:37017: read: connection reset by peer
2018-01-11 02:26:20 INFO juju.state.pool.txnwatcher txnwatcher.go:135 watcher loop failed: watcher iteration error: Closed explicitly

The TXN watcher suddenly gets a TCP read failure trying to read from mongo, which causes its connection to be reset and flagged as "Closed explicitly".

Now, I don't know if that is mongo closing the connection, something we did on our end, or something else.

But after that, we get a *flurry* of:
2018-01-11 02:26:20 INFO juju.worker runner.go:483 stopped "txnlog", err: Closed explicitly
2018-01-11 02:26:20 ERROR juju.worker runner.go:392 exited "txnlog": Closed explicitly
2018-01-11 02:26:20 INFO juju.worker runner.go:467 restarting "txnlog" in 1s
2018-01-11 02:26:20 INFO juju.worker runner.go:483 stopped "leadership", err: Closed explicitly
2018-01-11 02:26:20 ERROR juju.worker runner.go:392 exited "leadership": Closed explicitly
2018-01-11 02:26:20 INFO juju.worker runner.go:467 restarting "leadership" in 1s
2018-01-11 02:26:20 INFO juju.worker runner.go:483 stopped "leadership", err: Closed explicitly
2018-01-11 02:26:20 ERROR juju.worker runner.go:392 exited "leadership": Closed explicitly
2018-01-11 02:26:20 INFO juju.worker runner.go:467 restarting "leadership" in 1s
2018-01-11 02:26:20 INFO juju.state.presence presence.go:186 watcher loop failed: Closed explicitly
2018-01-11 02:26:20 INFO juju.worker runner.go:483 stopped "presence", err: Closed explicitly
2018-01-11 02:26:20 ERROR juju.worker runner.go:392 exited "presence": Closed explicitly
2018-01-11 02:26:20 INFO juju.worker runner.go:467 restarting "presence" in 1s

Which then proceeds onward until we get to:
2018-01-11 02:26:21 INFO juju.worker runner.go:477 start "txnlog"
2018-01-11 02:26:24 WARNING juju.mongo open.go:165 TLS handshake failed: read tcp 10.25.2.111:38931->10.25.2.109:37017: read: connection reset by peer
2018-01-11 02:26:24 WARNING juju.mongo open.go:165 TLS handshake failed: read tcp 10.25.2.111:38909->10.25.2.109:37017: read: connection reset by peer
...
and
2018-01-11 02:26:25 WARNING juju.mongo open.go:159 mongodb connection failed, will retry: dial tcp 10.25.2.109:37017: i/o timeout
2018-01-11 02:26:25 WARNING juju.mongo open.go:159 mongodb connection failed, will retry: dial tcp 10.25.2.109:37017: i/o timeout
2018-01-11 02:26:25 WARNING juju.mongo open.go:159 mongodb connection failed...


Revision history for this message
John A Meinel (jameinel) wrote :

As a summary comment, my first guess is that if we get an error during TLS negotiation, we are failing to close the underlying socket.

Revision history for this message
John A Meinel (jameinel) wrote :

Looking at the code to dial Mongo, I think it is correct:
  c, err := net.DialTimeout("tcp", addr, opts.Timeout)
  if err != nil {
      logger.Warningf("mongodb connection failed, will retry: %v", err)
      return nil, err
  }
  if tlsConfig != nil {
      cc := tls.Client(c, tlsConfig)
      if err := cc.Handshake(); err != nil {
          logger.Warningf("TLS handshake failed: %v", err)
          if err := c.Close(); err != nil {
              logger.Warningf("failed to close connection: %v", err)
          }
          return nil, err
      }
      c = cc
  }

The only question is whether you would ever need to close the connection if DialTimeout returns an error. I would assume that 'c' is nil at that point.
And the error we're seeing in the log file is:
2018-01-11 02:26:20 WARNING juju.mongo open.go:165 TLS handshake failed: read tcp 10.25.2.111:46598->10.25.2.109:37017: read: connection reset by peer

"handshake failed"

looks like it should be closing the file handle correctly.

The other error that we are seeing, though, is "TLS handshake error" coming from the http package, which is more the API server layer than the Mongo layer.

I haven't tracked down where those are actually coming from. What is really weird is that the message claims to come from "log.go line 168", but we don't have a 'log.go' file in the Juju code base.

I do see several places that are creating an "apiserver" logger:
./juju/apiserver/apiserver.go:48:var logger = loggo.GetLogger("juju.apiserver")
./juju/cmd/jujud/agent/machine.go:1509: logger := loggo.GetLogger("juju.apiserver")

And the issue with the latter one is that it is part of the observer factories, and thus is likely to be passed to something else.

I haven't found any obvious log.go files that are using a logger object on line 168.
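
One possible explanation, offered as an assumption rather than something confirmed in the Juju source: net/http emits its "Accept error" and "TLS handshake error" messages through http.Server.ErrorLog, and if that *log.Logger writes into loggo, the caller loggo records is the standard library's log.go rather than a Juju file. A hypothetical sketch of such wiring:

package apiserverlog

import (
    "log"
    "net/http"

    "github.com/juju/loggo"
)

// loggoWriter forwards anything written to it into a loggo logger. Because
// http.Server writes through a *log.Logger, the caller that loggo records is
// the standard library's log.go, not a Juju source file.
type loggoWriter struct {
    logger loggo.Logger
}

func (w *loggoWriter) Write(p []byte) (int, error) {
    w.logger.Warningf("%s", p)
    return len(p), nil
}

// newServer shows the hypothetical wiring; ":17070" matches the API server
// port seen in the warnings above.
func newServer() *http.Server {
    return &http.Server{
        Addr:     ":17070",
        ErrorLog: log.New(&loggoWriter{logger: loggo.GetLogger("juju.apiserver")}, "", 0),
    }
}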

Changed in juju:
milestone: 2.4-beta1 → none
Revision history for this message
Tim Penhey (thumper) wrote :

We haven't heard of this problem for a while; it is possible that it has been fixed inadvertently.

no longer affects: juju/2.2
Changed in juju:
status: Triaged → Incomplete
Revision history for this message
Haw Loeung (hloeung) wrote :

Seeing this with a 2.3.8 to 2.4.3 upgrade: https://pastebin.canonical.com/p/nDJsp6gt66/

Changed in juju:
status: Incomplete → Confirmed
Changed in juju:
status: Confirmed → Triaged
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Marking as Won't Fix for the 2.3 and 2.4 series since we are not planning to make any further releases in those series at this stage.

Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: High → Low
tags: added: expirebugs-bot