[MultiKARL] ConnectionError: Error 24 connecting multikarl01.gocept.net:6379. Too many open files.

Bug #1084603 reported by Alexander Bittner
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
KARL3
Fix Released
Medium
Nat Katin-Borland

Bug Description

Hello,

about half an hour ago, we experienced a problem with our MultiKARL (Ariadne, Oxfam, Privacy) installation which made the application unavailable. The timeline:

* Nagios alerted us that Ariadne, Oxfam, and Privacy were not available any more as well as that the KARL process itself was not responding to HTTP requests any more
* supervisorctl showed all processes as running (karl, search, mailin, mailout)
* However, we've seen the following two tracebacks when tailing the log for KARL:

Traceback (most recent call last):
  File "/srv/multikarl/production/12/eggs/Paste-1.7.5.1-py2.6.egg/paste/httpserver.py", line 1068, in process_request_in_thread
    self.finish_request(request, client_address)
  File "/usr/lib/python2.6/SocketServer.py", line 322, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python2.6/SocketServer.py", line 617, in __init__
    self.handle()
  File "/srv/multikarl/production/12/eggs/Paste-1.7.5.1-py2.6.egg/paste/httpserver.py", line 442, in handle
    BaseHTTPRequestHandler.handle(self)
  File "/usr/lib/python2.6/BaseHTTPServer.py", line 329, in handle
    self.handle_one_request()
  File "/srv/multikarl/production/12/eggs/Paste-1.7.5.1-py2.6.egg/paste/httpserver.py", line 437, in handle_one_request
    self.wsgi_execute()
  File "/srv/multikarl/production/12/eggs/Paste-1.7.5.1-py2.6.egg/paste/httpserver.py", line 287, in wsgi_execute
    self.wsgi_start_response)
  File "/srv/multikarl/production/12/eggs/guillotine-0.1-py2.6.egg/guillotine/__init__.py", line 10, in __call__
    return self.app(environ, start_response)
  File "/srv/multikarl/production/12/eggs/repoze.vhm-0.13-py2.6.egg/repoze/vhm/middleware.py", line 106, in __call__
    return self.application(environ, start_response)
  File "/srv/multikarl/production/12/eggs/repoze.browserid-0.3-py2.6.egg/repoze/browserid/middleware.py", line 127, in __call__
    return self.app(environ, start_response)
  File "/srv/multikarl/production/12/eggs/pyramid-1.2.1-py2.6.egg/pyramid/router.py", line 176, in __call__
    response = self.handle_request(request)
  File "/srv/multikarl/production/12/eggs/pyramid-1.2.1-py2.6.egg/pyramid/tweens.py", line 17, in excview_tween
    response = handler(request)
  File "/srv/multikarl/production/12/eggs/pyramid-1.2.1-py2.6.egg/pyramid/router.py", line 153, in handle_request
    response = view_callable(context, request)
  File "/srv/multikarl/production/12/eggs/pyramid-1.2.1-py2.6.egg/pyramid/config/views.py", line 319, in viewresult_to_response
    result = view(context, request)
  File "/srv/multikarl/production/12/eggs/pyramid-1.2.1-py2.6.egg/pyramid/config/views.py", line 402, in _requestonly_view
    response = view(request)

  File "/srv/multikarl/production/12/eggs/karlserve-1.18-py2.6.egg/karlserve/application.py", line 135, in site_dispatch
    return request.get_response(instance.pipeline())
  File "/srv/multikarl/production/12/eggs/WebOb-1.1.1-py2.6.egg/webob/request.py", line 1086, in get_response
    application, catch_exc_info=False)
  File "/srv/multikarl/production/12/eggs/WebOb-1.1.1-py2.6.egg/webob/request.py", line 1055, in call_application
    app_iter = application(self.environ, start_response)
  File "/srv/multikarl/production/12/eggs/repoze.urchin-0.2-py2.6.egg/repoze/urchin/__init__.py", line 53, in __call__
    resp = req.get_response(self.app)
  File "/srv/multikarl/production/12/eggs/WebOb-1.1.1-py2.6.egg/webob/request.py", line 1086, in get_response
    application, catch_exc_info=False)
  File "/srv/multikarl/production/12/eggs/WebOb-1.1.1-py2.6.egg/webob/request.py", line 1055, in call_application
    app_iter = application(self.environ, start_response)
  File "/srv/multikarl/production/12/eggs/pyramid-1.2.1-py2.6.egg/pyramid/router.py", line 176, in __call__
    response = self.handle_request(request)
  File "/srv/multikarl/production/12/eggs/pyramid-1.2.1-py2.6.egg/pyramid/tweens.py", line 34, in excview_tween
    response = view_callable(exc, request)
  File "/srv/multikarl/production/12/eggs/pyramid-1.2.1-py2.6.egg/pyramid/config/views.py", line 292, in rendered_view
    result = view(context, request)
  File "/srv/multikarl/production/12/eggs/karl-3.99-py2.6.egg/karl/errorpage.py", line 75, in errorpage
    logging.getLogger('karl').error('\n'.join(message), exc_info=True)
  File "/usr/lib/python2.6/logging/__init__.py", line 1082, in error
    self._log(ERROR, msg, args, **kwargs)
  File "/usr/lib/python2.6/logging/__init__.py", line 1173, in _log
    self.handle(record)
  File "/usr/lib/python2.6/logging/__init__.py", line 1183, in handle
    self.callHandlers(record)
  File "/usr/lib/python2.6/logging/__init__.py", line 1220, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python2.6/logging/__init__.py", line 679, in handle
    self.emit(record)
  File "/srv/multikarl/production/12/eggs/karlserve-1.18-py2.6.egg/karlserve/log.py", line 86, in emit
    log.log(record.levelname, self.subsystem, message, exc_info)
  File "/srv/multikarl/production/12/eggs/karl-3.99-py2.6.egg/karl/redislog.py", line 59, in log
    record = [(key if key else NORECORD) for key in tx.execute()]
  File "/srv/multikarl/production/12/eggs/redis-2.4.11-py2.6.egg/redis/client.py", line 1528, in execute
    return execute(conn, stack)
  File "/srv/multikarl/production/12/eggs/redis-2.4.11-py2.6.egg/redis/client.py", line 1453, in _execute_transaction
    connection.send_packed_command(all_cmds)
  File "/srv/multikarl/production/12/eggs/redis-2.4.11-py2.6.egg/redis/connection.py", line 241, in send_packed_command
    self.connect()
  File "/srv/multikarl/production/12/eggs/redis-2.4.11-py2.6.egg/redis/connection.py", line 189, in connect
    raise ConnectionError(self._error_message(e))
ConnectionError: Error -5 connecting multikarl01.gocept.net:6379. No address associated with hostname.

multikarl01 is the database server (PostgreSQL and Redis). This traceback showed up only *one time*, directly followed by many occurances of the second traceback:

Traceback (most recent call last):
  File "/srv/multikarl/production/12/eggs/Paste-1.7.5.1-py2.6.egg/paste/httpserver.py", line 1068, in process_request_in_thread
    self.finish_request(request, client_address)
  File "/usr/lib/python2.6/SocketServer.py", line 322, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python2.6/SocketServer.py", line 617, in __init__
    self.handle()
  File "/srv/multikarl/production/12/eggs/Paste-1.7.5.1-py2.6.egg/paste/httpserver.py", line 442, in handle
    BaseHTTPRequestHandler.handle(self)
  File "/usr/lib/python2.6/BaseHTTPServer.py", line 329, in handle
    self.handle_one_request()
  File "/srv/multikarl/production/12/eggs/Paste-1.7.5.1-py2.6.egg/paste/httpserver.py", line 437, in handle_one_request
    self.wsgi_execute()
  File "/srv/multikarl/production/12/eggs/Paste-1.7.5.1-py2.6.egg/paste/httpserver.py", line 287, in wsgi_execute
    self.wsgi_start_response)
  File "/srv/multikarl/production/12/eggs/guillotine-0.1-py2.6.egg/guillotine/__init__.py", line 10, in __call__
    return self.app(environ, start_response)
  File "/srv/multikarl/production/12/eggs/repoze.vhm-0.13-py2.6.egg/repoze/vhm/middleware.py", line 106, in __call__
    return self.application(environ, start_response)
  File "/srv/multikarl/production/12/eggs/repoze.browserid-0.3-py2.6.egg/repoze/browserid/middleware.py", line 127, in __call__
    return self.app(environ, start_response)
  File "/srv/multikarl/production/12/eggs/pyramid-1.2.1-py2.6.egg/pyramid/router.py", line 176, in __call__
    response = self.handle_request(request)
  File "/srv/multikarl/production/12/eggs/pyramid-1.2.1-py2.6.egg/pyramid/tweens.py", line 17, in excview_tween
    response = handler(request)
  File "/srv/multikarl/production/12/eggs/pyramid-1.2.1-py2.6.egg/pyramid/router.py", line 153, in handle_request
    response = view_callable(context, request)
  File "/srv/multikarl/production/12/eggs/pyramid-1.2.1-py2.6.egg/pyramid/config/views.py", line 319, in viewresult_to_response
    result = view(context, request)
  File "/srv/multikarl/production/12/eggs/pyramid-1.2.1-py2.6.egg/pyramid/config/views.py", line 402, in _requestonly_view
    response = view(request)
  File "/srv/multikarl/production/12/eggs/karlserve-1.18-py2.6.egg/karlserve/application.py", line 135, in site_dispatch
    return request.get_response(instance.pipeline())
  File "/srv/multikarl/production/12/eggs/WebOb-1.1.1-py2.6.egg/webob/request.py", line 1086, in get_response
    application, catch_exc_info=False)
  File "/srv/multikarl/production/12/eggs/WebOb-1.1.1-py2.6.egg/webob/request.py", line 1055, in call_application
    app_iter = application(self.environ, start_response)
  File "/srv/multikarl/production/12/eggs/repoze.urchin-0.2-py2.6.egg/repoze/urchin/__init__.py", line 53, in __call__
    resp = req.get_response(self.app)
  File "/srv/multikarl/production/12/eggs/WebOb-1.1.1-py2.6.egg/webob/request.py", line 1086, in get_response
    application, catch_exc_info=False)
  File "/srv/multikarl/production/12/eggs/WebOb-1.1.1-py2.6.egg/webob/request.py", line 1055, in call_application
    app_iter = application(self.environ, start_response)
  File "/srv/multikarl/production/12/eggs/pyramid-1.2.1-py2.6.egg/pyramid/router.py", line 176, in __call__
    response = self.handle_request(request) File "/srv/multikarl/production/12/eggs/pyramid-1.2.1-py2.6.egg/pyramid/tweens.py", line 34, in excview_tween
    response = view_callable(exc, request)
  File "/srv/multikarl/production/12/eggs/pyramid-1.2.1-py2.6.egg/pyramid/config/views.py", line 292, in rendered_view
    result = view(context, request)
  File "/srv/multikarl/production/12/eggs/karl-3.99-py2.6.egg/karl/errorpage.py", line 75, in errorpage
    logging.getLogger('karl').error('\n'.join(message), exc_info=True)
  File "/usr/lib/python2.6/logging/__init__.py", line 1082, in error
    self._log(ERROR, msg, args, **kwargs)
  File "/usr/lib/python2.6/logging/__init__.py", line 1173, in _log
    self.handle(record)
  File "/usr/lib/python2.6/logging/__init__.py", line 1183, in handle
    self.callHandlers(record)
  File "/usr/lib/python2.6/logging/__init__.py", line 1220, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python2.6/logging/__init__.py", line 679, in handle
    self.emit(record)
  File "/srv/multikarl/production/12/eggs/karlserve-1.18-py2.6.egg/karlserve/log.py", line 86, in emit
    log.log(record.levelname, self.subsystem, message, exc_info)
  File "/srv/multikarl/production/12/eggs/karl-3.99-py2.6.egg/karl/redislog.py", line 59, in log
    record = [(key if key else NORECORD) for key in tx.execute()]
  File "/srv/multikarl/production/12/eggs/redis-2.4.11-py2.6.egg/redis/client.py", line 1528, in execute
    return execute(conn, stack)
  File "/srv/multikarl/production/12/eggs/redis-2.4.11-py2.6.egg/redis/client.py", line 1453, in _execute_transaction
    connection.send_packed_command(all_cmds)
  File "/srv/multikarl/production/12/eggs/redis-2.4.11-py2.6.egg/redis/connection.py", line 241, in send_packed_command
    self.connect()
  File "/srv/multikarl/production/12/eggs/redis-2.4.11-py2.6.egg/redis/connection.py", line 189, in connect
    raise ConnectionError(self._error_message(e))
ConnectionError: Error 24 connecting multikarl01.gocept.net:6379. Too many open files.

* What helped us getting the application working again was to restart the karl instance, followed by (a few minutes later) the search instance. Timely, the Nagios checks became green again after restarting the search instance. However, we're unsure if solely restarting the search instance would have been sufficient.

Do you have an idea what might have happened here?

Best regards,
Alex

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Handing over to Chris for some brainstorming.

Changed in karl3:
assignee: nobody → Chris Rossi (chris-archimedeanco)
importance: Undecided → Medium
milestone: none → m123
Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Chris, spend a small bit of time giving your thoughts, then stick it in M124 (the milestone for, hopefully, January)

Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

The first stack trace indicates another brief DNS hiccup. Why that cascaded into too many open files I'm not really sure. Reproducing that would be somewhat difficult.

Did anyone happen to use 'lsof' to look at where sockets/files were open and which process(es) had them open? Were the files open on the app server or the db server? Were they associated with the Redis server or is Redis merely the canary in the coal mine here, with perhaps a Postgres client gobbling up all the sockets?

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Chris answered, back over to gocept.

Changed in karl3:
assignee: Chris Rossi (chris-archimedeanco) → Alexander Bittner (yb)
milestone: m123 → m124
status: New → In Progress
Revision history for this message
Christian Theune (ctheune) wrote :

Interesting question. We saw the same thing yesterday which also sounded like another resolve issue in the first place.

However, we have a hard time reading those logs as the output doesn't contain any timestamps. I'll make another bug regarding that.

Revision history for this message
Christian Theune (ctheune) wrote :
Download full text (3.9 KiB)

I found a proof of what's going on.

It appears that the karl process is leaking file handles:

paster 24833 multikarl 839u REG 253,1 3279 385234 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 840u REG 253,1 3279 385234 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 841u REG 253,1 3279 385234 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 842u REG 253,1 3279 385234 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 843u REG 253,1 3279 385234 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 844u REG 253,1 3086 385285 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 845u REG 253,1 3086 385285 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 846u REG 253,1 3086 385285 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 847u REG 253,1 3086 385285 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 848u REG 253,1 3086 385285 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 849u REG 253,1 3086 385285 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 850u REG 253,1 3086 385285 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 851u REG 253,1 3086 385285 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 852u REG 253,1 3086 385285 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 853u REG 253,1 3086 385285 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 854u REG 253,1 3086 385539 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 855u REG 253,1 3086 385539 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 856u REG 253,1 3086 385539 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 857u REG 253,1 3086 385539 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 858u REG 253,1 3086 385539 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 859u REG 253,1 3086 385539 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 860u REG 253,1 3086 385539 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 861u REG 253,1 3086 385539 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 862u REG 253,1 3086 385539 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 863u REG 253,1 3086 385539 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 864u REG 253,1 3086 385539 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 865u REG 253,1 3086 385539 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 866u REG 253,1 3086 385539 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 867u REG 253,1 3086 385539 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 868u REG 253,1 3086 385539 /var/tmp/HTTP_1254 (deleted)
paster 24833 multikarl 869u REG 253,1 3086 385539 /v...

Read more...

Changed in karl3:
assignee: Alexander Bittner (yb) → nobody
Revision history for this message
Christian Theune (ctheune) wrote :

Gah. I seem I can't assign it back to Chris ... :/

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Nat is the POC for work that might be scheduled in Q1. We'll park it in the M124 "figure out Q1 later" milestone.

Changed in karl3:
assignee: nobody → Nat Katin-Borland (nborland)
Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

I've taken a guess at something that might be the problem and made a stab at fixing it. I've patched multikarl production to use the fix so I can see whether or not it works. It involves the kerberos login and possible failure to clean up if there are errors. We'll need to monitor to see whether the filehandles start to stack up again or not.

Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

I've been monitoring multikarl since Wednesday and haven't found any more dangling filehandles, so I'm going to assert a 90% probability that I really did fix the problem. Multikarl is running with a patched installation at the momeent. The fix is on master and will be incorporated into the next official release of Karl. I will mark this as fixed.

Changed in karl3:
status: In Progress → Fix Committed
Revision history for this message
Christian Theune (ctheune) wrote :

FYI ... I just verified the logs since then and haven't found any other occurence. However, I'd rather wait a couple of days until traffic picks up in the new year before declaring a "win". :)

Revision history for this message
Paul Everitt (paul-agendaless) wrote : Re: [Bug 1084603] [MultiKARL] ConnectionError: Error 24 connecting multikarl01.gocept.net:6379. Too many open files.
Download full text (12.7 KiB)

Sounds good, thanks for following up on this.

--Paul

On Jan 3, 2013, at 8:28 AM, Christian Theune <email address hidden> wrote:

> FYI ... I just verified the logs since then and haven't found any other
> occurence. However, I'd rather wait a couple of days until traffic picks
> up in the new year before declaring a "win". :)
>
> --
> You received this bug notification because you are subscribed to KARL3.
> https://bugs.launchpad.net/bugs/1084603
>
> Title:
> [MultiKARL] ConnectionError: Error 24 connecting
> multikarl01.gocept.net:6379. Too many open files.
>
> Status in KARL3:
> Fix Committed
>
> Bug description:
> Hello,
>
> about half an hour ago, we experienced a problem with our MultiKARL
> (Ariadne, Oxfam, Privacy) installation which made the application
> unavailable. The timeline:
>
> * Nagios alerted us that Ariadne, Oxfam, and Privacy were not available any more as well as that the KARL process itself was not responding to HTTP requests any more
> * supervisorctl showed all processes as running (karl, search, mailin, mailout)
> * However, we've seen the following two tracebacks when tailing the log for KARL:
>
> Traceback (most recent call last):
> File "/srv/multikarl/production/12/eggs/Paste-1.7.5.1-py2.6.egg/paste/httpserver.py", line 1068, in process_request_in_thread
> self.finish_request(request, client_address)
> File "/usr/lib/python2.6/SocketServer.py", line 322, in finish_request
> self.RequestHandlerClass(request, client_address, self)
> File "/usr/lib/python2.6/SocketServer.py", line 617, in __init__
> self.handle()
> File "/srv/multikarl/production/12/eggs/Paste-1.7.5.1-py2.6.egg/paste/httpserver.py", line 442, in handle
> BaseHTTPRequestHandler.handle(self)
> File "/usr/lib/python2.6/BaseHTTPServer.py", line 329, in handle
> self.handle_one_request()
> File "/srv/multikarl/production/12/eggs/Paste-1.7.5.1-py2.6.egg/paste/httpserver.py", line 437, in handle_one_request
> self.wsgi_execute()
> File "/srv/multikarl/production/12/eggs/Paste-1.7.5.1-py2.6.egg/paste/httpserver.py", line 287, in wsgi_execute
> self.wsgi_start_response)
> File "/srv/multikarl/production/12/eggs/guillotine-0.1-py2.6.egg/guillotine/__init__.py", line 10, in __call__
> return self.app(environ, start_response)
> File "/srv/multikarl/production/12/eggs/repoze.vhm-0.13-py2.6.egg/repoze/vhm/middleware.py", line 106, in __call__
> return self.application(environ, start_response)
> File "/srv/multikarl/production/12/eggs/repoze.browserid-0.3-py2.6.egg/repoze/browserid/middleware.py", line 127, in __call__
> return self.app(environ, start_response)
> File "/srv/multikarl/production/12/eggs/pyramid-1.2.1-py2.6.egg/pyramid/router.py", line 176, in __call__
> response = self.handle_request(request)
> File "/srv/multikarl/production/12/eggs/pyramid-1.2.1-py2.6.egg/pyramid/tweens.py", line 17, in excview_tween
> response = handler(request)
> File "/srv/multikarl/production/12/eggs/pyramid-1.2.1-py2.6.egg/pyramid/router.py", line 153, in handle_request
> response = view_callable(context, request)
> File "/srv/multikarl/production/12/eggs/...

Revision history for this message
Christian Theune (ctheune) wrote :

We haven't seen this issue at all since the bugfix so I declare victory. Someone can mark this as solved, please? :)

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Awesome. :)

Changed in karl3:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.