memory leak for _safe_log in openstack.common.rpc

Bug #1365892 reported by jazeltq
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
oslo.messaging
Fix Released
High
jazeltq

Bug Description

we first find this bug in ceilometer.collector, after create 50k instances, the ceilometer.collectos uses 5.7 G memory. So we think there is an memory leak. After some research on this issue we think the problem is in function _safe_log.
def _safe_log(log_func, msg, msg_data):
    """Sanitizes the msg_data field before logging."""
    SANITIZE = ['_context_auth_token', 'auth_token', 'new_pass']

    def _fix_passwords(d):
        """Sanitizes the password fields in the dictionary."""
        for k in d.iterkeys():
            if k.lower().find('password') != -1:
                d[k] = '<SANITIZED>'
            elif k.lower() in SANITIZE:
                d[k] = '<SANITIZED>'
            elif isinstance(d[k], dict):
                _fix_passwords(d[k])
        return d

    return log_func(msg, _fix_passwords(copy.deepcopy(msg_data)))

we find that in gc.garbage, there are many things like those:
['_context_auth_token', 'auth_token', 'new_pass'],
 (<cell at 0x363e1a0: list object at 0x363cdd0>,
  <cell at 0x363e130: function object at 0x361a7d0>),
 <function _fix_passwords at 0x361a7d0>,
 <cell at 0x363e2f0: list object at 0x361b368>,
 <cell at 0x363e4e8: function object at 0x361a8c0>,
 ['_context_auth_token', 'auth_token', 'new_pass'],
 (<cell at 0x363e2f0: list object at 0x361b368>,
  <cell at 0x363e4e8: function object at 0x361a8c0>),
 <function _fix_passwords at 0x361a8c0>,
 <cell at 0x363e520: list object at 0x3640440>,
 <cell at 0x363e4b0: function object at 0x361a9b0>,
 ['_context_auth_token', 'auth_token', 'new_pass'],
 (<cell at 0x363e520: list object at 0x3640440>,
  <cell at 0x363e4b0: function object at 0x361a9b0>),
 <function _fix_passwords at 0x361a9b0>,
 <cell at 0x363e670: list object at 0x361b4d0>,
 <cell at 0x363e868: function object at 0x361aa28>,
 ['_context_auth_token', 'auth_token', 'new_pass'],
 (<cell at 0x363e670: list object at 0x361b4d0>,
  <cell at 0x363e868: function object at 0x361aa28>),

And we change the function a little, like this
def _safe_log(log_func, msg, msg_data):
    """Sanitizes the msg_data field before logging."""
    SANITIZE = ['_context_auth_token', 'auth_token', 'new_pass']

    def _fix_passwords(d):
        """Sanitizes the password fields in the dictionary."""
        for k in six.iterkeys(d):
            if k.lower().find('password') != -1:
                d[k] = '<SANITIZED>'
            elif k.lower() in SANITIZE:
                d[k] = '<SANITIZED>'
            elif isinstance(d[k], list):
                for e in d[k]:
                    if isinstance(e, dict):
                        _fix_passwords(e)
            elif isinstance(d[k], dict):
                _fix_passwords(d[k])
        return d
    res = log_func(msg, _fix_passwords(copy.deepcopy(msg_data)))
    _fix_passwords = None
    return res

After those, change, we find there are no more things find above.

Revision history for this message
jazeltq (jazeltq-k) wrote :
Changed in ceilometer:
assignee: nobody → jazeltq (jazeltq-k)
description: updated
jazeltq (jazeltq-k)
affects: ceilometer → oslo-incubator
no longer affects: oslo-incubator
jazeltq (jazeltq-k)
Changed in oslo.messaging:
assignee: nobody → jazeltq (jazeltq-k)
jazeltq (jazeltq-k)
description: updated
jazeltq (jazeltq-k)
Changed in oslo.messaging:
status: New → In Progress
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)

Fix proposed to branch: master
Review: https://review.openstack.org/119524

Changed in oslo.messaging:
importance: Undecided → High
Revision history for this message
jazeltq (jazeltq-k) wrote :

After some research on this issue, and some test code, i found the gc can not collect recursive closure.
The test code is
#!/usr/bin/env python

import ipdb
import gc

gc.set_debug(gc.DEBUG_LEAK)

def A():
 N = [1, 200, 300, 400]
 def aa(n):
  if n in N:
   return 1
  else:
   return n * aa(n-1)

 x = 33 + aa(10)
 #ipdb.set_trace()
 print x

if __name__ == '__main__':
 while xrange(1000):
  A()

you can run python x.py
and use top -p pid to get the stat of that process. On my box, the rss is used very quickly.

When this function is called, it will generate a new function object. And the gc can not collect the old function object.
So the memory leak is happened.
We can whether put the inner function outer or cut the cycle before the function returned.

Revision history for this message
jazeltq (jazeltq-k) wrote :

And in https://thp.io/2012/python-gc/python_gc_final_2012-01-22.pdf
it says in page 5 that:
We observe that Python implementations with distinct GCs behave differently: CPython does not even try to get the order of
finalizers right, and simply puts uncollectable objects into the global list of garbage for the developer to deal with manually.

So CPython can not handle those cycle reference.

Revision history for this message
jazeltq (jazeltq-k) wrote :

The solution is not very work well, it just reduce the memory leak very quickly.
 After cut the cycle, i test ceilometer-collector with rally.
what rally do is boot an instances, run a command, delete the instances, the pipeline interval is 60.
After 3k, the ceilometer-collector use 1.2 G, and the memory is still going to higher. But i do no know why, i am going to
research continually. If someone met this problem and want to post you solutions, and consideration, please let me know.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.openstack.org/119524
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=93283f2c2de4d36e6f95ba8d49c8eb0e4a317c13
Submitter: Jenkins
Branch: master

commit 93283f2c2de4d36e6f95ba8d49c8eb0e4a317c13
Author: jazeltq <jazeltq@163.com>
Date: Sat Sep 6 10:43:09 2014 +0800

    fix memory leak for function _safe_log

    In _safe_log it uses recursive closure. The gc can not free stack for
    this recursive closure for cycle-reference. So cut down the cycle
    to fix it.

    Resolved-bug: #1365892
    Change-Id: I05fbd7091b886344d76b172a5ad66d9225599d69

Revision history for this message
jazeltq (jazeltq-k) wrote :

the storage use pymongo2.6 to insert data to mongo, but this version will cause memory leak. If you met the same problem, please
upgrade you pymongo too.

https://jira.mongodb.org/browse/PYTHON-564

Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
status: In Progress → Fix Committed
Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
milestone: none → 1.5.0
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.