Comment 11 for bug 1338271

Paul Everitt (paul-agendaless) wrote : Re: [Bug 1338271] Re: Analyze OSF DB to estimate win by not caching extracted text

I really like the idea of the upper limit at 64k.

Would you prefer to make a new ticket for actually doing something, or to modify this ticket?

--Paul

On Jul 10, 2014, at 11:22 AM, Chris Rossi <email address hidden> wrote:

> It's not.
>
>
> On Thu, Jul 10, 2014 at 11:05 AM, Paul Everitt <email address hidden> wrote:
>
>> In theory, to the extent that we care about database size, the
>> extracted_text inflation is more severe in the repozitory database, *if*
>> extracted_text is even part of what gets serialized.
>>
>> --Paul
>>
>> On Jul 10, 2014, at 10:15 AM, Chris Rossi <email address hidden>
>> wrote:
>>
>>> So I wrote a couple of scripts, attached, to take a look at what
>>> extracted text we have in the database. The most useful view I could
>>> come up with was a histogram showing the distribution of objects by
>>> their cached data size. Here's the result of that analysis:
>>>
>>> chris@curiosity:~/proj/karl/dev$ python histogram.py
>>> total_objects: 70093
>>> total_bytes: 1.80g
>>> median size: 7.21k
>>> distribution:
>>> 1k: 11389
>>> 4k: 13862
>>> 16k: 23052
>>> 64k: 15637
>>> 256k: 4958
>>> 1m: 1134
>>> 4m: 59
>>> 16m: 2
>>>
>>> What we see here is that there are about 70k documents with any
>>> extracted text at all. Of those, the vast majority are under 64kb, with
>>> a median of about 7kb.
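>>>
>>> For reference, the bucketing behind that report is roughly along these
>>> lines (a minimal sketch of what histogram.py might do; how the
>>> per-object sizes are pulled out of the ZODB is elided, since that
>>> depends on how the content objects are reached):
>>>
>>> # Bucket upper bounds and labels, matching the report above.
>>> BUCKETS = [(1 << 10, '1k'), (4 << 10, '4k'), (16 << 10, '16k'),
>>>            (64 << 10, '64k'), (256 << 10, '256k'),
>>>            (1 << 20, '1m'), (4 << 20, '4m'), (16 << 20, '16m')]
>>>
>>> def report(sizes):
>>>     # sizes: length in bytes of each object's cached extracted text
>>>     sizes = sorted(sizes)
>>>     counts = [0] * len(BUCKETS)
>>>     for size in sizes:
>>>         for i, (limit, _) in enumerate(BUCKETS):
>>>             if size <= limit:
>>>                 counts[i] += 1
>>>                 break
>>>     print('total_objects: %d' % len(sizes))
>>>     print('total_bytes: %.2fg' % (sum(sizes) / float(1 << 30)))
>>>     print('median size: %.2fk' % (sizes[len(sizes) // 2] / 1024.0))
>>>     print('distribution:')
>>>     for (_, label), count in zip(BUCKETS, counts):
>>>         print('  %s: %d' % (label, count))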
>>>
>>> On the whole, this doesn't strike me as too alarming. But, indeed, we
>>> are loading some extra bytes when unpickling those objects. Since the
>>> extracted data cache is no longer being used, removing the cached data
>>> should be a clear win, reducing the total database size by about 1.8GB.
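>>>
>>> A one-off cleanup pass could be as simple as the following sketch
>>> (the attribute name and the document iterable are placeholders, not
>>> the actual KARL names):
>>>
>>> import transaction
>>>
>>> def remove_cached_text(documents, attr='_extracted_data', batch=500):
>>>     # `documents` is whatever iterable of content objects the real
>>>     # script walks; `attr` stands in for the real cache attribute.
>>>     removed = 0
>>>     for doc in documents:
>>>         if getattr(doc, attr, None) is not None:
>>>             delattr(doc, attr)
>>>             removed += 1
>>>             if removed % batch == 0:
>>>                 # commit in batches rather than one huge transaction
>>>                 transaction.commit()
>>>     transaction.commit()
>>>     return removed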
>>>
>>> The ticket where the extracted data cache was removed didn't really
>>> address the point of having the cache to begin with, namely to speed up
>>> reindexing the text index. This can be pretty useful to have when
>>> making changes to the text index that require reindexing. Although the
>>> text index has been pretty stable for the last couple of years, there
>>> are some changes in the pipeline that will probably require reindexing.
>>> This could be an argument for reinstating the cache. I'm happy to
>>> recuse myself from that discussion.
>>>
>>> *If* the cache were reinstated, there are some strategies for mitigating
>>> the negative impact of the cache on everyday performance (both are
>>> sketched below):
>>>
>>> - The cached text could be made its own persistent object, so it isn't
>>> loaded from the ZODB and unpickled unless explicitly accessed, which
>>> should only happen when reindexing a document.
>>>
>>> - We could also set an upper limit on the size of the cached text for
>>> any particular document. For example, if we made the upper limit 64kb,
>>> the vast majority of documents could still benefit from the reindexing
>>> speedup, while particularly high-impact documents would be prevented
>>> from burdening the database with extra data. Re-extracting text from
>>> this smaller set of documents would add relatively little to the total
>>> reindexing time. I just calculated the amount of space that would be
>>> saved by limiting the cached text size to 64kb, and got 1.1GB, which is
>>> more than half of the total space currently used.
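>>>
>>> A rough sketch of how those two ideas could fit together (the class
>>> and attribute names here are illustrative, not the actual KARL API):
>>>
>>> from persistent import Persistent
>>>
>>> MAX_CACHED_TEXT = 64 * 1024  # the 64kb cap discussed above
>>>
>>> class CachedText(Persistent):
>>>     # A separate persistent object, so the parent document can be
>>>     # unpickled without also loading the extracted text.
>>>     def __init__(self, text):
>>>         self.text = text
>>>
>>> def cache_extracted_text(doc, text):
>>>     # Cache only reasonably small extractions; oversized documents
>>>     # simply get re-extracted at reindex time.
>>>     if len(text) <= MAX_CACHED_TEXT:
>>>         doc._cached_text = CachedText(text)
>>>     else:
>>>         doc._cached_text = None
>>>
>>> def text_for_indexing(doc, extract):
>>>     cached = getattr(doc, '_cached_text', None)
>>>     if cached is not None:
>>>         return cached.text   # loads only this small object
>>>     return extract(doc)      # fall back to re-extraction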
>>>
>>> *If*, of course, the decision is not to reinstate the cache, there's
>>> not much point in keeping it around.
>>>
>>> ** Changed in: karl3
>>> Status: In Progress => Fix Committed
>>>
>>> --
>>> You received this bug notification because you are subscribed to the bug
>>> report.
>>> https://bugs.launchpad.net/bugs/1338271
>>>
>>> Title:
>>> Analyze OSF DB to estimate win by not caching extracted text
>>>
>>> Status in KARL3:
>>> Fix Committed
>>>
>>> Bug description:
>>> At PyCon, Christian was trying to analyze memory spikes and usage for
>>> the ZODB cache. We were/are having trouble getting a stable ZODB cache
>>> size on object counts. A number that is steady for weeks suddenly
>>> spikes.
>>>
>>> Christian noted that we had some objects in cache that were way too
>>> big. Investigation showed that they had the old "extracted text" hack
>>> we did, where we keep a copy of the extracted content from
>>> HTML/Office/PDF etc. to speed up reindexing on evolves etc.
>>>
>>> For this task, write a console script that does this same analysis, on
>>> karlstaging, and gives us an idea of the scale of the problem.
>>> Deliberately vague statement of the work, as you need to use some
>>> judgement.
>>>
>>> To manage notifications about this bug go to:
>>> https://bugs.launchpad.net/karl3/+bug/1338271/+subscriptions