Comment 11 for bug 1338271

Paul Everitt (paul-agendaless) wrote : Re: [Bug 1338271] Re: Analyze OSF DB to estimate win by not caching extracted text

I really like the idea of the upper limit at 64k.

Would you prefer to make a new ticket for actually doing something, or to modify this ticket?

--Paul

On Jul 10, 2014, at 11:22 AM, Chris Rossi <email address hidden> wrote:

> It's not.
>
>
> On Thu, Jul 10, 2014 at 11:05 AM, Paul Everitt <email address hidden> wrote:
>
>> In theory, to the extent that we care about database size, the
>> extracted_text inflation is more severe in the repozitory database, *if*
>> extracted_text is even part of what gets serialized.
>>
>> --Paul
>>
>> On Jul 10, 2014, at 10:15 AM, Chris Rossi <email address hidden>
>> wrote:
>>
>>> So I wrote a couple of scripts, attached, to take a look at what
>>> extracted text we have in the database. The most useful view I could
>>> come up with was a histogram showing the distribution of objects by
>>> their cached data size. Here's the result of that analysis:
>>>
>>> chris@curiosity:~/proj/karl/dev$ python histogram.py
>>> total_objects: 70093
>>> total_bytes: 1.80g
>>> median size: 7.21k
>>> distribution:
>>> 1k: 11389
>>> 4k: 13862
>>> 16k: 23052
>>> 64k: 15637
>>> 256k: 4958
>>> 1m: 1134
>>> 4m: 59
>>> 16m: 2
>>>
>>> What we see here is that there are about 70k documents with any
>>> extracted text at all. Of those, the vast majority are under 64kb, with
>>> a median of about 7kb.
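>>>
>>> For reference, the bucketing behind that report is roughly along these
>>> lines (a minimal sketch of what histogram.py might do; how the
>>> per-object sizes are pulled out of the ZODB is elided, since that
>>> depends on how the content objects are reached):
>>>
>>> # Bucket upper bounds and labels, matching the report above.
>>> BUCKETS = [(1 << 10, '1k'), (4 << 10, '4k'), (16 << 10, '16k'),
>>>            (64 << 10, '64k'), (256 << 10, '256k'),
>>>            (1 << 20, '1m'), (4 << 20, '4m'), (16 << 20, '16m')]
>>>
>>> def report(sizes):
>>>     # sizes: length in bytes of each object's cached extracted text
>>>     sizes = sorted(sizes)
>>>     counts = [0] * len(BUCKETS)
>>>     for size in sizes:
>>>         for i, (limit, _) in enumerate(BUCKETS):
>>>             if size <= limit:
>>>                 counts[i] += 1
>>>                 break
>>>     print('total_objects: %d' % len(sizes))
>>>     print('total_bytes: %.2fg' % (sum(sizes) / float(1 << 30)))
>>>     print('median size: %.2fk' % (sizes[len(sizes) // 2] / 1024.0))
>>>     print('distribution:')
>>>     for (_, label), count in zip(BUCKETS, counts):
>>>         print('  %s: %d' % (label, count))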
>>>
>>> On the whole, this doesn't strike me as too alarming. But, indeed, we
>>> are loading some extra bytes when unpickling those objects. Since the
>>> extracted data cache is no longer being used, removing the cached data
>>> should be a clear win, reducing the total database size by about 1.8GB.
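>>>
>>> A one-off cleanup pass could be as simple as the following sketch
>>> (the attribute name and the document iterable are placeholders, not
>>> the actual KARL names):
>>>
>>> import transaction
>>>
>>> def remove_cached_text(documents, attr='_extracted_data', batch=500):
>>>     # `documents` is whatever iterable of content objects the real
>>>     # script walks; `attr` stands in for the real cache attribute.
>>>     removed = 0
>>>     for doc in documents:
>>>         if getattr(doc, attr, None) is not None:
>>>             delattr(doc, attr)
>>>             removed += 1
>>>             if removed % batch == 0:
>>>                 # commit in batches rather than one huge transaction
>>>                 transaction.commit()
>>>     transaction.commit()
>>>     return removed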
>>>
>>> The ticket where the extracted data cache was removed didn't really
>>> address the point of having the cache to begin with, namely to speed up
>>> reindexing the text index. This can be pretty useful to have when
>>> making changes to the text index that require reindexing. Although the
>>> text index has been pretty stable for the last couple of years, there
>>> are some changes in the pipeline that will probably require reindexing.
>>> This could be an argument for reinstating the cache. I'm happy to
>>> recuse myself from that discussion.
>>>
>>> *If* the cache were reinstated, there are some strategies for mitigating
>>> the negative impact of the cache on everyday performance (both are
>>> sketched below):
>>>
>>> - The cached text could be made its own persistent object, so it isn't
>>> loaded from the ZODB and unpickled unless explicitly accessed, which
>>> should only happen when reindexing a document.
>>>
>>> - We could also set an upper limit on the size of the cached text for
>>> any particular document. For example, if we made the upper limit 64kb,
>>> the vast majority of documents could still benefit from the reindexing
>>> speedup, while particularly high-impact documents would be prevented
>>> from burdening the database with extra data. Re-extracting text from
>>> this smaller set of documents would add relatively little to the total
>>> reindexing time. I just calculated the amount of space that would be
>>> saved by limiting the cached text size to 64kb, and got 1.1GB, which is
>>> more than half of the total space currently used.
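>>>
>>> A rough sketch of how those two ideas could fit together (the class
>>> and attribute names here are illustrative, not the actual KARL API):
>>>
>>> from persistent import Persistent
>>>
>>> MAX_CACHED_TEXT = 64 * 1024  # the 64kb cap discussed above
>>>
>>> class CachedText(Persistent):
>>>     # A separate persistent object, so the parent document can be
>>>     # unpickled without also loading the extracted text.
>>>     def __init__(self, text):
>>>         self.text = text
>>>
>>> def cache_extracted_text(doc, text):
>>>     # Cache only reasonably small extractions; oversized documents
>>>     # simply get re-extracted at reindex time.
>>>     if len(text) <= MAX_CACHED_TEXT:
>>>         doc._cached_text = CachedText(text)
>>>     else:
>>>         doc._cached_text = None
>>>
>>> def text_for_indexing(doc, extract):
>>>     cached = getattr(doc, '_cached_text', None)
>>>     if cached is not None:
>>>         return cached.text   # loads only this small object
>>>     return extract(doc)      # fall back to re-extraction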
>>>
>>> *If*, of course, the decision is not to reinstate the cache, there's
>>> not much point in keeping it around.
>>>
>>> ** Changed in: karl3
>>> Status: In Progress => Fix Committed
>>>
>>> --
>>> You received this bug notification because you are subscribed to the bug
>>> report.
>>> https://bugs.launchpad.net/bugs/1338271
>>>
>>> Title:
>>> Analyze OSF DB to estimate win by not caching extracted text
>>>
>>> Status in KARL3:
>>> Fix Committed
>>>
>>> Bug description:
>>> At PyCon, Christian was trying to analyze memory spikes and usage for
>>> the ZODB cache. We were/are having trouble getting a stable ZODB cache
>>> size on object counts. A number that is steady for weeks suddenly
>>> spikes.
>>>
>>> Christian noted that we had some objects in cache that were way too
>>> big. Investigation showed that they had the old "extracted text" hack
>>> we did, where we keep a copy of the extracted content from
>>> HTML/Office/PDF etc. to speed up reindexing on evolves etc.
>>>
>>> For this task, write a console script that does this same analysis, on
>>> karlstaging, and gives us an idea of the scale of the problem.
>>> Deliberately vague statement of the work, as you need to use some
>>> judgement.
>>>
>>> To manage notifications about this bug go to:
>>> https://bugs.launchpad.net/karl3/+bug/1338271/+subscriptions