Analyze OSF DB to estimate win by not caching extracted text

Bug #1338271 reported by Paul Everitt
This bug affects 1 person
Affects: KARL3
Status: Fix Released
Importance: Medium
Assigned to: Chris Rossi

Bug Description

At PyCon, Christian was trying to analyze memory spikes and usage for the ZODB cache. We were/are having trouble getting a stable ZODB cache size based on object counts: a number that is steady for weeks suddenly spikes.

Christian noted that we had some objects in cache that were way too big. Investigation showed that they still carried the old "extracted text" hack, where we keep a copy of the content extracted from HTML/Office/PDF documents to speed up reindexing during evolves and the like.

For this task, write a console script that does this same analysis, on karlstaging, and gives us an idea of the scale of the problem. Deliberately vague statement of the work, as you need to use some judgement.

Tags: r3.127
Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Let's see if we can get a decent console script together to collect some facts. I made Christian and Tres nosy on this, they can dump any historical points I left out.

Changed in karl3:
assignee: nobody → Chris Rossi (chris-archimedeanco)
importance: Undecided → Medium
milestone: none → m138
Revision history for this message
Christian Theune (ctheune) wrote :

Sorry for losing those scripts - I guess if you don't check something in and push it, then it's gone ... :)

Anyway, the biggest point for me was to simply connect those scripts directly to the PostgreSQL DB using psycopg and analyze the hell out of the raw pickle strings, instead of going through RelStorage during the diagnosis.
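
For reference, a minimal sketch of that kind of direct analysis (not Christian's lost script), assuming a history-free RelStorage schema with an object_state table and a hypothetical connection string; the marker scanned for is the '_extracted_data' attribute name mentioned later in this ticket:

    import psycopg2

    DSN = "dbname=karlstaging user=karl host=localhost"  # hypothetical connection string

    def scan_pickles(dsn=DSN, marker=b"_extracted_data"):
        # Stream raw pickles straight out of PostgreSQL, bypassing RelStorage/ZODB.
        conn = psycopg2.connect(dsn)
        cur = conn.cursor(name="pickle_scan")  # server-side cursor keeps memory flat
        cur.execute("SELECT zoid, state FROM object_state")
        hits = 0
        total_bytes = 0
        for zoid, state in cur:
            if state is None:
                continue
            raw = bytes(state)
            if marker in raw:  # cheap test: the attribute name appears in the pickle
                hits += 1
                total_bytes += len(raw)
        cur.close()
        conn.close()
        return hits, total_bytes

    if __name__ == "__main__":
        hits, total = scan_pickles()
        print("objects mentioning _extracted_data: %d, total pickle bytes: %d" % (hits, total))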

tags: added: r3.127
Changed in karl3:
status: New → In Progress
Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

We are no longer caching extracted text:

3.121 (2014-04-21)
------------------

...

- Stop caching extracted text (the '_extracted_data' attribute) for files.
  Add a script to remove the cached data from existing instances.
  See https://bugs.launchpad.net/karl3/+bug/1309688
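
The cleanup script itself isn't reproduced here; a minimal sketch of the idea, assuming folderish KARL content reachable from the site root and batched commits, might look like this:

    import transaction

    def remove_extracted_data(site, batch_size=500):
        # Walk the content tree and drop any cached '_extracted_data' attribute.
        removed = 0
        stack = [site]
        while stack:
            obj = stack.pop()
            if "_extracted_data" in getattr(obj, "__dict__", {}):
                del obj._extracted_data
                removed += 1
                if removed % batch_size == 0:
                    transaction.commit()  # keep individual transactions small
            values = getattr(obj, "values", None)
            if callable(values):
                stack.extend(values())  # descend into folderish containers
        transaction.commit()
        return removed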

Changed in karl3:
status: In Progress → Invalid
Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

As it turns out, we no longer cache the extracted text, but we haven't gone back and deleted it.

Changed in karl3:
status: Invalid → In Progress
Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

So I wrote a couple of scripts, attached, to take a look at what extracted text we have in the database. The most useful view I could come up with is a histogram showing the distribution of objects by the size of their cached data. Here's the result of that analysis:

chris@curiosity:~/proj/karl/dev$ python histogram.py
total_objects: 70093
total_bytes: 1.80g
median size: 7.21k
distribution:
  1k: 11389
  4k: 13862
 16k: 23052
 64k: 15637
256k: 4958
  1m: 1134
  4m: 59
 16m: 2

What we see here is that about 70k documents have any cached extracted text at all. Of those, the vast majority are under 64kb, with a median of about 7kb.

On the whole, this doesn't strike me as too alarming. But, indeed, we are loading some extra bytes when unpickling those objects. Since the extracted data cache is no longer being used, removing the cached data should be a clear win, reducing the total database size by about 1.8GB.
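
Not the attached script itself, but a condensed sketch of the kind of computation histogram.py performs, given an iterable of per-object cached-text sizes in bytes (which could be produced by something like the psycopg scan sketched earlier):

    import statistics

    # Power-of-4 bucket upper bounds matching the labels in the output above.
    BUCKETS = [("1k", 2**10), ("4k", 2**12), ("16k", 2**14), ("64k", 2**16),
               ("256k", 2**18), ("1m", 2**20), ("4m", 2**22), ("16m", 2**24)]

    def human(n):
        # Format a byte count with the k/m/g suffixes used in the output above.
        n = float(n)
        for unit in ("b", "k", "m", "g"):
            if n < 1024:
                return "%.2f%s" % (n, unit)
            n /= 1024.0
        return "%.2ft" % n

    def report(sizes):
        # Print totals, the median, and per-bucket counts for the given sizes.
        sizes = sorted(sizes)
        counts = [0] * len(BUCKETS)
        for size in sizes:
            for i, (_, limit) in enumerate(BUCKETS):
                if size <= limit:
                    counts[i] += 1
                    break
        print("total_objects: %d" % len(sizes))
        print("total_bytes: %s" % human(sum(sizes)))
        print("median size: %s" % human(statistics.median(sizes)))
        print("distribution:")
        for (label, _), count in zip(BUCKETS, counts):
            print("%4s: %d" % (label, count))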

The ticket where the extracted data cache was removed didn't really address the point of having the cache to begin with, namely to speed up reindexing the text index. This can be pretty useful when making changes to the text index that require reindexing. Although the text index has been pretty stable for the last couple of years, there are some changes in the pipeline that will probably require reindexing. This could be an argument for reinstating the cache. I'm happy to recuse myself from that discussion.

*If* the cache were reinstated, there are some strategies for mitigating the negative impact of the cache on everyday performance:

- The cached text could be made its own persistent object, so it isn't loaded from the ZODB and unpickled unless explicitly accessed, which should only happen when reindexing a document.

- We could also set an upper limit to the size of the cached text for any particular document. For example, if we made the upper limit 64kb, the vast majority of documents could still benefit from the reindexing speedup while particularly high impact documents could be prevented from burdening the database with extra data. Re-extracting text from this smaller set of documents would add much less time to the total time to reindex. I just calculated the amount of space that would be saved by limiting the cached text size to 64kb, and got 1.1GB, which is more than half of the total space currently used.
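
A minimal sketch of how the two mitigations could be combined, purely illustrative (ExtractedTextCache, MAX_CACHED_TEXT, and the helper functions below are not existing KARL code):

    from persistent import Persistent

    MAX_CACHED_TEXT = 64 * 1024  # the 64kb cap discussed above

    class ExtractedTextCache(Persistent):
        # Holds the extracted text in its own ZODB record, so loading the
        # document itself never pulls the text into memory.
        def __init__(self, text):
            self.text = text

    def cache_extracted_text(document, text):
        # Only cache text small enough to be worth keeping around.
        if len(text) <= MAX_CACHED_TEXT:
            document._extracted_data = ExtractedTextCache(text)
        else:
            document._extracted_data = None  # oversized: re-extract at reindex time

    def text_for_indexing(document, extract):
        # Prefer the cached text; fall back to running the extractor.
        cached = getattr(document, "_extracted_data", None)
        if cached is not None:
            return cached.text  # loads only the small cache record
        return extract(document)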

*If*, of course, there is no decision to reinstate the cache, there's not much point keeping it around.

Changed in karl3:
status: In Progress → Fix Committed
Revision history for this message
Paul Everitt (paul-agendaless) wrote : Re: [Bug 1338271] Re: Analyze OSF DB to estimate win by not caching extracted text

In theory, to the extent that we care about database size, the extracted_text inflation is more severe in the repozitory database, *if* extracted_text is even part of what gets serialized.

--Paul

Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

It's not.

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

I really like the idea of the upper limit at 64k.

Would you prefer to make a new ticket for actually doing something, or modify this ticket?

--Paul

Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

Let's make a new ticket, since it involves not only cleaning up and changing the way the cached data is stored, but also bringing back the code that maintained and used the cache to begin with.

Chris

Changed in karl3:
status: Fix Committed → Fix Released