Analyze OSF DB to estimate win by not caching extracted text

Bug #1338271 reported by Paul Everitt
This bug affects 1 person
Affects: KARL3
Status: Fix Released
Importance: Medium
Assigned to: Chris Rossi

Bug Description

At PyCon, Christian was trying to analyze memory spikes and usage for the ZODB cache. We were/are having trouble getting a stable ZODB cache size based on object counts: a number that is steady for weeks suddenly spikes.

Christian noted that we had some objects in cache that were way too big. Investigation showed that they still carried the old "extracted text" hack, where we keep a copy of the content extracted from HTML/Office/PDF documents to speed up reindexing during evolves and the like.

For this task, write a console script that does this same analysis, on karlstaging, and gives us an idea of the scale of the problem. Deliberately vague statement of the work, as you need to use some judgement.

Tags: r3.127
Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Let's see if we can get a decent console script together to collect some facts. I made Christian and Tres nosy on this, they can dump any historical points I left out.

Changed in karl3:
assignee: nobody → Chris Rossi (chris-archimedeanco)
importance: Undecided → Medium
milestone: none → m138
Revision history for this message
Christian Theune (ctheune) wrote :

Sorry for losing those scripts - I guess if you don't check something in and push it, then it's gone ... :)

Anyway, the biggest point for me was to simply connect those scripts directly to the PostgreSQL DB using psycopg and analyze the hell out of the raw pickle strings, instead of going through RelStorage during the diagnosis.
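
For reference, a minimal sketch of that kind of direct analysis (not Christian's lost script), assuming a history-free RelStorage schema with an object_state table and a hypothetical connection string; the marker scanned for is the '_extracted_data' attribute name mentioned later in this ticket:

    import psycopg2

    DSN = "dbname=karlstaging user=karl host=localhost"  # hypothetical connection string

    def scan_pickles(dsn=DSN, marker=b"_extracted_data"):
        # Stream raw pickles straight out of PostgreSQL, bypassing RelStorage/ZODB.
        conn = psycopg2.connect(dsn)
        cur = conn.cursor(name="pickle_scan")  # server-side cursor keeps memory flat
        cur.execute("SELECT zoid, state FROM object_state")
        hits = 0
        total_bytes = 0
        for zoid, state in cur:
            if state is None:
                continue
            raw = bytes(state)
            if marker in raw:  # cheap test: the attribute name appears in the pickle
                hits += 1
                total_bytes += len(raw)
        cur.close()
        conn.close()
        return hits, total_bytes

    if __name__ == "__main__":
        hits, total = scan_pickles()
        print("objects mentioning _extracted_data: %d, total pickle bytes: %d" % (hits, total))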

tags: added: r3.127
Changed in karl3:
status: New → In Progress
Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

We are no longer caching extracted text:

3.121 (2014-04-21)
------------------

...

- Stop caching extracted text (the '_extracted_data' attribute) for files.
  Add a script to remove the cached data from existing instances.
  See https://bugs.launchpad.net/karl3/+bug/1309688
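
The cleanup script itself isn't reproduced here; a minimal sketch of the idea, assuming folderish KARL content reachable from the site root and batched commits, might look like this:

    import transaction

    def remove_extracted_data(site, batch_size=500):
        # Walk the content tree and drop any cached '_extracted_data' attribute.
        removed = 0
        stack = [site]
        while stack:
            obj = stack.pop()
            if "_extracted_data" in getattr(obj, "__dict__", {}):
                del obj._extracted_data
                removed += 1
                if removed % batch_size == 0:
                    transaction.commit()  # keep individual transactions small
            values = getattr(obj, "values", None)
            if callable(values):
                stack.extend(values())  # descend into folderish containers
        transaction.commit()
        return removed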

Changed in karl3:
status: In Progress → Invalid
Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

As it turns out, we no longer cache the extracted text, but we haven't gone back and deleted it.

Changed in karl3:
status: Invalid → In Progress
Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

So I wrote a couple of scripts, attached, to take a look at what extracted text we have in the database. The most useful view I could come up with is a histogram showing the distribution of objects by the size of their cached data. Here's the result of that analysis:

chris@curiosity:~/proj/karl/dev$ python histogram.py
total_objects: 70093
total_bytes: 1.80g
median size: 7.21k
distribution:
  1k: 11389
  4k: 13862
 16k: 23052
 64k: 15637
256k: 4958
  1m: 1134
  4m: 59
 16m: 2

What we see here is that about 70k documents have any cached extracted text at all. Of those, the vast majority are under 64kb, with a median of about 7kb.

On the whole, this doesn't strike me as too alarming. But, indeed, we are loading some extra bytes when unpickling those objects. Since the extracted data cache is no longer being used, removing the cached data should be a clear win, reducing the total database size by about 1.8GB.
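
Not the attached script itself, but a condensed sketch of the kind of computation histogram.py performs, given an iterable of per-object cached-text sizes in bytes (which could be produced by something like the psycopg scan sketched earlier):

    import statistics

    # Power-of-4 bucket upper bounds matching the labels in the output above.
    BUCKETS = [("1k", 2**10), ("4k", 2**12), ("16k", 2**14), ("64k", 2**16),
               ("256k", 2**18), ("1m", 2**20), ("4m", 2**22), ("16m", 2**24)]

    def human(n):
        # Format a byte count with the k/m/g suffixes used in the output above.
        n = float(n)
        for unit in ("b", "k", "m", "g"):
            if n < 1024:
                return "%.2f%s" % (n, unit)
            n /= 1024.0
        return "%.2ft" % n

    def report(sizes):
        # Print totals, the median, and per-bucket counts for the given sizes.
        sizes = sorted(sizes)
        counts = [0] * len(BUCKETS)
        for size in sizes:
            for i, (_, limit) in enumerate(BUCKETS):
                if size <= limit:
                    counts[i] += 1
                    break
        print("total_objects: %d" % len(sizes))
        print("total_bytes: %s" % human(sum(sizes)))
        print("median size: %s" % human(statistics.median(sizes)))
        print("distribution:")
        for (label, _), count in zip(BUCKETS, counts):
            print("%4s: %d" % (label, count))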

The ticket where the extracted data cache was removed didn't really address the point of having the cache to begin with, namely to speed up reindexing the text index. This can be pretty useful when making changes to the text index that require reindexing. Although the text index has been pretty stable for the last couple of years, there are some changes in the pipeline that will probably require reindexing. This could be an argument for reinstating the cache. I'm happy to recuse myself from that discussion.

*If* the cache were reinstated, there are some strategies for mitigating the negative impact of the cache on everyday performance:

- The cached text could be made its own persistent object, so it isn't loaded from the ZODB and unpickled unless explicitly accessed, which should only happen when reindexing a document.

- We could also set an upper limit to the size of the cached text for any particular document. For example, if we made the upper limit 64kb, the vast majority of documents could still benefit from the reindexing speedup while particularly high impact documents could be prevented from burdening the database with extra data. Re-extracting text from this smaller set of documents would add much less time to the total time to reindex. I just calculated the amount of space that would be saved by limiting the cached text size to 64kb, and got 1.1GB, which is more than half of the total space currently used.
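
A minimal sketch of how the two mitigations could be combined, purely illustrative (ExtractedTextCache, MAX_CACHED_TEXT, and the helper functions below are not existing KARL code):

    from persistent import Persistent

    MAX_CACHED_TEXT = 64 * 1024  # the 64kb cap discussed above

    class ExtractedTextCache(Persistent):
        # Holds the extracted text in its own ZODB record, so loading the
        # document itself never pulls the text into memory.
        def __init__(self, text):
            self.text = text

    def cache_extracted_text(document, text):
        # Only cache text small enough to be worth keeping around.
        if len(text) <= MAX_CACHED_TEXT:
            document._extracted_data = ExtractedTextCache(text)
        else:
            document._extracted_data = None  # oversized: re-extract at reindex time

    def text_for_indexing(document, extract):
        # Prefer the cached text; fall back to running the extractor.
        cached = getattr(document, "_extracted_data", None)
        if cached is not None:
            return cached.text  # loads only the small cache record
        return extract(document)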

*If*, of course, there is no decision to reinstate the cache, there's not much point keeping it around.

Changed in karl3:
status: In Progress → Fix Committed
Revision history for this message
Paul Everitt (paul-agendaless) wrote : Re: [Bug 1338271] Re: Analyze OSF DB to estimate win by not caching extracted text

In theory, to the extent that we care about database size, the extracted_text inflation is more severe in the repozitory database, *if* extracted_text is even part of what gets serialized.

--Paul

Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

It's not.

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

I really like the idea of the upper limit at 64k.

Would you prefer to make a new ticket for actually doing something, or modify this ticket?

--Paul

Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

Let's make a new ticket, since it involves not only cleaning up and changing the way the cached data is stored, but also bringing back the code that maintained and used the cache to begin with.

Chris

Changed in karl3:
status: Fix Committed → Fix Released