It's not.

On Thu, Jul 10, 2014 at 11:05 AM, Paul Everitt <email address hidden> wrote:
> In theory, to the extent that we care about database size, the
> extracted_text inflation is more severe in the repozitory database, *if*
> extracted_text is even part of what gets serialized.
>
> --Paul
>
> On Jul 10, 2014, at 10:15 AM, Chris Rossi <email address hidden>
> wrote:
>
> > So I wrote a couple of scripts, attached, to take a look at what
> > extracted text we have in the database. The most useful way of looking
> > at it that I could come up with was a histogram showing the distribution
> > of objects by their cached data size. Here's the result of
> > that analysis:
> >
> > chris@curiosity:~/proj/karl/dev$ python histogram.py
> > total_objects: 70093
> > total_bytes: 1.80g
> > median size: 7.21k
> > distribution:
> > 1k: 11389
> > 4k: 13862
> > 16k: 23052
> > 64k: 15637
> > 256k: 4958
> > 1m: 1134
> > 4m: 59
> > 16m: 2
> >
> > What we see here is that there are 70k documents with any extracted text
> > at all. Of those, the vast majority are under 64kb, with a median of
> > about 7kb.
> >
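> > As a rough illustration (not the attached script itself), the bucketing
> > logic behind output like this might look as follows. The bucket
> > boundaries come from the output above; the walk over the database that
> > yields each document's cached-text size is left out:
> >
> > from collections import OrderedDict
> >
> > # Bucket boundaries matching the output above: 1k, 4k, 16k, ..., 16m.
> > BUCKETS = [2 ** n for n in range(10, 25, 2)]
> >
> > def label(size):
> >     if size >= 2 ** 20:
> >         return '%dm' % (size // 2 ** 20)
> >     return '%dk' % (size // 2 ** 10)
> >
> > def report(sizes):
> >     # `sizes` is an iterable of cached-text byte sizes, one per document,
> >     # gathered by whatever walks the documents in the database.
> >     counts = OrderedDict((b, 0) for b in BUCKETS)
> >     sizes = sorted(sizes)
> >     for size in sizes:
> >         for bucket in BUCKETS:
> >             if size <= bucket:
> >                 counts[bucket] += 1
> >                 break
> >         else:
> >             counts[BUCKETS[-1]] += 1  # clamp anything over 16m
> >     median = sizes[len(sizes) // 2] if sizes else 0
> >     print('total_objects: %d' % len(sizes))
> >     print('total_bytes: %.2fg' % (sum(sizes) / float(2 ** 30)))
> >     print('median size: %.2fk' % (median / float(2 ** 10)))
> >     print('distribution:')
> >     for bucket, count in counts.items():
> >         print('  %s: %d' % (label(bucket), count))
> >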
> > On the whole, this doesn't strike me as too alarming. But, indeed, we
> > are loading some extra bytes when unpickling those objects. Since the
> > extracted data cache is no longer being used, removing the cached data
> > should be a clear win, reducing the total database size by about 1.8GB.
> >
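> > For illustration only, the removal itself presumably boils down to a step
> > along these lines, applied to every content object and followed by a pack
> > of the storage to actually reclaim the space on disk (the attribute name
> > here is a guess, not the real KARL identifier):
> >
> > def remove_cached_text(documents):
> >     # `documents` would come from whatever walks the content tree;
> >     # `_extracted_data` is a placeholder for the cache attribute's name.
> >     for doc in documents:
> >         if hasattr(doc, '_extracted_data'):
> >             del doc._extracted_data
> >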
> > The ticket where the extracted data cache was removed didn't really
> > address the point of having the cache to begin with, namely to speed up
> > reindexing the text index. This can be pretty useful to have when
> > making changes to the text index that require reindexing. Although the
> > text index has been pretty stable for the last couple of years, there
> > are some changes in the pipeline that will probably require reindexing.
> > This could be an argument for reinstating the cache. I'm happy to
> > recuse myself from that discussion.
> >
> > *If* the cache were reinstated, there are some strategies for mitigating
> > the cache's negative impact on everyday performance (a rough sketch of
> > both follows the list):
> >
> > - The cached text could be made its own persistent object, so it isn't
> > loaded from the ZODB and unpickled unless explicitly accessed, which
> > should only happen when reindexing a document.
> >
> > - We could also set an upper limit to the size of the cached text for
> > any particular document. For example, if we made the upper limit 64kb,
> > the vast majority of documents could still benefit from the reindexing
> > speedup, while particularly high-impact documents could be prevented from
> > burdening the database with extra data. Re-extracting text from this
> > smaller set of documents would add much less to the total reindexing
> > time. I just calculated the amount of space that would be saved by
> > limiting the cached text size to 64kb, and got 1.1GB, which is more than
> > half of the total space currently used.
> >
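> > A rough sketch of how those two mitigations might fit together (this is
> > not existing KARL code; the attribute and helper names are made up):
> >
> > from persistent import Persistent
> >
> > MAX_CACHED_TEXT = 64 * 1024  # the 64kb upper limit discussed above
> >
> > class CachedText(Persistent):
> >     # Lives in its own database record, so it is only loaded from the
> >     # ZODB when the attribute holding it is actually accessed.
> >     def __init__(self, text):
> >         self.text = text
> >
> > def cache_extracted_text(document, text):
> >     # Only cache documents under the limit; oversized ones fall back to
> >     # re-extraction at reindex time.
> >     if len(text) <= MAX_CACHED_TEXT:
> >         document._extracted_text = CachedText(text)
> >     else:
> >         document._extracted_text = None
> >
> > def text_for_indexing(document, extract):
> >     # `extract` stands in for whatever re-extracts text from the original
> >     # HTML/Office/PDF content when there is no usable cache.
> >     cached = getattr(document, '_extracted_text', None)
> >     if cached is not None:
> >         return cached.text  # only now is the CachedText record loaded
> >     return extract(document)
> >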
> > *If*, of course, there is no decision to reinstate the cache, there's
> > not much point keeping it around.
> >
> > ** Changed in: karl3
> > Status: In Progress => Fix Committed
> >
>
> --
> You received this bug notification because you are a bug assignee.
> https://bugs.launchpad.net/bugs/1338271
>
> Title:
> Analyze OSF DB to estimate win by not caching extracted text
>
> Status in KARL3:
> Fix Committed
>
> Bug description:
> At PyCon, Christian was trying to analyze memory spikes and usage for
> the ZODB cache. We were/are having trouble getting a stable ZODB cache
> size on object counts. A number that is steady for weeks suddenly
> spikes.
>
> Christian noted that we had some objects in cache that were way too
> big. Investigation showed that they had the old "extracted text" hack
> we did, where we keep a copy of the extracted content from
> HTML/Office/PDF etc. to speed up reindexing on evolves etc.
>
> For this task, write a console script that does this same analysis, on
> karlstaging, and gives us an idea of the scale of the problem.
> Deliberately vague statement of the work, as you need to use some
> judgement.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/karl3/+bug/1338271/+subscriptions
>