Comment 12 for bug 1338271

Chris Rossi (chris-archimedeanco) wrote : Re: [Bug 1338271] Re: Analyze OSF DB to estimate win by not caching extracted text

Let's make a new ticket, since it involves not only cleaning up and
changing the way cached data is stored, but also bringing back the code
that was maintaining and using the cache to begin with.

Chris

On Thu, Jul 10, 2014 at 11:33 AM, Paul Everitt <email address hidden> wrote:

> I really like the idea of the upper limit at 64k.
>
> Would you prefer to make a new ticket for actually doing something, or
> modify this ticket?
>
> --Paul
>
> On Jul 10, 2014, at 11:22 AM, Chris Rossi <email address hidden>
> wrote:
>
> > It's not.
> >
> >
> > On Thu, Jul 10, 2014 at 11:05 AM, Paul Everitt <email address hidden>
> wrote:
> >
> >> In theory, to the extent that we care about database size, the
> >> extracted_text inflation is more severe in the repozitory database, *if*
> >> extracted_text is even part of what gets serialized.
> >>
> >> --Paul
> >>
> >> On Jul 10, 2014, at 10:15 AM, Chris Rossi <email address hidden>
> >> wrote:
> >>
> >>> So I wrote a couple of scripts, attached, to take a look at what
> >>> extracted text we have in the database. The most useful way of looking
> >>> at it that I could figure was to look at a histogram that shows
> >>> distribution of objects by their cached data size. Here's the result
> >>> of that analysis:
> >>>
> >>> chris@curiosity:~/proj/karl/dev$ python histogram.py
> >>> total_objects: 70093
> >>> total_bytes: 1.80g
> >>> median size: 7.21k
> >>> distribution:
> >>> 1k: 11389
> >>> 4k: 13862
> >>> 16k: 23052
> >>> 64k: 15637
> >>> 256k: 4958
> >>> 1m: 1134
> >>> 4m: 59
> >>> 16m: 2
> >>>
> >>> What we see here is there are 70k documents that have any extracted
> >>> text at all. Of those, the vast majority are under 64kb, with a median
> >>> of about 7kb.
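(The attached histogram.py isn't included in this thread, so as a hedged illustration only, the bucketing it reports could be reproduced along these lines, with bucket labels and limits guessed from the powers-of-four output above:)

```python
# Hypothetical sketch of the bucketing in the distribution above; the
# real histogram.py attached to the bug is not shown here. Buckets are
# powers of four, 1k through 16m, matching the printed labels.
BUCKETS = [("1k", 2**10), ("4k", 2**12), ("16k", 2**14), ("64k", 2**16),
           ("256k", 2**18), ("1m", 2**20), ("4m", 2**22), ("16m", 2**24)]

def bucket_label(size):
    """Return the label of the smallest bucket that holds `size` bytes."""
    for label, limit in BUCKETS:
        if size <= limit:
            return label
    return ">16m"

def histogram(sizes):
    """Count objects per size bucket, like the distribution printed above."""
    counts = {}
    for size in sizes:
        label = bucket_label(size)
        counts[label] = counts.get(label, 0) + 1
    return counts
```

(Note the ~7.21k median lands in the 16k bucket under this scheme, consistent with 16k being the largest bucket in the output.)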
> >>>
> >>> On the whole, this doesn't strike me as too alarming. But, indeed, we
> >>> are loading some extra bytes when unpickling those objects. Since the
> >>> extracted data cache is no longer being used, removing the cached data
> >>> should be a clear win, reducing the total database size by about 1.8GB.
> >>>
> >>> The ticket where the extracted data cache was removed didn't really
> >>> address the point of having the cache to begin with, namely to speed up
> >>> reindexing the text index. This can be pretty useful to have when
> >>> making changes to the text index that require reindexing. Although the
> >>> text index has been pretty stable for the last couple of years, there
> >>> are some changes in the pipeline that will probably require reindexing.
> >>> This could be an argument for reinstating the cache. I'm happy to
> >>> recuse myself from that discussion.
> >>>
> >>> *If* the cache were reinstated, there are some strategies for
> >>> mitigating the negative impact of the cache on everyday performance:
> >>>
> >>> - The cached text could be made its own persistent object, so it isn't
> >>> loaded from the ZODB and unpickled unless explicitly accessed, which
> >>> should only happen when reindexing a document.
> >>>
> >>> - We could also set an upper limit to the size of the cached text for
> >>> any particular document. For example, if we made the upper limit 64kb,
> >>> the vast majority of documents could still benefit from the reindexing
> >>> speedup, while particularly high-impact documents could be prevented
> >>> from burdening the database with extra data. Re-extracting text from
> >>> this smaller set of documents would add much less time to the total
> >>> time to reindex. I just calculated the amount of space that would be
> >>> saved by limiting the cached text size to 64kb, and got 1.1GB, which
> >>> is more than half of the total space currently used.
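(As a hedged sketch of that 64kb-cap estimate: the thread doesn't say whether the 1.1GB figure assumes oversized entries are dropped entirely or truncated to the limit, so both variants are shown; the function names are illustrative, not from the actual scripts:)

```python
# Hypothetical sketch of the 64kb-cap savings calculation. Whether the
# 1.1GB figure came from dropping or truncating oversized entries is not
# stated in the thread, so both variants are given.
LIMIT = 64 * 1024

def savings_if_dropped(sizes, limit=LIMIT):
    """Bytes saved if entries larger than `limit` cache no text at all."""
    return sum(size for size in sizes if size > limit)

def savings_if_truncated(sizes, limit=LIMIT):
    """Bytes saved if oversized entries are truncated to `limit`."""
    return sum(size - limit for size in sizes if size > limit)
```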
> >>>
> >>> *If*, of course, there is no decision to reinstate the cache, there's
> >>> not much point keeping it around.
> >>>
> >>> ** Changed in: karl3
> >>> Status: In Progress => Fix Committed
> >>>
> >>> --
> >>> You received this bug notification because you are subscribed to
> >>> the bug report.
> >>> https://bugs.launchpad.net/bugs/1338271
> >>>
> >>> Title:
> >>> Analyze OSF DB to estimate win by not caching extracted text
> >>>
> >>> Status in KARL3:
> >>> Fix Committed
> >>>
> >>> Bug description:
> >>> At PyCon, Christian was trying to analyze memory spikes and usage for
> >>> the ZODB cache. We were/are having trouble getting a stable ZODB cache
> >>> size on object counts. A number that is steady for weeks suddenly
> >>> spikes.
> >>>
> >>> Christian noted that we had some objects in cache that were way too
> >>> big. Investigation showed that they had the old "extracted text" hack
> >>> we did, where we keep a copy of the extracted content from
> >>> HTML/Office/PDF etc. to speed up reindexing on evolves etc.
> >>>
> >>> For this task, write a console script that does this same analysis, on
> >>> karlstaging, and gives us an idea of the scale of the problem.
> >>> Deliberately vague statement of the work, as you need to use some
> >>> judgement.
> >>>
> >>> To manage notifications about this bug go to:
> >>> https://bugs.launchpad.net/karl3/+bug/1338271/+subscriptions