It's not.

On Thu, Jul 10, 2014 at 11:05 AM, Paul Everitt <email address hidden> wrote:
> In theory, to the extent that we care about database size, the
> extracted_text inflation is more severe in the repozitory database, *if*
> extracted_text is even part of what gets serialized.
>
> --Paul
>
> On Jul 10, 2014, at 10:15 AM, Chris Rossi <email address hidden>
> wrote:
>
> > So I wrote a couple of scripts, attached, to take a look at what
> > extracted text we have in the database. The most useful way of looking
> > at it that I could come up with was a histogram showing the distribution
> > of objects by their cached data size. Here's the result of
> > that analysis:
> >
> > chris@curiosity:~/proj/karl/dev$ python histogram.py
> > total_objects: 70093
> > total_bytes: 1.80g
> > median size: 7.21k
> > distribution:
> > 1k: 11389
> > 4k: 13862
> > 16k: 23052
> > 64k: 15637
> > 256k: 4958
> > 1m: 1134
> > 4m: 59
> > 16m: 2
> >
> > What we see here is that there are 70k documents with any extracted text
> > at all. Of those, the vast majority are under 64kb, with a median of
> > about 7kb.
> >
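> > As a rough illustration (not the attached script itself), the bucketing
> > logic behind output like this might look as follows. The bucket
> > boundaries come from the output above; the walk over the database that
> > yields each document's cached-text size is left out:
> >
> > from collections import OrderedDict
> >
> > # Bucket boundaries matching the output above: 1k, 4k, 16k, ..., 16m.
> > BUCKETS = [2 ** n for n in range(10, 25, 2)]
> >
> > def label(size):
> >     if size >= 2 ** 20:
> >         return '%dm' % (size // 2 ** 20)
> >     return '%dk' % (size // 2 ** 10)
> >
> > def report(sizes):
> >     # `sizes` is an iterable of cached-text byte sizes, one per document,
> >     # gathered by whatever walks the documents in the database.
> >     counts = OrderedDict((b, 0) for b in BUCKETS)
> >     sizes = sorted(sizes)
> >     for size in sizes:
> >         for bucket in BUCKETS:
> >             if size <= bucket:
> >                 counts[bucket] += 1
> >                 break
> >         else:
> >             counts[BUCKETS[-1]] += 1  # clamp anything over 16m
> >     median = sizes[len(sizes) // 2] if sizes else 0
> >     print('total_objects: %d' % len(sizes))
> >     print('total_bytes: %.2fg' % (sum(sizes) / float(2 ** 30)))
> >     print('median size: %.2fk' % (median / float(2 ** 10)))
> >     print('distribution:')
> >     for bucket, count in counts.items():
> >         print('  %s: %d' % (label(bucket), count))
> >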
> > On the whole, this doesn't strike me as too alarming. But, indeed, we
> > are loading some extra bytes when unpickling those objects. Since the
> > extracted data cache is no longer being used, removing the cached data
> > should be a clear win, reducing the total database size by about 1.8GB.
> >
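> > For illustration only, the removal itself presumably boils down to a step
> > along these lines, applied to every content object and followed by a pack
> > of the storage to actually reclaim the space on disk (the attribute name
> > here is a guess, not the real KARL identifier):
> >
> > def remove_cached_text(documents):
> >     # `documents` would come from whatever walks the content tree;
> >     # `_extracted_data` is a placeholder for the cache attribute's name.
> >     for doc in documents:
> >         if hasattr(doc, '_extracted_data'):
> >             del doc._extracted_data
> >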
> > The ticket where the extracted data cache was removed didn't really
> > address the point of having the cache to begin with, namely to speed up
> > reindexing the text index. This can be pretty useful to have when
> > making changes to the text index that require reindexing. Although the
> > text index has been pretty stable for the last couple of years, there
> > are some changes in the pipeline that will probably require reindexing.
> > This could be an argument for reinstating the cache. I'm happy to
> > recuse myself from that discussion.
> >
> > *If* the cache were reinstated, there are some strategies for mitigating
> > the cache's negative impact on everyday performance (a rough sketch of
> > both follows the list):
> >
> > - The cached text could be made its own persistent object, so it isn't
> > loaded from the ZODB and unpickled unless explicitly accessed, which
> > should only happen when reindexing a document.
> >
> > - We could also set an upper limit to the size of the cached text for
> > any particular document. For example, if we made the upper limit 64kb,
> > the vast majority of documents could still benefit from the reindexing
> > speedup, while particularly high-impact documents could be prevented from
> > burdening the database with extra data. Re-extracting text from this
> > smaller set of documents would add much less to the total reindexing
> > time. I just calculated the amount of space that would be saved by
> > limiting the cached text size to 64kb, and got 1.1GB, which is more than
> > half of the total space currently used.
> >
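> > A rough sketch of how those two mitigations might fit together (this is
> > not existing KARL code; the attribute and helper names are made up):
> >
> > from persistent import Persistent
> >
> > MAX_CACHED_TEXT = 64 * 1024  # the 64kb upper limit discussed above
> >
> > class CachedText(Persistent):
> >     # Lives in its own database record, so it is only loaded from the
> >     # ZODB when the attribute holding it is actually accessed.
> >     def __init__(self, text):
> >         self.text = text
> >
> > def cache_extracted_text(document, text):
> >     # Only cache documents under the limit; oversized ones fall back to
> >     # re-extraction at reindex time.
> >     if len(text) <= MAX_CACHED_TEXT:
> >         document._extracted_text = CachedText(text)
> >     else:
> >         document._extracted_text = None
> >
> > def text_for_indexing(document, extract):
> >     # `extract` stands in for whatever re-extracts text from the original
> >     # HTML/Office/PDF content when there is no usable cache.
> >     cached = getattr(document, '_extracted_text', None)
> >     if cached is not None:
> >         return cached.text  # only now is the CachedText record loaded
> >     return extract(document)
> >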
> > *If*, of course, there is no decision to reinstate the cache, there's
> > not much point keeping it around.
> >
> > ** Changed in: karl3
> > Status: In Progress => Fix Committed
> >
>
> --
> You received this bug notification because you are a bug assignee.
> https://bugs.launchpad.net/bugs/1338271
>
> Title:
> Analyze OSF DB to estimate win by not caching extracted text
>
> Status in KARL3:
> Fix Committed
>
> Bug description:
> At PyCon, Christian was trying to analyze memory spikes and usage for
> the ZODB cache. We were/are having trouble getting a stable ZODB cache
> size on object counts. A number that is steady for weeks suddenly
> spikes.
>
> Christian noted that we had some objects in cache that were way too
> big. Investigation showed that they had the old "extracted text" hack
> we did, where we keep a copy of the extracted content from
> HTML/Office/PDF etc. to speed up reindexing on evolves etc.
>
> For this task, write a console script that does this same analysis, on
> karlstaging, and gives us an idea of the scale of the problem.
> Deliberately vague statement of the work, as you need to use some
> judgement.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/karl3/+bug/1338271/+subscriptions
>