Long browse entries cause index row size error
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Evergreen |
New
|
Undecided
|
Chris Sharp |
Bug Description
EG 2.12
PG 9.4
When trying to ingest a record with an extremely long title, the following error occurs:
DBD::Pg::st execute failed: ERROR: index row size 3000 exceeds maximum 2712 for index "browse_
HINT: Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
CONTEXT: SQL statement "INSERT INTO metabib.
PL/pgSQL function metabib.
SQL statement "SELECT metabib.
PL/pgSQL function biblio.
browse_
Perhaps a simple truncation of the data before indexing would be sufficient. It has the same incompleteness problem as using MD5, but at least we know we are punishing the outliers, not entries based on random chance. Even with truncation, it seems extremely unlikely for a title to have its *first* uniqueness from the set more than 2,000 characters in (if that is what the 'size' here amounts to).
tags: | added: database |
Changed in evergreen: | |
assignee: | nobody → Chris Sharp (chrissharp123) |
This isn't the first time we've run into this sort of issue. For another example, see: mfr. We truncate there (transparently, through a view and query rewriting), FWIW.
The reason we have a unique index over (sort_value, value) rather than just (value) or (sort_value) alone is because of non-filing prefixes. Two strings may have different value, er, values, but if they come from fields that allows and use non-filing characters, they may have colliding sort_value values. And, likewise, you may have to identical value values but one defines a non-filing character count and so should sort differently. So, both the raw value and the sort_value must be considered together when defining "unique".
Moving to a calculated value (MD5, SHA1, etc) would be non-good, since the index is used for searching and sorting the data.
Truncation seems reasonable, and, IMO, favoring truncation of the value column is probably preferable up to some minimum length. For instance, perhaps we truncate the sort_value column to no shorter than 2000, and give the balance of the ~2700 bytes to the value column. That allows the non-filing part of the value column to continue acting as a tie breaker, which is how things work today.
Thoughts?