solr treats diacriticals as word breaks

Bug #389217 reported by solrize
This bug affects 1 person
Affects: Open Library
Status: Confirmed
Importance: Low
Assigned to: Edward Betts

Bug Description

Found by Karen:

Perhaps I'd like to find books about astronauts who practice eastern religion:

http://openlibrary.org/search?q=nasa+jainism

This instead finds a bunch of Indian names that contain the letters "nasa" starting in the middle of a word, but preceded by an accented letter. Need to check that we're using the right solr input tokenizer. Unicode normalization may also figure into this.
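A minimal sketch of the suspected failure mode. It assumes the indexed text is stored decomposed (base letter plus combining mark) and that the tokenizer breaks on anything that is not a letter or digit. The `tokenize` function is a stand-in written for illustration, not Solr's tokenizer, and the name is a made-up example.

```python
import unicodedata

def tokenize(text):
    """Split on every character that is not a letter or digit.

    Illustrative stand-in only; this is not Solr's tokenizer.
    """
    tokens, current = [], []
    for ch in text:
        if ch.isalnum():
            current.append(ch.lower())
        elif current:
            # A combining mark is not alphanumeric, so it ends the token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

# Hypothetical name, stored decomposed: "a" followed by U+0304 COMBINING MACRON.
decomposed = "Vardhama\u0304nasa\u0304gara"
composed = unicodedata.normalize("NFC", decomposed)

print(tokenize(decomposed))  # ['vardhama', 'nasa', 'gara']
print(tokenize(composed))    # ['vardhamānasāgara']
```

With the decomposed form, "nasa" becomes a token of its own, which would explain the spurious hits. With the pre-composed form the name stays in one piece.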

solrize (solrize)
Changed in openlibrary:
assignee: nobody → solrize (solrize)
importance: Undecided → Low
status: New → Confirmed
Karen Coyle (kcoyle) wrote :

at least two other tokenizing problems:

subject headings with ampersands, e.g. "sports & recreation", retrieve zero results when clicked on. Note that there are headings with ampersands but no surrounding spaces ("Sports&Recreation"), and these can be retrieved by putting together the two words without the ampersand ("sportsrecreation"). That search does not retrieve the headings with spaces around the ampersand.

some subject headings with slashes have this same problem, e.g. "Children's Books/Ages 4-8 Fiction". However, others, e.g. "Health/Fitness" work fine. The search "healthfitness" retrieves books with "health/fitness".

totally unclear to me how solr tokenizes.

solrize (solrize) wrote :

The issue with ampersands in those subject links is unrelated. It's caused by thingrepr making unicode with escaped entities, which hash into different facet tokens than the unescaped versions, so the links don't find anything. That is discussed in #378841.

Karen Coyle (kcoyle) wrote :

How solr tokenizes: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters -- the part about whitespace is pretty clear, but I'm still not sure what solr does with slashes or ampersands.

One of the problems could be if we are using the de-composed Unicode forms, and solr expects pre-composed. So we would have a letter followed by an accent character, rather than have the two combined in a single unicode character. We may want to switch to pre-composed if that is the case.
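To make the two forms concrete, here is a small Python sketch using the standard `unicodedata` module. The helper names `code_points` and `is_precomposed` are made up for illustration; whether our data is actually stored decomposed is still an assumption to verify.

```python
import unicodedata

def code_points(s):
    """List the code points in s, e.g. ['U+0061', 'U+0304']."""
    return [f"U+{ord(ch):04X}" for ch in s]

def is_precomposed(s):
    """True if s is already in NFC, the pre-composed normalization form."""
    return unicodedata.normalize("NFC", s) == s

decomposed = "a\u0304"  # letter "a" followed by COMBINING MACRON
precomposed = unicodedata.normalize("NFC", decomposed)

print(code_points(decomposed))   # ['U+0061', 'U+0304']
print(code_points(precomposed))  # ['U+0101']
print(is_precomposed(decomposed), is_precomposed(precomposed))  # False True
```

Running `is_precomposed` over a sample of indexed titles and author names would show which form the data is in before committing to a re-normalization.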

Pre-composed could present some problems for transliterations (which is what we see with the nasa+jainism case) -- sometimes there isn't a pre-composed equivalent because the transliterations are artificial. But we'd probably still get better search results for most cases.
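A short sketch of that limitation, again with Python's `unicodedata`. The helper `leftover_marks` is hypothetical, and the two transliteration sequences are examples chosen because Unicode has no single pre-composed character for them.

```python
import unicodedata

def leftover_marks(s):
    """Return the combining marks that remain after NFC normalization."""
    composed = unicodedata.normalize("NFC", s)
    # unicodedata.combining() is non-zero for combining marks.
    return [f"U+{ord(ch):04X}" for ch in composed if unicodedata.combining(ch)]

print(leftover_marks("a\u0304"))  # []          a + macron composes to U+0101
print(leftover_marks("r\u0325"))  # ['U+0325']  r + ring below has no pre-composed form
print(leftover_marks("m\u0310"))  # ['U+0310']  m + candrabindu has no pre-composed form
```

Any sequence that still reports leftover marks would keep a separate combining character after re-normalization, so those words could still be split by the tokenizer.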

Karen Coyle (kcoyle) wrote :

Note: Edward already has on his list to re-normalize the data to switch to pre-composed unicode, so that should solve the nasa+jainism problem.

solrize (solrize) wrote :

Karen, the issue with the ampersands in the subject links is discussed in bug #378841. It has absolutely nothing to do with solr tokenization.

I agree that switching normalization will help with these diacriticals and probably with some other issues.

George (george-archive) wrote :

Edward - thoughts?

If needed, keep open.

Changed in openlibrary:
assignee: solrize (solrize) → Edward Betts (edwardbetts)