Open Library

solr treats diacriticals as word breaks

Bug #389217 reported by solrize on 2009-06-18

This bug affects 1 person

Affects        Status     Importance  Assigned to   Milestone
Open Library   Confirmed  Low         Edward Betts

Bug Description

Found by Karen:

Perhaps I'd like to find books about astronauts who practice eastern religion:

http://openlibrary.org/search?q=nasa+jainism

This instead finds a bunch of Indian names that contain the letters "nasa" starting in the middle of a word, but preceded by an accented letter. Need to check that we're using the right solr input tokenizer. Unicode normalization may also figure into this.
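
A minimal Python sketch of how this kind of match could arise, assuming the indexed text is stored in decomposed form and the tokenizer breaks on anything that is not a letter or digit. The `naive_tokenize` function and the sample word are hypothetical illustrations, not the actual solr analyzer or real Open Library data.

```python
import unicodedata

def naive_tokenize(text):
    """Hypothetical tokenizer: keeps letters and digits, breaks on everything
    else, including combining marks (Unicode category Mn)."""
    tokens, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in ("L", "N"):
            current.append(ch.lower())
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

# Made-up transliterated word with an accented letter right before "nasa".
word = "J\u00f1\u0101nas\u0101gara"

decomposed = unicodedata.normalize("NFD", word)  # letter + combining mark
composed = unicodedata.normalize("NFC", word)    # single accented code points

decomposed_tokens = naive_tokenize(decomposed)
composed_tokens = naive_tokenize(composed)

print(decomposed_tokens)  # the combining marks act as word breaks
print(composed_tokens)    # the word stays in one piece
```

Under these assumptions the decomposed form yields a stray "nasa" token, while the composed form does not.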

solrize (solrize) on 2009-06-18

Changed in openlibrary:
assignee:	nobody → solrize (solrize)
importance:	Undecided → Low
status:	New → Confirmed

Karen Coyle (kcoyle) wrote on 2009-06-19:

#1

At least two other tokenizing problems:

Subject headings with ampersands, e.g. "sports & recreation", retrieve zero results when clicked on. Note that there are headings with ampersands but no surrounding spaces ("Sports&Recreation"), and these can be retrieved by putting the two words together without the ampersand ("sportsrecreation"). That search does not retrieve the ones with spaces around the ampersand.

Some subject headings with slashes have the same problem, e.g. "Children's Books/Ages 4-8 Fiction". However, others, e.g. "Health/Fitness", work fine. The search "healthfitness" retrieves books with "health/fitness".

It is totally unclear to me how solr tokenizes.

solrize (solrize) wrote on 2009-06-19:

#2

The issue with ampersands in those subject links is unrelated. It's caused by thingrepr producing unicode with escaped entities, which hash into different facet tokens than the unescaped versions, so the links don't find anything. That is discussed in #378841.
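
A rough Python sketch of that mismatch, for illustration only. The `facet_token` function and its hashing scheme are assumptions made up here. The thread does not show how thingrepr or the facet code actually compute tokens.

```python
import hashlib
import html

def facet_token(heading):
    """Hypothetical facet key: a hash of the heading's UTF-8 bytes."""
    return hashlib.md5(heading.encode("utf-8")).hexdigest()

escaped = "sports &amp; recreation"   # what an entity-escaping repr might emit
unescaped = html.unescape(escaped)    # "sports & recreation"

# The two spellings hash to different keys, so a link built from one
# cannot find a facet that was indexed under the other.
tokens_differ = facet_token(escaped) != facet_token(unescaped)
print(unescaped, tokens_differ)
```

Unescaping (or escaping) consistently on both the indexing side and the link side would make the keys agree, whichever hashing scheme is really in use.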

Karen Coyle (kcoyle) wrote on 2009-06-19:

#3

How solr tokenizes: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters -- the part about whitespace is pretty clear, but I'm still not sure what solr does with slashes or ampersands.

One of the problems could be that we are using the de-composed Unicode forms while solr expects pre-composed. In that case we would have a letter followed by a separate accent character, rather than the two combined in a single Unicode character. We may want to switch to pre-composed if that is the case.

Pre-composed could present some problems for transliterations (which is what we see with the nasa+jainism case) -- sometimes there isn't a pre-composed equivalent because the transliterations are artificial. But we'd probably still get better search results for most cases.
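
To make the two forms concrete, here is a small Python sketch using the standard `unicodedata` module. NFD is the de-composed form and NFC the pre-composed one. The sample strings and the `leftover_marks` helper are illustrative assumptions, not taken from the Open Library data or code.

```python
import unicodedata

# 'a' followed by COMBINING MACRON: two code points in de-composed form.
decomposed_a = "a\u0304"
composed_a = unicodedata.normalize("NFC", decomposed_a)  # one code point, U+0101

# 'r' followed by COMBINING RING BELOW, as used in some transliteration
# schemes. Unicode has no single pre-composed character for this pair,
# so NFC leaves it as two code points.
r_ring = "r\u0325"
r_ring_nfc = unicodedata.normalize("NFC", r_ring)

def leftover_marks(text):
    """Hypothetical audit helper: combining marks that survive NFC."""
    nfc = unicodedata.normalize("NFC", text)
    return [ch for ch in nfc if unicodedata.category(ch) == "Mn"]

print(len(decomposed_a), len(composed_a))
print(len(r_ring_nfc), leftover_marks(r_ring))
```

A helper like `leftover_marks` could be run over a sample of records to estimate how many transliterated strings would still contain separate accent characters after a switch to pre-composed form.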

Karen Coyle (kcoyle) wrote on 2009-06-30:

#4

Note: Edward already has re-normalizing the data to pre-composed Unicode on his list, so that will solve the nasa + jainism problem.

solrize (solrize) wrote on 2009-06-30:

#5

Karen, the issue with the ampersands in the subject links is discussed in bug #378841. It has absolutely nothing to do with solr tokenization.

I agree that switching normalization will help with these diacriticals and probably with some other issues.

George (george-archive) wrote on 2010-02-17:

#6

Edward - thoughts?

If needed, keep open.

Changed in openlibrary:
assignee:	solrize (solrize) → Edward Betts (edwardbetts)
