solr treats diacriticals as word breaks
Bug #389217 reported by
solrize
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Open Library | Confirmed | Low | Edward Betts |
Bug Description
Found by Karen:
Perhaps I'd like to find books about astronauts who practice eastern religion:
http://
Instead, this finds a bunch of Indian names in which the letters "nasa" start in the middle of a word, immediately after an accented letter. We need to check that we're using the right Solr input tokenizer. Unicode normalization may also figure into this.
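The suspected failure mode can be illustrated with a minimal Python sketch. It is not Solr's actual analysis chain: the two tokenizers and the sample name "Mānasa" are assumptions chosen only to show how treating an accented letter as a break exposes "nasa" as its own token, and how folding diacritics before tokenizing avoids that.

```python
import re
import unicodedata


def ascii_only_tokens(text):
    """Hypothetical tokenizer: anything that is not an ASCII letter is a break."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


def fold_diacritics(text):
    """Decompose to NFKD, then drop the combining marks (accents)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def folded_tokens(text):
    """Fold diacritics first, then apply the same ASCII-only tokenizer."""
    return ascii_only_tokens(fold_diacritics(text))


# Illustrative input only: "M", a-with-macron (U+0101), then "nasa".
name = "M\u0101nasa"

print(ascii_only_tokens(name))  # the accented letter splits the word
print(folded_tokens(name))      # the word stays whole after folding
```

If the index behaves like the first function, a query for "nasa" would match the tail of such names, which is consistent with the results described above. Solr ships an ASCII folding filter that does a similar job to `fold_diacritics`; whether it belongs in this schema is something to check against the actual configuration.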
Changed in openlibrary: | |
assignee: | nobody → solrize (solrize) |
importance: | Undecided → Low |
status: | New → Confirmed |
There are at least two other tokenizing problems:
Subject headings with ampersands, e.g. "sports & recreation", retrieve zero results when clicked on. Note that there are also headings with ampersands but no surrounding spaces ("Sports&Recreation"), and these can be retrieved by searching for the two words run together without the ampersand ("sportsrecreation"). That search does not retrieve the headings with spaces around the ampersand.
Some subject headings with slashes have the same problem, e.g. "Children's Books/Ages 4-8 Fiction". However, others, e.g. "Health/Fitness", work fine. The search "healthfitness" retrieves books with "health/fitness".
It is totally unclear to me how Solr tokenizes.
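One hypothesis that fits part of these observations is that tokens are split on whitespace and punctuation inside a token is simply deleted. The Python sketch below models only that guess; `concat_tokens` is a hypothetical helper, not Solr's tokenizer, and it does not explain why clicking "sports & recreation" returns nothing.

```python
import re


def concat_tokens(heading):
    """Hypothetical model: split on whitespace, then delete punctuation
    inside each chunk so the surrounding letters run together."""
    tokens = []
    for chunk in heading.lower().split():
        cleaned = re.sub(r"[^a-z0-9]", "", chunk)
        if cleaned:  # a lone "&" cleans to "" and is dropped
            tokens.append(cleaned)
    return tokens


for heading in ("Sports&Recreation", "sports & recreation", "Health/Fitness"):
    print(heading, "->", concat_tokens(heading))
```

Under this model, the unspaced heading and "Health/Fitness" each index as a single run-together token, while the spaced heading indexes as two separate words, which would explain why "sportsrecreation" finds one form but not the other. Confirming or ruling this out requires looking at the analyzer configured for the subject field.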