Ubuntu Translations

transliterate text/use collation before adding to xapian db and when searching

Bug #744914 reported by Lucian Adrian Grijincu on 2011-03-29

This bug affects 1 person

Affects	Status	Importance	Assigned to
Ubuntu Translations	New	Undecided	Unassigned
software-center (Ubuntu)	Triaged	Medium	Unassigned
software-center (Ubuntu Precise)	Won't Fix	Medium	Unassigned

Bug Description

Binary package hint: software-center

As of now software center uses str.lower() when searching in the xapian db:

utils/query.py
22: s = search_term.lower()
33: query = xapian.Query(str_to_prefix[search_prefix]+search_term.lower())

There are two problems with this:
* many languages use diacritical marks, but users typing quickly often enter only the base characters (in Romanian, ăâșțî and ĂÂȘȚÎ are typed as aasti and AASTI by some users).

* characters in Unicode can appear in two forms, composed and decomposed: the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA), with the base letter first and the combining mark after it.
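A small sketch of the second problem, using only the standard unicodedata module. It assumes Python 3 (where str is Unicode); the variable names are illustrative and are not taken from software-center.

```python
import unicodedata

composed = "\u00C7"          # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = "\u0043\u0327"  # LATIN CAPITAL LETTER C + COMBINING CEDILLA

# Both strings render the same, but they differ as code point sequences.
same_raw = composed == decomposed

# Normalizing both sides to one form (NFC here) makes them compare equal.
same_nfc = (unicodedata.normalize("NFC", composed)
            == unicodedata.normalize("NFC", decomposed))

# lower() changes case only: diacritics and normalization form are kept.
lowered = "ȘTIRI".lower()

print(same_raw, same_nfc, lowered)
```

This is why lower() alone cannot match a query typed as "stiri" against an indexed "Știri".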

To solve both problems, both the text stored in the xapian db and the user's search query must be normalized in the same way.
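A minimal sketch of that idea, assuming Python 3. A plain dict stands in for the xapian db here, and fold(), build_index() and search() are hypothetical helpers, not functions from software-center or xapian. The point is only that the same folding function runs at index time and at query time.

```python
import unicodedata


def fold(text):
    """Return a search key: case-folded, decomposed, combining marks removed."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def build_index(names):
    """Map each folded term to the set of original names containing it."""
    index = {}
    for name in names:
        for term in name.split():
            index.setdefault(fold(term), set()).add(name)
    return index


def search(index, query):
    """Fold the query exactly like the indexed terms, then look it up."""
    return index.get(fold(query), set())


# Toy data standing in for package names in the db.
index = build_index(["Știri România", "Calculator"])
print(search(index, "stiri"))
print(search(index, "ROMÂNIA"))
```

With a real xapian db the folded terms would be what gets indexed and what goes into the query, while the original text is kept for display.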

The search function in Chromium uses ICU rules to achieve this:
- http://code.google.com/p/chromium/issues/detail?id=1100
- http://www.google.com/codesearch/p?hl=en#OAMlx_jo-ck/src/third_party/WebKit/Source/WebCore/editing/TextIterator.cpp&q=file:TextIterator.cpp&l=1882

There is a python-icu library that could help achieve this. See for example http://lists.osafoundation.org/pipermail/pyicu-dev/2010-October/000214.html

Or one could just remove the diacritical marks from the string altogether: http://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string
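A sketch of the mark-stripping approach, in the spirit of the question linked above but not copied from it, assuming Python 3. One caveat: letters such as ł, ø and đ have no Unicode decomposition, so stripping combining marks leaves them unchanged. The NO_DECOMPOSITION table below is a hypothetical fallback for illustration and is far from complete.

```python
import unicodedata

# Hypothetical fallback for letters that do not decompose into base + mark.
NO_DECOMPOSITION = str.maketrans({
    "ł": "l", "Ł": "L",
    "ø": "o", "Ø": "O",
    "đ": "d", "Đ": "D",
})


def strip_accents(text):
    """Decompose, drop nonspacing marks (category Mn), then apply the fallback."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return without_marks.translate(NO_DECOMPOSITION)


print(strip_accents("ăâșțî ĂÂȘȚÎ"))
print(strip_accents("Łódź"))
```

Stripping marks is lossy and language-blind, which is the trade-off against the ICU collation route mentioned earlier.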

Lucian Adrian Grijincu (lucian.grijincu) on 2011-03-29

description:	updated

Matthew Paul Thomas (mpt) wrote on 2011-09-21:

This looks like a reasonable suggestion. Can you give an example of a search that would produce better results if this was implemented? That would help in prioritizing it.

Kiwinote (kiwinote) on 2011-09-27

tags:	added: db

Michael Vogt (mvo) on 2011-10-07

Changed in software-center (Ubuntu):
status:	New → Confirmed
importance:	Undecided → Medium
Changed in software-center (Ubuntu Precise):
status:	New → Confirmed
importance:	Undecided → Medium

Pedro Villavicencio (pedro) on 2011-11-14

Changed in software-center (Ubuntu Precise):
status:	Confirmed → Triaged

Steve Langasek (vorlon) wrote on 2021-10-14:

The Precise Pangolin has reached end of life, so this bug will not be fixed for that release.

Changed in software-center (Ubuntu Precise):
status:	Triaged → Won't Fix
