bzr search can't index non-ascii text
Bug #383102 reported by
Alexander Belchenko
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
bzr search plugin | Triaged | High | Unassigned | |
Bug Description
cp1251 strings in indexed content are ignored by the indexing process.
status triaged
importance medium
> As you can see, search tries to find unicode text in the plain file
> (cp1251 encoded). It could never match.
So, if the file content was utf8 it would be fine. Is there some way
bzr-search can determine the encoding of the file at the time it indexes
it? I know we can use the BOM for unicode text files. Perhaps there is a
library out there that can do a good job.
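A rough sketch of the BOM idea, assuming we are handed the raw file bytes: check for a BOM first, then try a short list of candidate encodings. The function name `decode_for_indexing` and the fallback list are made up for illustration; nothing like this exists in bzr-search today.

```python
import codecs

# BOMs are checked longest-first so a UTF-32 LE BOM is not mistaken
# for a UTF-16 LE one (the former starts with the latter).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_for_indexing(data, fallbacks=("utf-8", "cp1251")):
    """Hypothetical helper: return unicode text for raw file bytes.

    The fallback order is an assumption; a real plugin would
    presumably take it from configuration.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # These codecs consume the BOM while decoding.
            return data.decode(encoding)
    for encoding in fallbacks:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte, so this never raises.
    return data.decode("latin-1")
```

This is only a heuristic. Single-byte encodings such as cp1251 accept almost any byte sequence, so the order of the fallback list matters, and a cp1251 file that happens to start with bytes that look like a BOM would be misread. A statistical detector such as the chardet library could stand in for the fallback list.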
bzr-search needs a fixed index it can look things up in quickly, so it
needs to generate unicode terms from the files it indexes. To date it's
been pretty simplistic and has assumed all content is utf8, which is
clearly not true :).
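To illustrate why the utf8 assumption loses cp1251 content, here is a minimal sketch of turning file bytes into unicode terms. The tokenizer (lowercased `\w+` runs) and the name `terms_from_bytes` are assumptions for the example, not bzr-search's actual term generation.

```python
import re

_WORD = re.compile(r"\w+")


def terms_from_bytes(data, encoding="utf-8"):
    """Hypothetical helper: decode bytes and return a set of lowercase terms.

    Undecodable bytes become U+FFFD instead of raising, which mimics an
    indexer that silently skips content it cannot read.
    """
    text = data.decode(encoding, errors="replace")
    return {match.group(0).lower() for match in _WORD.finditer(text)}


raw = "Привет мир".encode("cp1251")

# Decoded with the wrong assumption, the Russian words never become terms.
assumed_utf8 = terms_from_bytes(raw)

# Decoded with the right encoding, they do.
as_cp1251 = terms_from_bytes(raw, encoding="cp1251")
```

With the wrong encoding, a unicode query for "привет" has nothing in the index to match against. With the right one, the term is present.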
-Rob