be smart at guessing encoding
Bug #57985 reported by David Allouche
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Bazaar GTK+ Frontends | Confirmed | Medium | Unassigned | | |
Bug Description
Bug 44677 is fixed by decoding with errors=replace, but the GUI could be smarter and more helpful when dealing with non-utf8 encodings.
Changed in bzr-gtk: | |
importance: | Untriaged → Medium |
status: | Unconfirmed → Confirmed |
Changed in bzrk: | |
importance: | Untriaged → Medium |
status: | Unconfirmed → Confirmed |
tags: | added: diff encoding |
As discussed with LartiQ in #bzr, I think bzr-gtk should be smarter at guessing how to decode arbitrary file contents. Generally, the logic should look like:
1. Look for a BOM. If we find one, we can be confident that the encoding is utf-something. BOMs are normally found in utf-16 and utf-32 files, but LartiQ reports that one is sometimes used in utf-8 documents as well (although it makes no sense there, since utf-8 fixes the byte ordering).
2. Try decoding with utf-8. I do not know of any other encoding/language that normally (in non-pathological documents) produces data that is also valid utf-8.
3. Optionally, more heuristics. Some text editors look for patterns in the document to guess the encoding. I believe emacs has some magic of that sort.
4. Try the locale encoding, as provided by locale.getpreferredencoding() (per j-a-meinel)
5. If that still does not work, decode('ascii', 'replace').
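A minimal sketch of the steps above, not bzr-gtk's actual code. It assumes Python 3 (bytes in, str out); the function name `guess_decode`, its `locale_encoding` parameter and the `_BOMS` table are hypothetical names chosen for illustration. Step 3 (content heuristics) is left out.

```python
import codecs
import locale

# BOM signatures, 4-byte ones first: the UTF-32 LE BOM begins with the
# same two bytes as the UTF-16 LE BOM, so order matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_decode(data, locale_encoding=None):
    """Return (text, encoding_name) for a bytes object (hypothetical helper)."""
    # Step 1: a BOM identifies the utf-* family; these codecs consume the BOM.
    for bom, name in _BOMS:
        if data.startswith(bom):
            try:
                return data.decode(name), name
            except UnicodeDecodeError:
                break  # BOM-like prefix but invalid body: keep guessing
    # Step 2: strict utf-8.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # Step 3 (pattern-based heuristics) is omitted in this sketch.
    # Step 4: the locale encoding.
    if locale_encoding is None:
        locale_encoding = locale.getpreferredencoding(False)
    try:
        return data.decode(locale_encoding), locale_encoding
    except (UnicodeDecodeError, LookupError):
        pass
    # Step 5: last resort, never fails.
    return data.decode("ascii", "replace"), "ascii"
```

Returning the encoding name alongside the text lets the caller show which guess was used.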
FINALLY: always display a control that shows the encoding, and provides direct user control to override the automatic detection. The choices should at least include utf-8, the locale encoding, and explicit input of any arbitrary encoding supported by Python. Optionally, the choices could include a list of user-configurable favourite encodings.
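One possible way to build the list of choices for such a control, again only a sketch assuming Python 3. The names `encoding_choices`, `is_supported_encoding` and the `favourites` parameter are hypothetical; `codecs.lookup` is used to check that Python has a codec for a name and to normalise aliases.

```python
import codecs
import locale


def is_supported_encoding(name):
    """True if Python has a codec registered under this name or alias."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


def encoding_choices(favourites=()):
    """Return canonical codec names to offer: utf-8, locale, then favourites."""
    candidates = ["utf-8", locale.getpreferredencoding(False), *favourites]
    choices = []
    for name in candidates:
        try:
            canonical = codecs.lookup(name).name
        except LookupError:
            continue  # skip names Python has no codec for
        if canonical not in choices:
            choices.append(canonical)  # aliases collapse to a single entry
    return choices
```

Free-form input from the user could be validated with `is_supported_encoding` before the file contents are decoded again.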