Comment 1 for bug 57985

David Allouche (ddaa) wrote :

As discussed with LartiQ in #bzr, I think bzr-gtk should be smarter at guessing how to decode arbitrary file contents. Generally, the logic should look like:

1. Look for a BOM. If we find one, we can be confident that the encoding is utf-something. BOMs are normally found in utf-16 and utf-32 files, but LartiQ reports that they are sometimes used in utf-8 documents as well (although that makes no sense, since utf-8 fixes the byte ordering).

2. Try decoding with utf-8. I do not know of any other encoding/language that normally (in non-pathological documents) produces data that is valid utf-8, so a successful decode is a strong sign that the file really is utf-8.

3. Optionally, more heuristics. Some text editors look for patterns in the document to guess the encoding. I believe emacs has some magic of that sort.

4. Try the locale encoding, as provided by locale.getpreferredencoding (per j-a-meinel).

5. If that still does not work, decode('ascii', 'replace').
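To make the order of the steps concrete, here is a rough sketch of the cascade, not actual bzr-gtk code. It assumes Python 3 and a bytes input; the function name guess_decode and the locale_encoding parameter are made up for illustration. Step 3 (extra heuristics) is left out on purpose.

```python
import codecs
import locale

# Hypothetical BOM table. The utf-32 marks come first because the
# utf-32 LE BOM starts with the same two bytes as the utf-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_decode(data, locale_encoding=None):
    """Return (text, encoding_name) for a bytes object, trying each step in turn."""
    # Step 1: a BOM names the encoding family; these codecs strip the BOM.
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            try:
                return data.decode(encoding), encoding
            except UnicodeDecodeError:
                break  # looked like a BOM, but the rest does not decode

    # Step 2: plain utf-8.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    # Step 3 (pattern-based heuristics) is intentionally omitted here.

    # Step 4: the locale encoding, unless the caller supplied one.
    if locale_encoding is None:
        locale_encoding = locale.getpreferredencoding(False)
    try:
        return data.decode(locale_encoding), locale_encoding
    except (UnicodeDecodeError, LookupError):
        pass

    # Step 5: last resort, undecodable bytes become U+FFFD.
    return data.decode("ascii", "replace"), "ascii"
```

Returning the encoding name alongside the text is a choice made for this sketch, so that the UI can show which guess was used.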

FINALLY: always display a control that shows the encoding, and provides direct user control to override the automatic detection. The choices should at least include utf-8, the locale encoding, and explicit input of any arbitrary encoding supported by Python. Optionally, the choices could include a list of user-configurable favourite encodings.
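For the override control, something along these lines could validate a name typed by the user and build the list of choices. Again this is only a sketch assuming Python 3; normalize_encoding and encoding_choices are hypothetical helper names, not existing bzr-gtk functions.

```python
import codecs
import locale


def normalize_encoding(name):
    """Return Python's canonical codec name for `name`, or None if unknown."""
    try:
        # Caveat: codecs.lookup also accepts a few non-text codecs
        # (e.g. "hex"), so a real UI may want to filter further.
        return codecs.lookup(name.strip()).name
    except LookupError:
        return None


def encoding_choices(favourites=()):
    """List utf-8, the locale encoding, then user favourites, without duplicates."""
    candidates = ["utf-8", locale.getpreferredencoding(False)] + list(favourites)
    choices = []
    for candidate in candidates:
        canonical = normalize_encoding(candidate)
        if canonical is not None and canonical not in choices:
            choices.append(canonical)
    return choices
```

Normalizing through codecs.lookup means aliases such as "UTF8" and "utf-8" collapse into one entry, and an unknown name typed by the user can be rejected before any decoding is attempted.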