Comment 15 for bug 1172106

John Dennis (jdennis-a) wrote :

The test with the CPython module was mostly for informational purposes. What we wanted to confirm was my original supposition: that the ASCII default encoding was the culprit and that changing the default encoding to UTF-8 would solve the problem. We've now confirmed that, which is good. Now the issue is how to reset the default encoding.
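To illustrate the failure mode, here is a minimal sketch. It uses explicit decode() calls to stand in for the implicit conversion Python 2 performs with the default encoding when a byte string is mixed with a unicode string. The helper name try_decode and the sample bytes are made up for the illustration; they are not taken from the code in this bug.

```python
# Explicit stand-in for the implicit conversion Python 2 performs when a
# byte string meets a unicode string: it decodes with the default encoding.

utf8_bytes = b'caf\xc3\xa9'  # UTF-8 encoding of u'caf\xe9'


def try_decode(data, encoding):
    """Return the decoded text, or None if the codec rejects the bytes.

    Hypothetical helper, for illustration only.
    """
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# With an ASCII default the conversion fails: byte 0xc3 is outside ASCII.
ascii_result = try_decode(utf8_bytes, 'ascii')

# With a UTF-8 default the same bytes decode cleanly.
utf8_result = try_decode(utf8_bytes, 'utf-8')

# Pure ASCII input decodes to the same text under either codec.
same_for_ascii = try_decode(b'cafe', 'ascii') == try_decode(b'cafe', 'utf-8')
```

The sketch only covers the decode direction; the implicit encode direction (unicode to bytes) goes through the same default encoding.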

In the Python installations I'm familiar with, the default encoding is set in site.py, and then sys.setdefaultencoding is removed from sys to prevent it from being reset. I have a long write-up on the issues if you're interested, but unfortunately it's not in a public blog at the moment.

There is a trick for getting the sys.setdefaultencoding function back by reloading the sys module; it does not require a C module. FWIW, the C module works because the default encoding can always be reset from C code inside the interpreter (via PyUnicode_SetDefaultEncoding), regardless of what site.py removed from sys.

The reload(sys) trick is simple. Here's an example:

$ python
Python 2.7.3 (default, Aug 9 2012, 17:23:57)
[GCC 4.7.1 20120720 (Red Hat 4.7.1-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> sys.setdefaultencoding('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'setdefaultencoding'
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf-8')
>>> sys.getdefaultencoding()
'utf-8'
>>>

In another project we use the C module approach because we were distrustful of possible side effects of reloading sys. To be honest, I don't know of any side effects; we were just being (overly?) careful.

FWIW, if the C module were on PyPI, getting it installed and compiled with OpenStack would probably be trivial (or at least I think so).

But for sure we can use the reload(sys) trick too.

Just two words of caution:

1) Changing the default encoding affects everything running in the interpreter, which means every piece of Python code. I'm not against this, because using ASCII is just plain wrong; it really should be UTF-8. Changing it globally to UTF-8 will probably fix any number of lurking issues. On the other hand, there is a possibility that some code depends on ASCII being the default encoding, so the change might break something far removed. I really doubt anything depends on an ASCII default for proper operation, though, and if it does, it's wrong IMHO. Besides, ASCII is a proper subset of UTF-8.

2) The default encoding needs to be reset as early as possible when modules load (best if it's the first thing done). Why? Because in Python 2, unicode strings cache the result of the default encoding conversion. Once the result is cached, later references to the string simply reuse it. If you switch the default encoding, previously cached strings will still hold values encoded under the previous default. This is why site.py removes sys.setdefaultencoding() after setting the default encoding; otherwise you could end up with inconsistent cached encodings.
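Putting the reload(sys) trick and the "as early as possible" advice together, a start-up helper might look roughly like the sketch below. The function name ensure_utf8_default_encoding is hypothetical, and the Python 3 branch is an assumption added only so the snippet is harmless on interpreters where sys.setdefaultencoding no longer exists.

```python
import sys


def ensure_utf8_default_encoding():
    """Switch the Python 2 default encoding to UTF-8; do nothing elsewhere.

    Hypothetical helper, for illustration only.
    """
    if sys.version_info[0] >= 3:
        # Python 3 has no sys.setdefaultencoding; its default is UTF-8.
        return sys.getdefaultencoding()
    if sys.getdefaultencoding() != 'utf-8':
        if not hasattr(sys, 'setdefaultencoding'):
            # site.py deleted the function; reloading sys brings it back.
            reload(sys)  # builtin in Python 2 only
        sys.setdefaultencoding('utf-8')
    return sys.getdefaultencoding()


# Call this first in the entry-point module, before other imports have a
# chance to create unicode strings and cache their default-encoded form.
current = ensure_utf8_default_encoding()
```

Calling the helper a second time is a no-op, so it is safe if more than one entry point ends up invoking it.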