The short story is that this fails because python-ldap is not Unicode-safe by default.

Some more details.

Python unicode objects are UCS-4 (on the wide builds most Linux distributions ship), so each character is 32 bits wide and holds the Unicode code point directly. libldap, which python-ldap links against, expects a string of single octets encoded in UTF-8. Somewhere between the 4-octet characters in the unicode object and the libldap library calls, the UCS-4 characters have to be encoded into UTF-8. There are a number of ways this can happen.

1) Set the Python default-encoding to UTF-8 (by default it's ASCII)

2) Explicitly perform the UTF-8 encoding every time you call ldap (or, for that matter, any external library with a Python binding that expects UTF-8)

3) Perform the UTF-8 encoding in the ldap Python binding.

But here are the issues:

Option 3 is the ideal: the binding should take care of this. However, most Python extension bindings do not do the right thing, which is to perform the encoding explicitly. Instead, many rely on Python's argument-passing mechanism to perform the encoding for them. That built-in encoding is controlled by Python's default-encoding value, which by default is ASCII, not UTF-8. To make matters worse, the default-encoding is set in site.py, a site-local file. site.py is read very early on and then the default-encoding is locked; you can't change it. There is a long and sordid history to this, the details of which I'll spare you.

Since the extension binding falls back to the built-in encoding mechanism, and the unmodifiable default-encoding on many distributions is ASCII, Python attempts to encode a non-ASCII Unicode code point into the 7-bit ASCII range. This of course fails and throws the exception. If the default-encoding had been UTF-8 it would have worked wonderfully.
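
The failure described above can be reproduced without ldap at all. This is only a small sketch: the sample name is made up, and the explicit .encode("ascii") call stands in for the implicit conversion Python 2 performs when the default-encoding is ASCII.

```python
# Text containing one non-ASCII code point (U+00E9, e with acute accent).
name = u"Jos\u00e9"

# Encoding to UTF-8 yields the octet string a UTF-8 C library expects.
utf8_octets = name.encode("utf-8")

# Encoding with the ASCII codec fails on the same input, because U+00E9
# does not fit in the 7-bit ASCII range.
try:
    name.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

The UTF-8 result is five octets long (the accented character takes two), while the ASCII attempt raises UnicodeEncodeError.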

It's better to perform the encoding in the extension binding because only the binding knows what needs to be passed to the underlying C library it's calling into. UTF-8 is the norm in Linux/UNIX.

Option 2, explicitly encoding/decoding around every library call, is ugly and error-prone. It's best avoided if possible. However, you see a lot of Python code that does this, usually because the authors couldn't figure out a viable alternative.
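
For illustration, Option 2 looks roughly like this at every call site. fake_library_call is a hypothetical stand-in for a binding that takes and returns UTF-8 octets; it is not a python-ldap function.

```python
def fake_library_call(octets):
    """Hypothetical stand-in for a binding that accepts and returns UTF-8 octets."""
    if not isinstance(octets, bytes):
        raise TypeError("expected an encoded octet string")
    return octets  # simply echo the input back


query = u"caf\u00e9"

# Encode on the way in ...
raw_result = fake_library_call(query.encode("utf-8"))

# ... and decode on the way out, at every single call site.
result = raw_result.decode("utf-8")
```

Forgetting either half at just one call site brings the problem back, which is why this approach is error-prone.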

Option 1 is the best practical choice. If you set the default-encoding to UTF-8 in Python there is no need to do any explicit encoding; Python will do the right thing in 99.9% of the cases. However, you still need to decode back into unicode objects when receiving strings from library calls.

The way we've solved the default-encoding problem in the past is with a trivial extension module, loaded before any other modules, which sets the default-encoding to UTF-8. There are other tricks, such as reloading the sys module to restore the sys.setdefaultencoding function that site.py removes, which gets around the locking issue.
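
The reload trick is usually seen in the following form, which reloads the sys module so that sys.setdefaultencoding, deleted by site.py at startup, becomes available again. This is a sketch of that widely circulated Python 2 idiom, not the extension module mentioned above. The function name is hypothetical, and on Python 3 it does nothing because the default encoding is already UTF-8.

```python
import sys


def force_utf8_default_encoding():
    """Set the default encoding to UTF-8 on Python 2; no-op on Python 3."""
    if sys.version_info[0] >= 3:
        # Python 3 already uses UTF-8 and offers no way to change it.
        return sys.getdefaultencoding()
    # Python 2 only: reload() is a builtin there. Reloading sys restores
    # the setdefaultencoding function that site.py deleted.
    reload(sys)  # noqa: F821
    sys.setdefaultencoding("utf-8")
    return sys.getdefaultencoding()


current = force_utf8_default_encoding()
```

It has to run before any code that depends on the implicit conversion, which is why it belongs in one of the first modules loaded.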

The short story is that Python 2 is very screwed up with respect to Unicode strings, str objects, default encodings, etc. Part of the problem is that many of these features were added to Python 2 late, with all the compatibility issues such an approach engenders. By far the worst decision was making the default encoding ASCII. But the good news is that Python 3 has cleaned up this mess and the problems pretty much go away in Python 3.

My recommendation is that we override the default-encoding and set it to UTF-8 in one of the first modules loaded. Whether this is done via an extension module or via the reload hack is open; I prefer the extension module.

BTW, some extension modules (e.g. GTK) got so burned by the above issues that they force the default-encoding to UTF-8 when they are loaded. This is kind of nasty because it's a silent side effect of loading the module. I haven't even touched on I/O to terminals and files, which has a whole other set of encoding issues. All of this is written up in various places.

But actually there is a 4th option: wrap all calls to ldap so that the wrapper explicitly encodes to UTF-8 when passing data in and explicitly decodes to unicode for received data. This also allows one to handle other LDAP issues such as binary data, datetime objects, etc. We do this in other projects (e.g. IPA). The advantage of this approach is that you can use all manner of custom Python classes to store your data; they get serialized into what python-ldap expects and converted back into the desired Python classes when read back from LDAP. But the implementation cost of that approach is not warranted here.
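
To give a rough idea of what such a wrapper layer involves, here is a minimal sketch of just the encode/decode half. It is not the IPA implementation and makes no python-ldap calls. The function names and the BINARY_ATTRS set are hypothetical, and it assumes entries shaped as a dict mapping attribute names to lists of values.

```python
# Hypothetical set of attributes whose values are raw binary, not text.
# Simplification: real LDAP attribute names are case-insensitive.
BINARY_ATTRS = frozenset(["jpegPhoto", "userCertificate"])


def encode_entry(entry):
    """Convert {attr: [text or octet values]} into {attr: [UTF-8 octets]}."""
    encoded = {}
    for attr, values in entry.items():
        # Values that are already octet strings pass through untouched.
        encoded[attr] = [
            v if isinstance(v, bytes) else v.encode("utf-8") for v in values
        ]
    return encoded


def decode_entry(entry):
    """Convert {attr: [octets]} back to text, leaving binary attributes alone."""
    decoded = {}
    for attr, values in entry.items():
        if attr in BINARY_ATTRS:
            decoded[attr] = list(values)
        else:
            decoded[attr] = [v.decode("utf-8") for v in values]
    return decoded


entry = {"cn": [u"Jos\u00e9"], "jpegPhoto": [b"\xff\xd8"]}
wire = encode_entry(entry)
round_trip = decode_entry(wire)
```

A full wrapper would call these around every read and write and add converters for datetime objects and custom classes, which is where the implementation cost comes from.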