Bug #1172106 “Live LDAP tests fail on unicode names” : Bugs : OpenStack Identity (keystone)

Dolph Mathews (dolph) on 2013-04-29

summary:	Error in live ldap test → Live LDAP tests fail on unicode names
Changed in keystone:
importance:	Undecided → Medium

Sahdev Zala (spzala) wrote on 2013-04-29:

#1

Hi Dolph, given that this test is not needed for LDAP, should we just 'pass' it for LDAP?

jagan kumar kotipatruni (jagankumar-k) on 2013-05-21

Changed in keystone:
assignee:	nobody → jagan kumar kotipatruni (jagankumar-k)

Dolph Mathews (dolph) on 2013-06-10

Changed in keystone:
status:	New → Triaged

Dolph Mathews (dolph) wrote on 2013-06-10:

#2

The web API is expected to support unicode; how is the test not needed for LDAP?

Sahdev Zala (spzala) wrote on 2013-06-10:

#3

Hi Dolph, I was just wondering, looking at the bug where this test was introduced, since it was added to take care of a SQL-specific bug:
https://bugs.launchpad.net/keystone/+bug/1166701 (https://review.openstack.org/#/c/26465/)

Adam Young (ayoung) wrote on 2013-06-10:

#4

Test is needed, but the FakeLDAP impl doesn't enforce Unicode. Only shows up in the live_ldap tests.

Adam Young (ayoung) wrote on 2013-06-11:

#5

rcit points out that the problem is likely the LDIF file/migration for the default role:

"Looking at the ldif I'm guessing it is this entry that is the problem:

dn: cn=9fe2ff9ee4384b1894a90878d3e92bab,ou=Roles,dc=openstack,dc=org
objectClass: organizationalRole
ou: _member_
cn: 9fe2ff9ee4384b1894a90878d3e92bab
...
I'm guessing that python-ldap is dealing with the base64-decoded version of this as a plain string and that is why it is blowing up."

rob

Adam Young (ayoung) on 2013-06-11

Changed in keystone:
assignee:	jagan kumar kotipatruni (jagankumar-k) → John Dennis (jdennis-a)

John Dennis (jdennis-a) wrote on 2013-06-14:

#6

The short story is this is failing because python-ldap is not unicode safe by default.

Some more details.

Python unicode objects are UCS-4 (on wide builds; narrow builds use UCS-2), thus a character is 32 bits wide and uses the Unicode code points directly. libldap, which python-ldap links against, expects a single-octet string encoded in UTF-8. Somewhere between the 4-octet characters in the unicode object and the libldap library calls, the UCS-4 characters have to be encoded into UTF-8. There are a number of ways this can happen.

1) Set the Python default-encoding to UTF-8 (by default it's ASCII)

2) Explicitly perform the utf-8 encoding every time you call ldap (or for that matter any external library with a Python binding which expects UTF-8)

3) Perform the UTF-8 encoding in the ldap Python binding.

But here are the issues:

Option 3 is the ideal: the binding should take care of this. However, most Python extension bindings do not do the right thing by explicitly performing the encoding. Instead, many extension bindings rely on Python's argument passing mechanism to perform the encoding for them, and that built-in encoding is controlled by Python's default-encoding value, which by default is ASCII, not UTF-8. To make matters worse, the default-encoding is set in site.py, a site-local file. site.py is read very early on and then the default-encoding is locked; you can't change it. There is a long and sordid history to this, the details of which I'll spare you.

Since the extension binding falls back to the built-in encoding mechanism and the unmodifiable default-encoding on many distributions is ASCII there is an attempt to encode a unicode code point into the 7-bit ASCII range. This of course fails and throws the exception. If the default-encoding had been UTF-8 it would have worked wonderfully.
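
A minimal sketch of that failure, assuming nothing beyond Python's standard codecs. The sample name and the variable names are illustrative and not taken from the Keystone tests:

```python
# -*- coding: utf-8 -*-
# A unicode name containing two non-ASCII (CJK) code points.
name = u'name \u540d\u5b57'

# An implicit conversion under an ASCII default encoding behaves like this
# explicit call: the code points do not fit into 7 bits, so it raises.
try:
    name.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# The same text encodes cleanly to UTF-8: 5 ASCII bytes plus 3 bytes for
# each of the two CJK characters.
utf8_bytes = name.encode('utf-8')
```

Here `ascii_ok` ends up `False`, while `utf8_bytes` is an 11-byte string that decodes back to the original text.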

It's better to perform the encoding in the extension binding because only the binding knows what needs to be passed to the underlying C library it's calling into. UTF-8 is the norm in Linux/UNIX.

Option 2, explicitly encode/decode around every library call is ugly and error prone. It's best avoided if possible. However you see a lot of Python code which does this, usually because they couldn't figure out a viable alternative.

Option 1 is the best. If you set the default-encoding to UTF-8 in Python there is no need to do any explicit encoding, Python will do the right thing in 99.9% of the cases. However you still need to decode back into unicode objects when receiving strings from library calls.

The way we've solved the default encoding problem in the past is with a trivial extension module which we load before any other modules which sets the default-encoding to UTF-8. There are other tricks such as reloading site.py to get around the locking issue.

The short story is Python 2 is very screwed up with respect to Unicode strings, str objects, default encodings etc. Part of the problem is many of these features were added to Python 2 later with all the compatibility issues such an approach engenders. By far the worst decision was making the default encoding ASCII. But the good news is that Python 3 has cleaned up this mess and the problems pretty much go away in Python 3.

My recommendation is that we override the default-encoding and set it to UTF-8 as one of the first modules loaded. Whether this is done via an extension module or via the reload hack is open; I prefer the extension module.

BTW, some extension modules (e.g. GTK) got so burned by the above issues that they force the default-encoding to UTF-8 when they are loaded. This is kind of nasty because it's a silent side effect of loading the module. I haven't even touched on the issues of I/O to terminals and files, which has a whole other set of issues related to encoding. All of this is written up in various places.

But actually there is a 4th option: wrap all calls to ldap so that the wrapper explicitly encodes to UTF-8 when passing data in, and explicitly decodes to unicode for received data. This also allows one to handle other LDAP issues such as binary data, datetime objects, etc. We do this in other projects (e.g. IPA). The advantage of this approach is you can use all manner of custom Python classes to store your data and they get serialized into what python-ldap expects and converted back into the desired Python classes when read back from LDAP. But the implementation cost of that approach is not warranted here.
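
For illustration only, the core of such a wrapper could look roughly like the sketch below. The helper names (`utf8_encode`, `utf8_decode`, `encode_attrs`, `decode_attrs`) are hypothetical and are not taken from IPA or Keystone; the attribute mapping mimics the `{name: [values]}` shape used for LDAP entries.

```python
def utf8_encode(value):
    """Return UTF-8 bytes for text; pass existing bytes through unchanged."""
    if isinstance(value, bytes):
        return value
    return value.encode('utf-8')


def utf8_decode(value):
    """Return text for UTF-8 bytes; pass existing text through unchanged."""
    if isinstance(value, bytes):
        return value.decode('utf-8')
    return value


def encode_attrs(attrs):
    """Encode every value of a {name: [values]} mapping before an LDAP call."""
    return {name: [utf8_encode(v) for v in values]
            for name, values in attrs.items()}


def decode_attrs(attrs):
    """Decode every value of a {name: [values]} mapping returned from LDAP."""
    return {name: [utf8_decode(v) for v in values]
            for name, values in attrs.items()}


entry = {'cn': [u'name \u540d\u5b57'], 'ou': [u'_member_']}
wire = encode_attrs(entry)        # what would be handed to the binding
round_trip = decode_attrs(wire)   # what the caller would get back
```

This sketch decodes every value it receives, so genuinely binary attributes would have to be exempted in a real wrapper.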

Dolph Mathews (dolph) wrote on 2013-06-18:

#7

Thanks for the details!

Changed in keystone:
status:	Triaged → Confirmed

Sahdev Zala (spzala) wrote on 2013-07-24:

#8

I am digging more and testing different options.

John Dennis (jdennis-a) wrote on 2013-07-24:

#9

Try this, I was going to but didn't have time. It's an experiment to set the default encoding to utf-8, if my theory is correct the problem will go away.

I'm attaching 2 files: default_encoding.c setup.py

It will build a trivial Python extension that forces the default encoding to be utf-8, it's what I alluded to in my earlier comment.

Build it like this:

% python setup.py build

It will generate a python module default_encoding_utf8.so under build/lib*

Make sure default_encoding_utf8.so is in your Python path.

Merely importing the module will reset the default encoding to utf-8, e.g.:

$ python
Python 2.7.3 (default, Aug 9 2012, 17:23:57)
[GCC 4.7.1 20120720 (Red Hat 4.7.1-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.getdefaultencoding()
ascii
>>> import default_encoding_utf8
>>> print sys.getdefaultencoding()
utf-8

Then make sure you import default_encoding_utf8 as early in the import process as possible. Run the test. Did the problem go away?

John Dennis (jdennis-a) wrote on 2013-07-24:

#10

CPython module to reset default encoding (1.7 KiB, text/plain)

John Dennis (jdennis-a) wrote on 2013-07-24:

#11

setup file used to build module (1.4 KiB, text/x-python)

Sahdev Zala (spzala) wrote on 2013-07-24:

#12

Thanks a lot, John! Yep, I tested it and the problem goes away :-). It sets the default encoding to utf-8.

Dolph/Brant, so I guess using the .so file generated from the files John has provided (see his comments above) might be our best option. I tested it by copying the .so file under /opt/stack/keystone (a directory that's in the Python path) and then importing the module in the code.

I also tested two different things:
1. Manually declaring the encoding, which Python understands well. For example, u'name \u540d\u5b57'.encode('utf-8'). But this might not be a great idea, as it requires declaring the encoding every time you use unicode. It also requires you to decode again if you use the variable as a string somewhere else.

2. Modifying /usr/lib/python2.7/sitecustomize.py to make the default encoding utf-8. This takes care of the problem, but again I think this may not be a good option to handle programmatically.

Brant Knudson (blk-u) wrote on 2013-07-24:

#13

I don't see how we're going to use .so files. They get built for specific versions of operating systems, so I wouldn't expect that to work very well for a project that runs on lots of different operating systems.

John Dennis (jdennis-a) wrote on 2013-07-24:

#15

The test with the CPython module was more for informational purposes at the moment. What we wanted to confirm was my original supposition that it was the default encoding of ASCII that was the culprit and that changing the default encoding to UTF-8 would solve the problem. We've now confirmed that, which is good. Now the issue is how to reset the default encoding.

With the Python installations I'm familiar with the default encoding is set in site.py and then sys.setdefaultencoding is removed from sys to prevent it from being reset. I have a long write up on the issues if you're interested, but unfortunately it's not in a public blog at the moment.

There is a trick for reloading the sys module and getting the sys.setdefaultencoding function back. That does not require a C module. FWIW the C module works because you can always reset the default encoding from within the Python interpreter.

The reload(sys) trick is simple and does not require a C module, here's an example

$ python
Python 2.7.3 (default, Aug  9 2012, 17:23:57) 
[GCC 4.7.1 20120720 (Red Hat 4.7.1-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> sys.setdefaultencoding('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'setdefaultencoding'
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf-8')
>>> sys.getdefaultencoding()
'utf-8'
>>>

In other projects we use the C module approach because we were distrustful of any side-effects of reloading sys, but to be honest I don't know of any side-effects; we were just being (overly?) careful.

FWIW, if the C module were on pypi getting it installed and compiled with OpenStack would probably be trivial (or at least I think).

But for sure we can use the reload(sys) trick too.

Just two words of caution

1) Changing the default encoding affects everything running in the interpreter, that means every piece of Python code. I'm not against this because using ASCII is just plain wrong, it really should be UTF-8. Changing it globally to UTF-8 will probably fix any number of lurking issues. But on the other hand there is a possibility some code exists which depends on ASCII as the default encoding and it might break something far removed, but I really doubt anything would be depending on ASCII as the default encoding for proper operation and if it was it's wrong IMHO. Besides ASCII is a proper subset of UTF-8.

2) The default encoding needs to be reset as early as possible when modules load (best if it's first). Why? Because in Python2 strings cache the result of the default encoding conversion. Once cached when the string is referenced it simply reuses the cached encoding. If you switch the default encoding previously cached strings will reference encoded values from the previous default encoding setting. This is why they remove sys.setdefaultencoding() in site.py after they set the default encoding, otherwise you will have inconsistent cached encodings.
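
Packaged as a module that is imported first, the reload(sys) trick might look like the sketch below. The helper name `force_utf8_default_encoding` is hypothetical. On Python 3 the function does nothing, because the default encoding is already UTF-8 and `sys.setdefaultencoding` does not exist there.

```python
import sys


def force_utf8_default_encoding():
    """Switch the interpreter-wide default encoding to UTF-8 on Python 2.

    Hypothetical helper: call it before any other module is imported so
    that no string has cached a conversion made under the old default.
    """
    if sys.version_info[0] >= 3:
        # Python 3 already uses UTF-8 and has no setdefaultencoding().
        return sys.getdefaultencoding()
    # Python 2 only: site.py deleted sys.setdefaultencoding, and reloading
    # the sys module brings it back.
    reload(sys)  # noqa: F821 (reload is a builtin on Python 2)
    sys.setdefaultencoding('utf-8')
    return sys.getdefaultencoding()


encoding = force_utf8_default_encoding()
```

Both cautions above still apply: the change is global to the interpreter, and it only helps if it runs before other modules load.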

OpenStack Infra (hudson-openstack) wrote on 2013-07-25: Fix proposed to keystone (master)

#16

Fix proposed to branch: master
Review: https://review.openstack.org/38711

Changed in keystone:
assignee:	John Dennis (jdennis-a) → Sahdev Zala (spzala)
status:	Confirmed → In Progress

Sahdev Zala (spzala) wrote on 2013-07-25:

#17

John, thanks for the nice detail!

Yesterday, I had some brainstorming with Brant on IRC and we decided to see how much work/code change is required to wrap the Keystone LDAP code to process Unicode. We decided to go with a specific use case for now, i.e. handle the failing test. It seems like significant work if we want to wrap all the tests, and we may run into unknown risks. I am updating a patch for an initial review.

I guess changing the default encoding in a customer environment may not be a good option. Just a thought: since we ran into this only after adding a Unicode-specific test, and I believe no customer/user has raised a concern about this problem, maybe as a safe solution we just document something like: Python doesn't support Unicode by default, and it's easy to change the default encoding (document how), or they can manually declare the encoding. With manual declaration of the encoding, I noticed that LOG.debug fails to use the encoded string unless we decode it before use (the patch also shows this). We can modify our test to show this behavior, i.e. manually declaring the encoding for the user name, which requires only a couple of lines of code.

Also, python 3.x sets default encoding to UTF-8, so it's not a problem for python version 3 and above. I have tested the encoding with 3.2.3,
Python 3.3.2 (default, Jul 25 2013, 10:03:04)
[GCC 4.6.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys; sys.getdefaultencoding()
'utf-8'

With the code changes in the patch, I have tested that nothing breaks when the default encoding is set to utf-8 (as it would be if the code were run on Python 3).

I am up for any approach that we can all agree upon.

OpenStack Infra (hudson-openstack) wrote on 2013-08-08:

#18

Fix proposed to branch: master
Review: https://review.openstack.org/40986

OpenStack Infra (hudson-openstack) wrote on 2013-08-15: Fix merged to keystone (master)

#19

Reviewed: https://review.openstack.org/40986
Committed: http://github.com/openstack/keystone/commit/54a4c0696e3817307b8e9e50a2ffa5b5013e1f2e
Submitter: Jenkins
Branch: master

commit 54a4c0696e3817307b8e9e50a2ffa5b5013e1f2e
Author: Brant Knudson <email address hidden>
Date: Thu Aug 8 15:36:20 2013 -0500

Skip test_create_unicode_user_name in _ldap_livetest

Live LDAP tests were not passing because this test doesn't work.
This is being addressed with a different bug.

    Change-Id: Ic01aa505d867c1de30e2a1ed7c79ff1478e213ef
    Related-Bug: #1172106
    Related-Bug: #1210175
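
A skip of this kind can be sketched with plain unittest, as below. The class names and the skip message are hypothetical stand-ins, not the contents of the merged change: the live-LDAP subclass overrides the inherited test and skips it, while the base class keeps running it against other backends.

```python
import unittest


class IdentityTests(unittest.TestCase):
    """Stand-in for the shared backend tests."""

    def test_create_unicode_user_name(self):
        name = u'name \u540d\u5b57'
        # Placeholder body: the real test would create a user via the backend.
        self.assertEqual(name, name.encode('utf-8').decode('utf-8'))


class LiveLDAPIdentity(IdentityTests):
    """Stand-in for the live LDAP variant of the same tests."""

    def test_create_unicode_user_name(self):
        # Override the inherited test and skip it until unicode is handled.
        self.skipTest('Unicode names fail against live LDAP; see bug 1172106')


suite = unittest.TestLoader().loadTestsFromTestCase(LiveLDAPIdentity)
result = unittest.TestResult()
suite.run(result)
```

After the run, `result.skipped` holds the one skipped test and the run still counts as successful.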

OpenStack Infra (hudson-openstack) wrote on 2013-08-23: Fix proposed to keystone (stable/grizzly)

#20

Fix proposed to branch: stable/grizzly
Review: https://review.openstack.org/43524

Sahdev Zala (spzala) on 2013-09-06

Changed in keystone:
assignee:	Sahdev Zala (spzala) → nobody

OpenStack Infra (hudson-openstack) wrote on 2013-11-01: Related fix proposed to keystone (master)

#21

Related fix proposed to branch: master
Review: https://review.openstack.org/54929

OpenStack Infra (hudson-openstack) wrote on 2013-11-27: Related fix merged to keystone (master)

#22

Reviewed: https://review.openstack.org/54929
Committed: http://github.com/openstack/keystone/commit/c7468ee376fd7dee8f0a934d4f100ac5d904937d
Submitter: Jenkins
Branch: master

commit c7468ee376fd7dee8f0a934d4f100ac5d904937d
Author: Elena Ezhova <email address hidden>
Date: Fri Nov 1 17:53:37 2013 +0400

Skip test_create_update_delete_unicode_project in _ldap_livetest

Live LDAP tests fail because this test doesn't work.

    This failure occurs for the same reason as it did with
    test_create_unicode_user_name
    (Ic01aa505d867c1de30e2a1ed7c79ff1478e213ef)

Related bug: 1172106

Change-Id: I0422d14c937030c39a17776e7d321bd629d50b31

OpenStack Infra (hudson-openstack) wrote on 2014-03-23: Fix proposed to keystone (master)

#23

Fix proposed to branch: master
Review: https://review.openstack.org/82399

Changed in keystone:
assignee:	nobody → John Dennis (jdennis-a)

John Dennis (jdennis-a) wrote on 2014-03-23:

#24

To properly fix this we cannot just globally change the default
encoding, that is a temporary workaround not a structural fix
consistent with OpenStack coding practice and Python3 semantics.

This is a sequence of 4 patches. The full commit was broken down into the
4 patches to facilitate review where each patch implements one phase
of the total fix. Please see each commit message for details and
rationale for the change.

The correct way to handle non-ascii characters is to always use
unicode strings. In Python2 this requires the use of the unicode
string object instead of the str string object. In Python3 all strings
are unicode (str objects are actually unicode and what was str in
Python2 becomes bytes object in Python3). Thus all strings in
OpenStack code should be unicode in Python2 and will by definition be
unicode in Python3.

External library interfaces are often specified to require UTF-8
encoding for strings. This is because UTF-8 encoding is a byte (octet)
stream and a proper superset of ASCII. This is especially true of
libraries written in C or that implement RFC's whose specification
specifies strings are UTF-8 encoded, LDAP, XML, HTTP, etc. are common
examples.

The natural consequence of this is Python maintains its strings as
unicode (either UCS-2 or UCS-4) and conversion to/from UTF-8 occurs at
I/O and/or API boundaries, in other words when string data is entering
or leaving the "python domain".

python-ldap is the standard LDAP API for interacting with LDAP from
Python. python-ldap requires UTF-8 encoded strings. It
would have been ideal if inside the python-ldap API binding it
converted unicode strings to UTF-8 but it doesn't and this unfortunate
omission requires us to do the conversion when calling LDAP and on the
data returned from LDAP. The fact the python-ldap API does not perform
UTF-8 conversion just means doing the conversion ourselves is
consistent with any other API or I/O boundary requiring UTF-8.

To expedite LDAP testing without requiring a running live LDAP server
a fake LDAP API was introduced which emulates LDAP. Unfortunately the
fake LDAP is a poor emulation. For example it does not demand all LDAP
data be converted to strings nor that strings are UTF-8 encoded. This
meant a considerable portion of the LDAP unit tests were not catching
potential problems with data types being passed through the LDAP
API. Many of these problems only showed up during the occasional
testing against a live LDAP server using the python-ldap interface.

To address these issues the following was done:

* An abstract LDAP interface was defined. Both fake ldap and live ldap
  implement this interface. The interface requires UTF-8 encoded
  strings.

* An instance of the same abstract LDAP interface was implemented
  whose job it is to perform type conversion and logging, then
  call one of the LDAP instances to perform the actual LDAP
  operation. Note, type conversion includes other things besides UTF-8
  conversion, it also includes converting Python types such as
  booleans, integers, etc. to a string representation.

* The test coverage for non-ascii values was greatly expanded.
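
The layering described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the code from the patches: `StrictFakeLDAP`, `ConvertingHandler`, `to_ldap` and the `get` method are hypothetical names, and only `add_s` mirrors a python-ldap call name.

```python
def to_ldap(value):
    """Convert a Python value to the UTF-8 bytes an LDAP API expects."""
    if isinstance(value, bool):        # check bool first: it subclasses int
        return b'TRUE' if value else b'FALSE'
    if isinstance(value, bytes):
        return value
    if isinstance(value, int):
        return str(value).encode('utf-8')
    return value.encode('utf-8')       # text


class StrictFakeLDAP(object):
    """In-memory stand-in that, like the real binding, accepts only bytes."""

    def __init__(self):
        self.entries = {}

    def add_s(self, dn, modlist):
        values = [dn] + [v for _attr, vals in modlist for v in vals]
        for value in values:
            if not isinstance(value, bytes):
                raise TypeError('expected UTF-8 bytes, got %r' % type(value))
        self.entries[dn] = dict(modlist)

    def get(self, dn):
        return self.entries[dn]


class ConvertingHandler(object):
    """Same interface; converts types, then delegates to the real backend."""

    def __init__(self, backend):
        self.backend = backend

    def add_s(self, dn, attrs):
        modlist = [(attr, [to_ldap(v) for v in vals])
                   for attr, vals in sorted(attrs.items())]
        self.backend.add_s(to_ldap(dn), modlist)

    def get(self, dn):
        raw = self.backend.get(to_ldap(dn))
        return {attr: [v.decode('utf-8') for v in vals]
                for attr, vals in raw.items()}


backend = StrictFakeLDAP()
ldap_api = ConvertingHandler(backend)
dn = u'cn=\u540d\u5b57,ou=Users,dc=openstack,dc=org'
ldap_api.add_s(dn, {'cn': [u'\u540d\u5b57'], 'enabled': [True],
                    'uidNumber': [42]})
entry = ldap_api.get(dn)
```

Because the fake rejects anything that is not bytes, a unit test that bypasses the converting layer fails in the same way a live server would. Values read back arrive as text (`u'TRUE'`, `u'42'`); mapping them back to booleans and integers would need per-attribute knowledge that this sketch leaves out.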

OpenStack Infra (hudson-openstack) wrote on 2014-03-28: Fix merged to keystone (master)

#25

Reviewed: https://review.openstack.org/82397
Committed: https://git.openstack.org/cgit/openstack/keystone/commit/?id=ebb59a75cecc71ca7cc137e16056a4c8b513fd8d
Submitter: Jenkins
Branch: master

commit ebb59a75cecc71ca7cc137e16056a4c8b513fd8d
Author: John Dennis <email address hidden>
Date: Sat Mar 22 11:19:56 2014 -0400

Refactor LDAP API

    The fake LDAP API must emulate the python-ldap API as much as possible
    otherwise much of the LDAP testing is invalid. The python-ldap API
    only accepts utf-8 encoded strings. However, the fake LDAP API accepts
    any Python type therefore properly handling type conversion into and
    out of the LDAP API is not exercised by the fake LDAP API during
    testing. Currently type conversion is done inside the LdapWrapper
    which calls the python-ldap API, this means unicode issues only appear
    when testing with a live LDAP server.

    LdapWrapper and FakeLdap logically are two different providers of the
    same API, as such they should behave identically. Which LDAP API is
    used at run time a configurable option.

    We need a mechanism by which we can substitute an LDAP API and then
    wrap the calls to that API with type conversions. Type conversion
    wrapping replaces the Python types used in Keystone with the types
    needed for the LDAP API, calls the LDAP API, and then type converts
    the results back from LDAP to those used by Keystone.

    This patch establishes an LDAP API interface (LDAPHandler), modifies
    fake LDAP to support it, replaces LdapWrapper with the interface
    (invoking python-ldap) and adds another LDAPHandler instance which
    will be the common location for type conversions prior to calling the
    configured LDAP interface. See the LDAPHandler class definition for
    details).

    This patch is exclusively a refactoring patch anticipating a
    subsequent patch to properly handle unicode values. There is no
    significant change in functionality with this patch, it is just
    refactoring to more cleanly separate API boundaries. A few tests which
    exercised unicode were disabled in this patch because they will not
    work until the next patch which adds back in correct unicode
    handling. The idea here is to separate out the refactoring needed to
    support unicode from the actual unicode changes, this should make
    reviewing easier.

Partial-Bug: 1172106
Change-Id: I7db24040689245a616332b08744f40ab8381579d
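
The wrapping idea described in the commit message above can be illustrated with a minimal sketch. It is not the Keystone code: `BytesOnlyBackend` and `ConvertingHandler` are hypothetical names, the backend is only a stand-in for an API that insists on UTF-8 encoded strings, and the sketch targets Python 3, where encoded strings are `bytes` and text is `str`.

```python
class BytesOnlyBackend:
    """Hypothetical stand-in for a strict LDAP API: accepts only bytes."""

    def __init__(self):
        self._entries = {}

    @staticmethod
    def _require_bytes(value):
        if not isinstance(value, bytes):
            raise TypeError(
                "expected UTF-8 encoded bytes, got %s" % type(value).__name__)

    def add(self, dn, attrs):
        # Reject any text (unencoded) value, as a strict API would.
        self._require_bytes(dn)
        for name, values in attrs.items():
            self._require_bytes(name)
            for value in values:
                self._require_bytes(value)
        self._entries[dn] = {name: list(values) for name, values in attrs.items()}

    def get(self, dn):
        self._require_bytes(dn)
        return self._entries[dn]


class ConvertingHandler:
    """Hypothetical wrapper: text on the caller's side, UTF-8 on the backend's."""

    def __init__(self, backend):
        self._backend = backend

    def add(self, dn, attrs):
        # Encode every name and value before crossing into the backend.
        encoded = {
            name.encode("utf-8"): [value.encode("utf-8") for value in values]
            for name, values in attrs.items()
        }
        self._backend.add(dn.encode("utf-8"), encoded)

    def get(self, dn):
        # Decode everything the backend returns back into text.
        raw = self._backend.get(dn.encode("utf-8"))
        return {
            name.decode("utf-8"): [value.decode("utf-8") for value in values]
            for name, values in raw.items()
        }


handler = ConvertingHandler(BytesOnlyBackend())
handler.add("cn=fäké1,ou=Users", {"cn": ["fäké1"]})
fetched = handler.get("cn=fäké1,ou=Users")
```

Because the conversion lives in one wrapper, any backend that enforces the same byte-only contract can be swapped in behind it, and a test double that enforces the contract will fail in the same places a live server would.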

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-03-28:

#26

Reviewed: https://review.openstack.org/82398
Committed: https://git.openstack.org/cgit/openstack/keystone/commit/?id=cbf805161b84f13f459a19bfd46220c4f298b264
Submitter: Jenkins
Branch: master

commit cbf805161b84f13f459a19bfd46220c4f298b264
Author: John Dennis <email address hidden>
Date: Sat Mar 22 13:54:04 2014 -0400

Properly handle unicode & utf-8 in LDAP

This patch adds all the necessary type conversions between the LDAP
APIs.

* string literals are unicode

* unicode strings are utf-8 encoded before calling LDAP

* utf-8 strings received from the LDAP API are decoded into unicode

* string classes use the six.text_type for Python 2 vs. Python 3
  compatibility

* the fake LDAP implementation was reworked such that its external
  API only handles UTF-8 encoded strings but only uses unicode
  internally. This is because internally it must be able to operate on
  logical characters in order to perform string operations on its
  data. This is very much akin to what happens in a real LDAP
  implementation, the interface is UTF-8 but operations occur on
  decoded logical characters.

Unicode tests that were skipped are now re-enabled.

Partial-Bug: 1172106
Change-Id: Icce6b508f748214e241de40c3c9389b2caccea83
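
The point about operating on logical characters rather than encoded bytes can be seen with a small Python 3 sketch. It is only an illustration using the test string from this bug, not code from the patch; the variable names are made up for the example.

```python
name = "fäké1"              # text: five logical characters
wire = name.encode("utf-8")  # what would cross a UTF-8 only interface

# Length differs because ä and é each take two bytes in UTF-8.
char_count = len(name)
byte_count = len(wire)

# Case conversion on the encoded form only touches ASCII bytes,
# so the accented letters are left in lower case.
upper_from_bytes = wire.upper().decode("utf-8")
upper_from_text = name.upper()
```

Here `char_count` is 5 and `byte_count` is 7, and upper-casing the encoded form gives `FäKé1` while upper-casing the decoded text gives `FÄKÉ1`. Any length check or case-insensitive comparison done on the encoded bytes would therefore give the wrong answer for non-ASCII values.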

Changed in keystone:
status:	In Progress → Fix Committed

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-03-28:

#27

Reviewed: https://review.openstack.org/82399
Committed: https://git.openstack.org/cgit/openstack/keystone/commit/?id=1a5fa1a333cb48dd80311594efcfac89752d6954
Submitter: Jenkins
Branch: master

commit 1a5fa1a333cb48dd80311594efcfac89752d6954
Author: John Dennis <email address hidden>
Date: Sat Mar 22 14:17:02 2014 -0400

Expand the use of non-ascii values in ldap test

    Very few of the ldap tests were using non-ascii values, in fact
    non-ascii values were restricted to only specific tests that
    had 'unicode' in their test name. This is very weak test coverage.

    This patch replaces all occurrences of 'fake1', the standard string
    used in the tests for test value with 'fäké1' where the a has an
    umlaut and the e has an acute accent. Visually they look almost the
    same but will trigger the type of encoding exceptions we've seen
    in the past.

Closes-Bug: 1172106

Change-Id: I03b10f3da93a8fb388baacb00532c03019f327c0
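
Why a value such as 'fäké1' exposes bugs that 'fake1' hides can be sketched as follows. The example assumes Python 3 and uses an explicit ASCII encode to stand in for the implicit ASCII conversion that Python 2 performed when a unicode value was passed to str(); `survives_ascii` is a hypothetical helper written for this illustration.

```python
def survives_ascii(value):
    """Return True if value can be encoded with the ASCII codec."""
    try:
        value.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


plain_ok = survives_ascii("fake1")     # pure ASCII test value
accented_ok = survives_ascii("fäké1")  # non-ASCII test value
```

With the plain value the conversion succeeds and a missing encode or decode step goes unnoticed; with the accented value the same code path raises, which is the kind of failure the expanded tests are meant to surface.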

Dolph Mathews (dolph) on 2014-04-04

tags:

added: ldap

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-04-07: Fix proposed to keystone (milestone-proposed)

#28

Fix proposed to branch: milestone-proposed
Review: https://review.openstack.org/85770

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-04-07:

#29

Fix proposed to branch: milestone-proposed
Review: https://review.openstack.org/85771

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-04-07:

#30

Fix proposed to branch: milestone-proposed
Review: https://review.openstack.org/85772

Thierry Carrez (ttx) on 2014-04-08

tags:

added: icehouse-rc-potential

Thierry Carrez (ttx) on 2014-04-17

tags:

added: icehouse-backport-potential
removed: icehouse-rc-potential

Revision history for this message

Openstack Gerrit (openstack-gerrit) wrote on 2014-05-02: Fix proposed to keystone (stable/icehouse)

#31

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/91883

Alan Pevec (apevec) on 2014-05-30

Changed in keystone:
milestone:	none → juno-1

Thierry Carrez (ttx) on 2014-06-11

Changed in keystone:
status:	Fix Committed → Fix Released

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-06-12: Related fix proposed to keystone (master)

#32

Related fix proposed to branch: master
Review: https://review.openstack.org/99646

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-06-17: Fix merged to keystone (stable/icehouse)

#33

Reviewed: https://review.openstack.org/91883
Committed: https://git.openstack.org/cgit/openstack/keystone/commit/?id=935fd60326feafd767993475a48b9f5973c828db
Submitter: Jenkins
Branch: stable/icehouse

commit 935fd60326feafd767993475a48b9f5973c828db
Author: John Dennis <email address hidden>
Date: Fri May 2 14:14:20 2014 -0400

Encode/Decode LDAP parameters to/from UTF-8

    The python-ldap API only accepts UTF-8 encoded strings therefore any
    unicode values must be encoded to UTF-8 prior to passing to
    python-ldap and conversely UTF-8 encoded strings returned from
    python-ldap need to be decoded back from UTF-8 into unicode.

Need to use unicode() rather than str() to properly handle non-ascii
characters, but to be PY2/PY3 compatible use six.text_type.

    Very few of the ldap tests were using non-ascii values, in fact
    non-ascii values were restricted to only specific tests that
    had 'unicode' in their test name. This is very weak test coverage.
    Replace all occurrences of 'fake', the standard string
    used in the tests for test value with 'fäké' where the a has an
    umlaut and the e has an acute accent. Visually they look almost the
    same but will trigger the type of encoding exceptions we've seen
    in the past.

    This is the minimal backport for icehouse from the following
    master commits:
    https://review.openstack.org/#/c/82396/
    https://review.openstack.org/#/c/82398/
    https://review.openstack.org/#/c/82399/

Closes-Bug: #1172106

Change-Id: I6b328dfc8435457a8d7ff16320f3d869cfa1043c

tags:

added: in-stable-icehouse

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-06-17: Related fix merged to keystone (master)

#34

Reviewed: https://review.openstack.org/99646
Committed: https://git.openstack.org/cgit/openstack/keystone/commit/?id=f1cb3d0fc3399817cce2a870ef510fbf803934fb
Submitter: Jenkins
Branch: master

commit f1cb3d0fc3399817cce2a870ef510fbf803934fb
Author: John Dennis <email address hidden>
Date: Thu Jun 12 08:40:43 2014 -0400

Add missing docstrings and 1 unittest for LDAP utf-8 fixes

A minimal backport of these accepted master commits:

    https://review.openstack.org/#/c/82396/
    https://review.openstack.org/#/c/82398/
    https://review.openstack.org/#/c/82399/

was done for Icehouse in the following:

https://review.openstack.org/#/c/91883/

    In the above backport review it was requested docstrings be added for
    some functions and an additional unittest be added. That work was done
    for the Icehouse backport and was positively reviewed; at the same time
    it was requested that those changes be reflected back into master so as
    not to lose them going forward. This patch does that, adds the
    docstrings and one additional small unittest.

Change-Id: I881ba9b274692427d4c7b9f5357ee4735b4e6699
Related-Bug: #1172106

Chuck Short (zulcss) on 2014-08-07

tags:

removed: icehouse-backport-potential

Thierry Carrez (ttx) on 2014-10-16

Changed in keystone:
milestone:	juno-1 → 2014.2

	Status	Importance	Assigned to	Milestone
OpenStack Identity (keystone)	Fix Released	Medium	John Dennis	OpenStack Identity (keystone) 2014.2 "juno"
Grizzly	Fix Released	Medium	Brant Knudson	OpenStack Identity (keystone) 2013.1.4
Icehouse	Fix Released	Medium	John Dennis	OpenStack Identity (keystone) 2014.1.2