lxml does not support unicode when built for Python 3 on OS X
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Confirmed | Medium | Unassigned |
Bug Description
Howdy!
While working on https:/
I'll help in any way I can, but I'm afraid I'm a bit out of my element.
iconv versions:
Mac 10.12.2
iconv (GNU libiconv 1.11) (Though the same result under libiconv 1.15)
Linux
iconv (Ubuntu EGLIBC 2.15-0ubuntu10.18) 2.15
I also inspected the function _setupPythonUnicode in parser.pxi to see what the enc value was on various versions.
Mac
2.7.13 = UTF-16LE
3.3.6 = UCS-4LE
Linux
2.7.13 = UCS-4LE
3.3.6 = UCS-4LE
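For anyone who wants to compare interpreters without rebuilding lxml, here is a rough Python sketch of the same check. It is not lxml's code (which inspects the actual string buffer); the function name `guess_internal_unicode_encoding` is made up for illustration, and it assumes that a "narrow" build (`sys.maxunicode == 0xFFFF`) corresponds to UTF-16 and everything else to UCS-4, which would be consistent with the values listed above.

```python
import sys


def guess_internal_unicode_encoding():
    """Guess an iconv-style name for the interpreter's unicode layout.

    Hypothetical helper, not part of lxml: narrow builds (Python 2 only)
    report sys.maxunicode == 0xFFFF and are treated as UTF-16; wide builds
    and every Python >= 3.3 report 0x10FFFF and are treated as UCS-4.
    """
    width = "UTF-16" if sys.maxunicode == 0xFFFF else "UCS-4"
    endian = "LE" if sys.byteorder == "little" else "BE"
    return width + endian


if __name__ == "__main__":
    print(sys.version.split()[0], "=", guess_internal_unicode_encoding())
```

Running this under each interpreter should show whether a given Python 2 build is narrow (as the Mac 2.7.13 build above appears to be) or wide.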
What perplexes me is that libiconv should be able to handle this (to my... limited understanding)
$ iconv -l | grep UTF-16LE
UTF-16LE
$ iconv -l | grep UCS-4LE
UCS-4LE
I've seen this behavior on my machine and on Travis CI.
This isn't due to libiconv, it's an incomplete implementation in lxml. See the difference between
https://github.com/lxml/lxml/blob/ebafce689ae62704b1c0944bcd5b84e34f275a2d/src/lxml/parser.pxi#L1014
and
https://github.com/lxml/lxml/blob/ebafce689ae62704b1c0944bcd5b84e34f275a2d/src/lxml/parser.pxi#L1251
This isn't easy to fix, because the incremental parser can receive arbitrary Unicode strings in different memory buffer formats (PEP-393) across its lifetime. That means the data might need copying into a 4-byte format before passing it into libxml2, as we cannot repeatedly switch encodings at a per-byte level while parsing.
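To make the buffer-format problem concrete, here is a small Python sketch (not lxml code). The helpers `storage_width` and `to_fixed_width` are hypothetical names; the first approximates the per-character width PEP 393 would pick for a string from its largest code point, and the second copies a chunk into a fixed 4-byte-per-code-point buffer, which is roughly the kind of copy described above.

```python
import sys


def storage_width(text):
    """Approximate the PEP 393 width (1, 2 or 4 bytes per character).

    Derived from the largest code point in the string; this mirrors the
    rule in PEP 393 but does not inspect the real CPython buffer.
    """
    largest = max(map(ord, text), default=0)
    if largest < 0x100:
        return 1
    if largest < 0x10000:
        return 2
    return 4


def to_fixed_width(text):
    """Copy text into a 4-byte-per-code-point buffer in native byte order."""
    codec = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"
    return text.encode(codec)  # explicit endianness, so no BOM is prepended


if __name__ == "__main__":
    # Chunks fed to an incremental parser may each use a different width.
    chunks = ["<root>", "caf\u00e9", "\u4e2d\u6587", "\U0001F600", "</root>"]
    for chunk in chunks:
        data = to_fixed_width(chunk)
        assert len(data) == 4 * len(chunk)
        print(storage_width(chunk), "byte(s)/char ->", len(data), "bytes copied")
```

The point of the sketch is only that successive chunks can arrive in 1-, 2- or 4-byte layouts, so a single declared input encoding would need every chunk normalized to one layout first, at the cost of an extra copy.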