lxml.html.document_fromstring fails with certain emojis
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | New | Undecided | Unassigned |
Bug Description
Python : sys.version_
lxml.etree : (4, 6, 3, 0)
libxml used : (2, 9, 4)
libxml compiled : (2, 9, 4)
libxslt used : (1, 1, 29)
libxslt compiled : (1, 1, 29)
lxml 4.6.3 on macOS Mojave (10.14.6) fails to parse HTML input, supplied either as a unicode str or as UTF-8 bytes, when it contains certain emojis (ZWJ sequences).
This pytest test script (from my test suite) will pass if the bug is present:
import pytest
from lxml.html import document_fromstring
def test_lxml_
def assert_
# Woman Facepalming Emoji
# See https:/
content = u'<p>\U0001F926
doc = document_
assert doc[0][0].text == u'\U0001F926\
with pytest.
with pytest.
assert_
assert_
Note that a workaround for this issue is to pass the input as UTF-16 or UTF-32 bytes.
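For anyone wanting to try the workaround, here is a minimal sketch. The sample string is an assumption: it uses the standard "woman facepalming" ZWJ sequence (U+1F926 U+200D U+2640 U+FE0F), since the exact string from the test above is truncated in this report. The helper names `to_utf16_bytes` and `parse_via_utf16` are hypothetical, not part of lxml.

```python
# Illustrative input only: the standard "woman facepalming" ZWJ sequence.
# The exact string used in the original test is truncated in this report,
# so this value is an assumption.
SAMPLE_HTML = "<p>\U0001F926\u200D\u2640\uFE0F</p>"


def to_utf16_bytes(html: str) -> bytes:
    """Encode markup as UTF-16; Python's 'utf-16' codec prepends a BOM."""
    return html.encode("utf-16")


def parse_via_utf16(html: str):
    """Hypothetical helper: parse via UTF-16 bytes instead of str/UTF-8."""
    # Imported lazily so the encoding helper works without lxml installed.
    from lxml.html import document_fromstring

    return document_fromstring(to_utf16_bytes(html))
```

On an affected system, `parse_via_utf16(SAMPLE_HTML)` would be the call to compare against `document_fromstring(SAMPLE_HTML)`. I have not verified the result on every lxml/libxml2 combination.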
I've encountered similar behaviour on macOS 11.7 (Big Sur) when parsing an example UTF-8 encoded HTML file that contains at least two multibyte characters.
One detail learned while attempting to narrow down the cause: the problem disappears when the 'lxml' dependency is installed from a binary wheel.
A near-minimal repro case is available at https://github.com/jayaddison/macos-lxml-issue-repro.git/
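Since the behaviour differs between a source build and a binary wheel, it may help to compare the library versions reported by each install. Wheels typically bundle their own libxml2/libxslt, while a source build usually links against whatever the system provides, though that is an assumption about any particular setup. A small sketch that prints the same fields as the version block in the description (the function name `lxml_build_info` is made up for this example):

```python
import sys


def lxml_build_info() -> dict:
    """Collect the interpreter and lxml/libxml2/libxslt version tuples."""
    info = {"Python": sys.version_info[:3]}
    try:
        from lxml import etree
    except ImportError:
        # lxml is not installed in this environment; record that and stop.
        info["lxml.etree"] = None
        return info
    info["lxml.etree"] = etree.LXML_VERSION
    info["libxml used"] = etree.LIBXML_VERSION
    info["libxml compiled"] = etree.LIBXML_COMPILED_VERSION
    info["libxslt used"] = etree.LIBXSLT_VERSION
    info["libxslt compiled"] = etree.LIBXSLT_COMPILED_VERSION
    return info


if __name__ == "__main__":
    for name, value in lxml_build_info().items():
        print(f"{name:<18}: {value}")
```

Running this once in an environment with the source-built lxml and once with the wheel should show whether the two are using different libxml2 versions.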