ASCII tostring corrupts Russian data
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Triaged | Medium | Unassigned |
Bug Description
Parsing Russian XML content and then serializing it as an ASCII string produces corrupted data. One paragraph is truncated and ends with `\x10/p>`. Using encoding="utf8" or encoding="unicode" avoids the problem.
To reproduce:
---- 8< --------------
import requests
from lxml import etree
URL = "https:/
FILENAME = "data.px"
# Get the bytes in a file and parse them to DOM.
with open(FILENAME, "wb") as f:
    f.write(
element = etree.parse(
# An ascii string has corrupted data.
text = etree.tostring(
print("\x10/p>" in text)
# A UTF8 string is not corrupted.
text = etree.tostring(
print("\x10" in text)
# A Unicode string is not corrupted.
text = etree.tostring(
print("\x10" in text)
-------
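For readers without the original file, the same three-way comparison can be sketched against an in-memory document. The sample text and the helper name `control_chars` are assumptions made for illustration. An input this small may well not trigger the truncation, which was reported against one specific file, so no particular result is claimed here.

```python
from lxml import etree

# Hypothetical sample; the original report parsed a downloaded file instead.
SAMPLE = "<doc><p>Привет, мир</p></doc>".encode("utf-8")


def control_chars(text):
    """Return the C0 control characters (other than tab, LF, CR) found in text."""
    return {ch for ch in text if ord(ch) < 0x20 and ch not in "\t\n\r"}


root = etree.fromstring(SAMPLE)

# With no encoding argument, tostring() returns ASCII bytes.
ascii_text = etree.tostring(root).decode("ascii")
# encoding="utf8" also returns bytes; decode before searching for characters.
utf8_text = etree.tostring(root, encoding="utf8").decode("utf-8")
# encoding="unicode" returns a str directly.
unicode_text = etree.tostring(root, encoding="unicode")

for label, text in [("ascii", ascii_text), ("utf8", utf8_text), ("unicode", unicode_text)]:
    # A stray character such as "\x10" would show up in this listing.
    print(label, sorted(control_chars(text)))
```

Decoding the byte results first keeps the membership checks valid on Python 3, where searching for a `str` inside `bytes` raises `TypeError`.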
Versions:
Python : sys.version_
lxml.etree : (4, 6, 3, 0)
libxml used : (2, 9, 10)
libxml compiled : (2, 9, 10)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)
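A listing like the one above can be produced from the version tuples that lxml exposes. This is a minimal sketch; the use of `sys.version_info` for the Python line and the label formatting are assumptions, not necessarily what the reporter ran.

```python
import sys

from lxml import etree

# Each lxml attribute below is a tuple of integers, e.g. (2, 9, 10).
versions = {
    "Python": tuple(sys.version_info),
    "lxml.etree": etree.LXML_VERSION,
    "libxml used": etree.LIBXML_VERSION,
    "libxml compiled": etree.LIBXML_COMPILED_VERSION,
    "libxslt used": etree.LIBXSLT_VERSION,
    "libxslt compiled": etree.LIBXSLT_COMPILED_VERSION,
}

for name, value in versions.items():
    print(f"{name:<17}: {value}")
```

Comparing the "used" and "compiled" tuples shows whether lxml runs against a different libxml2 than the one it was built with, which matters when a bug is specific to one libxml2 release.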
I can reproduce this, and it seems to come from the serialiser. However, the serialisation happens in libxml2, not lxml. It is worth investigating whether this works with libxml2 versions other than 2.9.10. (There's a problem with 2.9.11/12 that prevents their use, and 2.9.13 has an uncertain release date.)
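Since the report says encoding="unicode" is unaffected, one possible workaround is to serialise to a Python string and let Python produce the ASCII bytes instead. This is a sketch only: the helper name `to_ascii_via_unicode` is hypothetical, and it has not been checked against the file from the report.

```python
from lxml import etree


def to_ascii_via_unicode(element):
    """Serialise to str first, then let Python escape non-ASCII characters."""
    text = etree.tostring(element, encoding="unicode")
    # "xmlcharrefreplace" turns each non-ASCII character into a numeric
    # character reference, so the result contains only ASCII bytes.
    return text.encode("ascii", "xmlcharrefreplace")


root = etree.fromstring("<p>Привет</p>".encode("utf-8"))
data = to_ascii_via_unicode(root)

# Parsing the ASCII bytes again should give back the original text.
print(etree.fromstring(data).text == "Привет")
```

The output carries no XML declaration, which is acceptable here because an ASCII-only document is also valid under the default UTF-8 interpretation.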