ASCII tostring corrupts Russian data

Bug #1945048 reported by Ned Batchelder
Affects: lxml
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

Parsing XML that contains Russian text and then serializing it with the default (ASCII) encoding produces corrupted data: one paragraph is truncated and ends with `\x10/p>`. Serializing with encoding="utf8" or encoding="unicode" avoids the problem.
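
A quick way to check serialized output for this kind of corruption is to scan it for control characters that XML 1.0 does not permit, such as the `\x10` seen here. The following is only a minimal sketch and not part of the original report; the helper name `find_illegal_chars` and the sample strings are made up for illustration.

```python
import re

# C0 control characters that XML 1.0 does not allow.
# Tab (\x09), line feed (\x0a) and carriage return (\x0d) are allowed.
_ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_illegal_chars(text):
    """Return (offset, character) pairs for disallowed control characters."""
    return [(m.start(), m.group()) for m in _ILLEGAL_XML_CHARS.finditer(text)]


# Illustrative inputs only: one imitates the corrupted ending, one is clean.
corrupted = "<p>some text\x10/p>"
clean = "<p>some text</p>\n"

print(find_illegal_chars(corrupted))
print(find_illegal_chars(clean))
```

An empty list does not prove the output is intact (truncation alone would not be caught), so this is a cheap first check rather than a full validation.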

To reproduce:

---- 8< --------------
import requests
from lxml import etree

URL = "https://raw.githubusercontent.com/nedbat/nedbatcom/a53aa4d2a4aff80cad8775700b2c4866fd2cc795/pages/text/deleting-code_ru.px"
FILENAME = "data.px"

# Get the bytes in a file and parse them to DOM.
with open(FILENAME, "wb") as f:
    f.write(requests.get(URL).content)
element = etree.parse(FILENAME).getroot()

# The default serialization is ASCII: the output is corrupted (prints True).
text = etree.tostring(element).decode("utf8")
print("\x10/p>" in text)

# A UTF-8 serialization is not corrupted (prints False).
text = etree.tostring(element, encoding="utf8").decode("utf8")
print("\x10" in text)

# A Unicode string serialization is not corrupted (prints False).
text = etree.tostring(element, encoding="unicode")
print("\x10" in text)
-----------------------------
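
If ASCII-only output is still needed, one possible workaround is to serialize with encoding="unicode", which the report found unaffected, and let Python do the ASCII conversion. This is only a sketch under that assumption: `tostring_ascii` is a hypothetical helper, not an lxml API, and it has not been run against the failing document.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback only so the sketch runs where lxml is not installed;
    # the bug itself concerns lxml with libxml2.
    import xml.etree.ElementTree as etree


def tostring_ascii(element):
    """Hypothetical helper: ASCII bytes without asking the serializer for ASCII."""
    # Serialize to a Python str first.
    text = etree.tostring(element, encoding="unicode")
    # Replace every non-ASCII character with a numeric character reference.
    return text.encode("ascii", errors="xmlcharrefreplace")


# Small illustrative document, not the file from the report.
root = etree.fromstring("<p>Удаление кода</p>".encode("utf8"))
data = tostring_ascii(root)
print(data)
```

Parsing `data` again should give back the original text, because numeric character references are resolved by the parser.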

Versions:
Python : sys.version_info(major=3, minor=9, micro=7, releaselevel='final', serial=0)
lxml.etree : (4, 6, 3, 0)
libxml used : (2, 9, 10)
libxml compiled : (2, 9, 10)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)

scoder (scoder) wrote :

I can reproduce this, and it seems to come from the serialiser. However, the serialisation happens in libxml2, not lxml. It is worth investigating whether this also happens with libxml2 versions other than 2.9.10. (There's a problem with 2.9.11/12 that prevents their use, and 2.9.13 has an uncertain release date.)

Changed in lxml:
importance: Undecided → Medium
status: New → Triaged