lxml

ASCII tostring corrupts Russian data

Bug #1945048 reported by Ned Batchelder on 2021-09-25

This bug affects 1 person

Affects   Status    Importance   Assigned to   Milestone
lxml      Triaged   Medium       Unassigned

Bug Description

Parsing Russian XML content and then serializing it with tostring's default ASCII encoding produces corrupted data: a paragraph is truncated and ends with `\x10/p>`. Passing encoding="utf8" or encoding="unicode" avoids the problem.

To reproduce:

---- 8< --------------
import requests
from lxml import etree

URL = "https://raw.githubusercontent.com/nedbat/nedbatcom/a53aa4d2a4aff80cad8775700b2c4866fd2cc795/pages/text/deleting-code_ru.px"
FILENAME = "data.px"

# Get the bytes in a file and parse them to DOM.
with open(FILENAME, "wb") as f:
    f.write(requests.get(URL).content)
element = etree.parse(FILENAME).getroot()

# An ascii string has corrupted data.
text = etree.tostring(element).decode('utf8')
print("\x10/p>" in text)

# A UTF8 string is not corrupted.
text = etree.tostring(element, encoding="utf8").decode("utf8")
print("\x10" in text)

# A Unicode string is not corrupted.
text = etree.tostring(element, encoding="unicode")
print("\x10" in text)
-----------------------------
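
A possible stopgap, sketched here under assumptions and not confirmed anywhere in this report as a fix: since encoding="unicode" is not affected, the ASCII conversion could be done in Python instead of in the serializer. The helper name `to_ascii_xml` is hypothetical; it relies only on Python's built-in "xmlcharrefreplace" error handler.

```python
def to_ascii_xml(text: str) -> bytes:
    """Encode already-serialized XML text as ASCII bytes.

    Every non-ASCII character becomes a decimal character reference
    (for example, U+042F becomes &#1071;).
    """
    return text.encode("ascii", errors="xmlcharrefreplace")


# Small stand-in for the output of etree.tostring(element, encoding="unicode").
sample = "<p>Я</p>"
print(to_ascii_xml(sample))  # b'<p>&#1071;</p>'
```

With lxml this would be called as `to_ascii_xml(etree.tostring(element, encoding="unicode"))`. Character references are only interpreted in text and attribute values, so this sketch assumes there are no non-ASCII characters in element names, comments, CDATA sections, or processing instructions.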

Versions:
Python : sys.version_info(major=3, minor=9, micro=7, releaselevel='final', serial=0)
lxml.etree : (4, 6, 3, 0)
libxml used : (2, 9, 10)
libxml compiled : (2, 9, 10)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)
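
For anyone retesting against another libxml2 build, a listing like the one above can be produced with a short script. This is a minimal sketch: `collect_versions` is a hypothetical helper name, and the fallback for a missing lxml install is an addition for convenience. The version attributes themselves are the ones lxml.etree exposes.

```python
import sys


def collect_versions():
    """Return (label, value) pairs in the same order as the listing above."""
    rows = [("Python", sys.version_info)]
    try:
        from lxml import etree
    except ImportError:
        # Assumption: report the gap instead of failing when lxml is absent.
        rows.append(("lxml.etree", "not installed"))
        return rows
    rows += [
        ("lxml.etree", etree.LXML_VERSION),
        ("libxml used", etree.LIBXML_VERSION),
        ("libxml compiled", etree.LIBXML_COMPILED_VERSION),
        ("libxslt used", etree.LIBXSLT_VERSION),
        ("libxslt compiled", etree.LIBXSLT_COMPILED_VERSION),
    ]
    return rows


for label, value in collect_versions():
    print(f"{label:<16} : {value}")
```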

scoder (scoder) wrote on 2021-09-28:

I can reproduce this, and it seems to come from the serialiser. However, the serialisation happens in libxml2, not lxml. It would be worth investigating whether this works with libxml2 versions other than 2.9.10. (A problem with 2.9.11/12 prevents their use, and 2.9.13 has an uncertain release date.)

Changed in lxml:
importance:	Undecided → Medium
status:	New → Triaged
