Parsing a Chinese string that starts with `<` raises ParserError
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Confirmed | Low | Unassigned |
Bug Description
>>> from lxml.html import fromstring
>>> fromstring('<你')
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/usr/lib/
doc = document_
File "/usr/lib/
"Document is empty")
ParserError: Document is empty
你 is a Chinese character; its Unicode representation is '\u4f60'.
>>> from lxml.html import fromstring
>>> fromstring(
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/usr/lib/
doc = document_
File "/usr/lib/
"Document is empty")
ParserError: Document is empty
So it seems like the combination `<\u` is the issue.
Version Info:
Python : sys.version_
lxml.etree : (3, 4, 0, 0)
libxml used : (2, 9, 0)
libxml compiled : (2, 9, 0)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)
The exception is actually correct: there is no document to parse here.
However, given that the parser tries to recover from parse errors, it can be argued that it should return a document regardless, i.e. it should create an empty tag and return that.
Patches welcome.
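Until such a patch exists, callers who prefer an empty element over an exception could wrap the call themselves. The following is only a minimal sketch of that idea, not lxml's behaviour: `fromstring_or_empty` and its `fallback_tag` parameter are hypothetical names introduced here, and the choice of `span` as the placeholder tag is an assumption.

```python
from lxml import etree
from lxml.html import Element, fromstring


def fromstring_or_empty(markup, fallback_tag="span"):
    """Parse markup with lxml.html.fromstring, falling back to an empty element.

    Hypothetical helper for illustration; it is not part of lxml.
    """
    try:
        return fromstring(markup)
    except etree.ParserError:
        # lxml found no document to parse ("Document is empty"), so return
        # an empty placeholder element instead of propagating the error.
        return Element(fallback_tag)


if __name__ == "__main__":
    for markup in ("<p>hello</p>", ""):
        element = fromstring_or_empty(markup)
        print(repr(markup), "->", element.tag, len(element))
```

Only `ParserError` is caught, so other failures (for example, passing a non-string argument) still surface to the caller.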