Parsing a Chinese string that starts with `<` raises ParserError

Bug #1374250 reported by wonderfuly
This bug affects 1 person
Affects: lxml
Status: Confirmed
Importance: Low
Assigned to: Unassigned

Bug Description

>>> from lxml.html import fromstring
>>> fromstring('<你')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 634, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 535, in document_fromstring
    "Document is empty")
ParserError: Document is empty

你 is a Chinese character. Its Unicode escape is '\u4f60':

>>> from lxml.html import fromstring
>>> fromstring('<\u4f60')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 634, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 535, in document_fromstring
    "Document is empty")
ParserError: Document is empty

So it seems like the combination `<\u` is the issue.
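
A caveat on the second session: on Python 2.7 a plain '...' literal does not interpret `\u` escapes (only u'...' does), so '<\u4f60' may have reached the parser as `<` followed by six ASCII characters rather than as `<你`. A minimal sketch for checking what a literal actually contains, written in Python 3 syntax with illustrative variable names:

```python
# -*- coding: utf-8 -*-
# Sketch: compare the escaped, literal, and raw spellings of the test input.

escaped = u"<\u4f60"  # \u4f60 is decoded to a single character
literal = u"<你"       # the same character typed directly
raw = r"<\u4f60"      # backslash kept as-is, like a Python 2 '...' literal

# The escaped and literal forms are the same two-character string.
assert escaped == literal
assert len(escaped) == 2
assert ord(escaped[1]) == 0x4F60

# UTF-8 encodes the character as three bytes.
assert escaped.encode("utf-8") == b"<\xe4\xbd\xa0"

# The raw form is seven ASCII characters: '<', '\', 'u', '4', 'f', '6', '0'.
assert len(raw) == 7
assert raw != escaped
```

If both spellings raise the same ParserError, the trigger is more likely the incomplete tag after `<` than the `\u` escape itself.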

Version Info:

Python : sys.version_info(major=2, minor=7, micro=6, releaselevel='final', serial=0)
lxml.etree : (3, 4, 0, 0)
libxml used : (2, 9, 0)
libxml compiled : (2, 9, 0)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

scoder (scoder) wrote :

The exception is actually correct: there is no document to parse here.

However, given that the parser tries to recover from parse errors, it can be argued that it should return a document regardless, i.e. it should create an empty tag and return that.

Patches welcome.
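
Until such a change exists, a caller can approximate the suggested behaviour with a small wrapper. This is only a sketch and not a patch to lxml: `fromstring_or_empty` and its `fallback_tag` default are hypothetical names chosen for illustration, while `lxml.html.fromstring`, `lxml.html.Element`, and `lxml.etree.ParserError` are existing lxml APIs.

```python
from lxml import etree
from lxml import html


def fromstring_or_empty(text, fallback_tag="span"):
    """Parse HTML, returning an empty element when there is nothing to parse.

    Hypothetical helper: mirrors the "return an empty tag" idea by catching
    the ParserError that lxml.html.fromstring raises for an empty document.
    """
    try:
        return html.fromstring(text)
    except etree.ParserError:
        # No document could be built; hand back an empty placeholder element.
        return html.Element(fallback_tag)


# An empty input normally raises "Document is empty"; here it yields <span/>.
empty = fromstring_or_empty("")
print(empty.tag, len(empty))

# Well-formed input is passed through unchanged.
para = fromstring_or_empty("<p>hi</p>")
print(para.tag, para.text)
```

Whether a placeholder element is the right fallback depends on the caller. Code that needs to distinguish "empty input" from "empty element" should keep handling ParserError itself.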

Changed in lxml:
importance: Undecided → Low
status: New → Confirmed