The error occurs all the time for me. Whenever pages claim to be UTF-8 (in the HTTP header and the meta tags) but contain invalid characters, I get the above error. Here are a few ways to reproduce the bug:
In [1]: from lxml.html import parse
In [2]: root = parse('http://telofy.spline.de/foo/lxml-bug-690110.html').getroot()
In [3]: root.xpath('//br')[0].tail
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
/tmp/gist-1496487/<ipython console> in <module>()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._Element.tail.__get__ (src/lxml/lxml.etree.c:36181)()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._collectText (src/lxml/lxml.etree.c:16915)()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree.funicode (src/lxml/lxml.etree.c:22016)()
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 0: invalid start byte
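Or, in case I one day forget what that file’s purpose is and delete it:
In [6]: import urllib2
In [7]: text = urllib2.urlopen('https://gist.github.com/raw/1496487/3290697d42023941296c2ba092b95642ba03c5ee/lxml-bug-690110.html').read()
In [8]: from lxml.html import document_fromstring
In [9]: root = document_fromstring(text)
In [10]: root.xpath('//br')[0].tail
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
/tmp/gist-1496487/<ipython console> in <module>()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._Element.tail.__get__ (src/lxml/lxml.etree.c:36181)()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._collectText (src/lxml/lxml.etree.c:16915)()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree.funicode (src/lxml/lxml.etree.c:22016)()
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 0: invalid start byte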
That gist can be found on GitHub [1].
What Python does, and what happened in the case of the original reporter, is that it replaces the invalid character with the replacement character �. This is identical to the behavior of a “text.decode('utf-8', errors='replace')” following the second example above.
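To illustrate the difference between strict decoding and errors='replace', here is a minimal sketch in plain Python, without lxml. The sample bytes and the helper names decode_strict and decode_lenient are made up for this example; the only thing taken from the traceback is the offending byte 0x92 (which happens to be a right single quotation mark in Windows-1252, though I am only guessing that this is where it comes from).

```python
# Hypothetical sample: text that claims to be UTF-8 but starts with the
# byte 0x92, which is not a valid UTF-8 start byte.
raw = b"\x92tis invalid"


def decode_strict(data):
    """Decode as UTF-8 with the default strict handler.

    Returns None instead of raising, so the failure is easy to inspect.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


def decode_lenient(data):
    """Decode as UTF-8, replacing each invalid byte with U+FFFD."""
    return data.decode("utf-8", errors="replace")


strict_result = decode_strict(raw)    # decoding fails, as in the traceback
lenient_result = decode_lenient(raw)  # the invalid byte becomes U+FFFD
```

The strict variant fails on the first byte, just like the traceback above, while the lenient one keeps the rest of the text and only loses the one invalid byte.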
I don’t know Cython, but perhaps you can just add this “errors='replace'” in [2] and possibly in [3] (but I haven’t tested any of this).
Sorry for the bad formatting, but I don’t know which markup syntax I can use in such comments, if any.
[1] https://gist.github.com/1496487
[2] https://github.com/lxml/lxml/blob/c5c8cae024a543205c55e09af832c1bf528d2a0d/src/lxml/apihelpers.pxi#L1344
[3] https://github.com/lxml/lxml/blob/c5c8cae024a543205c55e09af832c1bf528d2a0d/src/lxml/apihelpers.pxi#L1332