strange title text due to the position of meta charset
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Confirmed | Medium | Unassigned |
Bug Description
Using: Python 2.7.11 on win32 (32-bit), lxml 3.6.1 (details are at the end of this message).
While parsing old-style HTML with lxml.html.parse(),
I got a strange string or an error when extracting the title text via root.cssselect(
After some testing, I figured out that the cause is the order of <title> and <meta charset=xxx>.
The result is good when meta charset appears before title,
but I get strange text or an exception when title comes before meta.
Here is an example (both files are UTF-8):
---[good case: ok.html]---
<!DOCTYPE html>
<html lang="ja" class="col2r">
<head>
<meta charset="UTF-8" />
<title>
</head>
<body>
hello
</body>
</html>
-----------------
---[bad case: ng.html]---
<!DOCTYPE html>
<html lang="ja" class="col2r">
<head>
<title>
<meta charset="UTF-8" />
</head>
<body>
hello
</body>
</html>
----------------
This is the actual result in ipython:
----
In [1]: lxml.html.
Out[1]: u'\u6f22\
In [2]: lxml.html.
Out[2]: u'\xe6\
In [3]: lxml.html.
Out[3]: '\xe6\xbc\
----
As results [2] and [3] show, the bad case seems to use the raw UTF-8 byte sequence as a unicode string,
as if the encoding were latin-1 (a raw 8-bit charset).
I don't know whether the parse result of ng.html should be the same as that of ok.html.
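The symptom in [2] and [3] can be reproduced without lxml. The following is a minimal sketch, assuming the title begins with U+6F22 as Out[1] suggests. The rest of the title is cut off above, so the sample string here is only an illustration, not the real title.

```python
# Illustration only: U+6F22 is taken from Out[1]; the full title is not known.
title = u"\u6f22"

# The bytes as stored in a UTF-8 file: e6 bc a2
utf8_bytes = title.encode("utf-8")

# Decoding those bytes as latin-1 maps every byte to one code point,
# which gives a string starting with u'\xe6', like Out[2].
misdecoded = utf8_bytes.decode("latin-1")

print(repr(utf8_bytes))
print(repr(misdecoded))
```

If this guess is right, the ng case is what one would expect from a parser that has not yet seen the charset declaration when it reads the title.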
BTW, with a different title text, the behaviour differs depending on whether I parse from a file() object or a StringIO() object.
The former behaves like the ng case above; the latter raises UnicodeDecodeError when retrieving the string via the text property.
I don't know why.
---[version information]---
Python : sys.version_
lxml.etree : (3, 6, 1, 0)
libxml used : (2, 9, 4)
libxml compiled : (2, 9, 4)
libxslt used : (1, 1, 29)
libxslt compiled : (1, 1, 29)
-------
(lxml is installed as win32 binary whl from http://
Is it a described behaviour?
- http://lxml.de/parsing.html#parsing-html
- http://lxml.de/parsing.html#python-unicode-strings
How can I detect and work around such an ng case?
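One possible approach, as a sketch under assumptions rather than a confirmed fix: pass the encoding explicitly through the `encoding` argument of `lxml.html.HTMLParser`, so the result does not depend on where the meta tag sits, and, as a fallback detection, check whether the extracted text survives a latin-1 encode followed by a UTF-8 decode. The helper names `parse_utf8` and `repair_if_mojibake` are hypothetical. I have not verified this against lxml 3.6.1 on win32, and the heuristic can misfire on genuine latin-1 text that happens to form valid UTF-8.

```python
def parse_utf8(path):
    """Parse an HTML file with the encoding forced to UTF-8 (hypothetical helper)."""
    import lxml.html  # imported here so the repair helper below works without lxml

    # Assumption: the file is known to be UTF-8, as both example files are.
    parser = lxml.html.HTMLParser(encoding="utf-8")
    return lxml.html.parse(path, parser=parser).getroot()


def repair_if_mojibake(text):
    """Return text re-decoded as UTF-8 if it looks like UTF-8 read as latin-1.

    Intended for unicode strings; anything else is returned unchanged.
    """
    try:
        raw = text.encode("latin-1")   # fails if any code point is above U+00FF
        fixed = raw.decode("utf-8")    # fails if the bytes are not valid UTF-8
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text                    # not the ng pattern; leave unchanged
    return fixed
```

Usage might look like `repair_if_mojibake(parse_utf8("ng.html").findtext(".//title"))`. With the encoding forced, the repair step should be a no-op, so it only matters for text that was already parsed the ng way.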