I got the expected result by using lxml.html.HTMLParser in both cases. Can this be used as a workaround?
----[test in ipython]---
In [1]: p = lxml.html.HTMLParser()
In [2]: p.feed(file('ok.html').read())
In [3]: p.close().cssselect('title')[0].text
Out[3]: u'\u6f22\u5b57\u30c6\u30ad\u30b9\u30c8'
In [4]: p = lxml.html.HTMLParser()
In [5]: p.feed(file('ng.html').read())
In [6]: p.close().cssselect('title')[0].text
Out[6]: u'\u6f22\u5b57\u30c6\u30ad\u30b9\u30c8'
----