clean_html creates unclean and invalid output
Bug #621080 reported by Shish
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
libxml2 | New | Undecided | Unassigned |
lxml | Invalid | Undecided | Unassigned |
Bug Description
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml.html.clean import clean_html
>>> for n in range(1, 20): clean_html(
...
'<div><style>.cake {color: blue;}</ style><
'<div><style>.cake {color: blue;}</ style>yle>
'<div><style>.cake {color: blue;}</ style>yle>
...
Python : (2, 6, 5, 'final', 0)
lxml.etree : (2, 2, 4, 0)
libxml used : (2, 7, 6)
libxml compiled : (2, 7, 6)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
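For anyone trying to confirm whether a given result is "invalid" without relying on lxml itself, here is a minimal sketch of a tag-balance check. It is not part of the original report: it assumes Python 3 and the standard library's html.parser, and the names VOID_TAGS, BalanceChecker and find_problems are made up for illustration. It only looks for unbalanced tags, so it is a rough sanity check rather than a real validator.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (illustrative subset).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class BalanceChecker(HTMLParser):
    """Hypothetical helper: tracks open tags and records mismatches."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []      # currently open, non-void tags
        self.problems = []   # human-readable descriptions of mismatches

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.problems.append("unexpected </%s>" % tag)


def find_problems(fragment):
    """Return a list of balance problems found in an HTML fragment."""
    checker = BalanceChecker()
    checker.feed(fragment)
    checker.close()
    # Anything still on the stack was opened but never closed.
    return checker.problems + ["unclosed <%s>" % t for t in checker.stack]


if __name__ == "__main__":
    print(find_problems("<div><p>ok</p></div>"))
    print(find_problems("<div>text"))
```

Feeding each string returned by clean_html through find_problems and printing any non-empty result would be one way to spot the runs that came back malformed.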
Another example; this one only seems to happen on my 64-bit server:
>>> from lxml.html.clean import Cleaner
>>> a = "<style>Moo</ style>"
>>> help(Cleaner)
>>> Cleaner().clean_html(a)
'<div><style>Moo</ style>Instances cleans the document of each of the possible offending\nelements. ...... Cleans the document. \n\n___ _iinniitt_ ___(self, **kw)\n\ naallllooww_ _eelleemmeenntt (self, el)\n\naalllloo ww__eemmbbeeddd deedd__ uurrll( self, el, url)\n\ naallllooww_ _ffoollllooww( self, anchor)\n Override to suppress rel="nofollow" on some anchors. \n\ncclleeaann_ _hhttmmll( self, html)\n\ nkkiillll_ _ccoonnddiittii oonnaall_ _ccoommmmeenntt ss(self, doc)\n ........ set([\'embed\', \'iframe\ '])\nue\ n\n@</style> </div>'