clean_html creates unclean and invalid output

Bug #621080 reported by Shish
This bug affects 1 person
Affects   Status    Importance   Assigned to   Milestone
libxml2   New       Undecided    Unassigned
lxml      Invalid   Undecided    Unassigned

Bug Description

Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml.html.clean import clean_html
>>> for n in range(1, 20): clean_html("""<style>.cake {color: blue;}</ style>""")
...
'<div><style>.cake {color: blue;}</ style></style></div>'
'<div><style>.cake {color: blue;}</ style>yle></style></div>'
'<div><style>.cake {color: blue;}</ style>yle>yle></style></div>'
...

Python : (2, 6, 5, 'final', 0)
lxml.etree : (2, 2, 4, 0)
libxml used : (2, 7, 6)
libxml compiled : (2, 7, 6)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
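
The growing output above means that repeated calls with the same input do not return the same string. Here is a minimal sketch of a harness for checking that, written for Python 3 rather than the 2.6 session in this report. `outputs_are_stable` and `leaky` are hypothetical names, not part of lxml, and the helper accepts any callable so it can be tried without lxml installed:

```python
def outputs_are_stable(func, arg, runs=20):
    """Return True if func(arg) gives the same result on every call."""
    results = [func(arg) for _ in range(runs)]
    return all(r == results[0] for r in results)


# Stand-in for a misbehaving cleaner: it keeps state between calls, so its
# output grows each time, loosely mimicking the symptom reported here.
_leak = []

def leaky(text):
    _leak.append("yle>")
    return text + "".join(_leak)


print(outputs_are_stable(str.upper, "<style>x</ style>"))  # True
print(outputs_are_stable(leaky, "<style>x</ style>"))      # False
```

With lxml installed, passing `clean_html` from `lxml.html.clean` as `func` should show whether a given setup is affected.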

Shish (shish) wrote :

Another example; this one only seems to happen on my 64-bit server:

>>> from lxml.html.clean import Cleaner
>>> a = "<style>Moo</ style>"
>>> help(Cleaner)
>>> Cleaner().clean_html(a)
'<div><style>Moo</ style>Instances cleans the document of each of the possible offending\nelements. ...... Cleans the document.\n\n____iinniitt____(self, **kw)\n\naallllooww__eelleemmeenntt(self, el)\n\naallllooww__eemmbbeeddddeedd__uurrll(self, el, url)\n\naallllooww__ffoollllooww(self, anchor)\n Override to suppress rel="nofollow" on some anchors.\n\ncclleeaann__hhttmmll(self, html)\n\nkkiillll__ccoonnddiittiioonnaall__ccoommmmeennttss(self, doc)\n ........ set([\'embed\', \'iframe\'])\nue\n\n@</style></div>'

scoder (scoder) wrote :

At least one of the underlying problems is not in clean.py but rather in the HTML parser in libxml2.

>>> from lxml.html import fromstring,tostring
>>> tostring(fromstring("<style>.cake {color: blue;}</ style>"))
'<html><head><style>.cake {color: blue;}</ style></style></head></html>'
>>> fromstring("<style>.cake {color: blue;}</ style>")[0][0].text
'.cake {color: blue;}</ style>'

So the closing style tag is not parsed as a tag but rather as text.
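
One possible workaround, not proposed anywhere in this thread, is to rewrite such end tags before the markup reaches the parser. The sketch below assumes that `</ style>` was meant as an ordinary end tag. `normalize_end_tags` is a hypothetical helper, and because the regular expression is applied to the whole string it would also rewrite matches inside comments or attribute values:

```python
import re

# Matches an end tag with whitespace between "</" and the tag name,
# for example "</ style>". Assumption: these were meant as normal end tags.
_SPACED_END_TAG = re.compile(r"</\s+([A-Za-z][A-Za-z0-9]*)\s*>")


def normalize_end_tags(html):
    """Remove the stray whitespace so the parser sees a regular end tag."""
    return _SPACED_END_TAG.sub(r"</\1>", html)


print(normalize_end_tags("<style>.cake {color: blue;}</ style>"))
# prints: <style>.cake {color: blue;}</style>
```

Since this changes how the input is tokenized, it is probably best treated as an illustration rather than something to put in front of a sanitizer for untrusted HTML.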

I'm surprised about the text duplication, though.

scoder (scoder) wrote :

The text duplication bug appears to have been fixed in libxml2 2.7.8. Closing this bug as invalid for lxml.
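
To check whether a given setup predates that release, the version tuple can be compared against (2, 7, 8). This is a small sketch: `has_duplication_fix` is a hypothetical helper, and the threshold is taken only from the comment above, not from the libxml2 changelog:

```python
# First libxml2 release in which the duplication no longer appeared,
# according to the comment in this report.
FIXED_IN = (2, 7, 8)


def has_duplication_fix(libxml_version):
    """Compare a (major, minor, patch) tuple against the fixed release."""
    return tuple(libxml_version) >= FIXED_IN


print(has_duplication_fix((2, 7, 6)))  # version from this report: False
print(has_duplication_fix((2, 7, 8)))  # True
```

With lxml installed, `lxml.etree.LIBXML_VERSION` gives the tuple for the library in use at runtime, which corresponds to the "libxml used" line in the report.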

Changed in lxml:
status: New → Invalid