Cleaning HTML nested more than 254 levels deep
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Triaged | Undecided | Unassigned |
Bug Description
I use Odoo, which uses lxml's html cleaner [1] to clean up untrusted incoming HTML emails [2].
A recent email contained simple, innocent markup buried inside 657 nested <span> elements. I have absolutely no idea how the sender ended up writing such an enormous number of nested <span>s, but there they were.
After sanitizing, the mail came out wrongly empty (actually a pile of 254 empty <span> elements). This similar question [3] shows easy steps to reproduce the issue, which seems to lie deeper in the stack than clean_html() itself.
Is there any way to raise that limit so the cleaner can recurse deeper than 254 levels?
[1]: https:/
[2]: https:/
[3]: https:/
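A minimal sketch of the reproduction described above (the exact input and depth are assumptions; the cut-off depth observed can vary with the libxml2 version):

```python
from lxml.html.clean import clean_html  # on lxml >= 5.2 this lives in the lxml_html_clean package

# Hypothetical input mimicking the report: harmless text wrapped in
# far more nested <span>s than the default parser will descend into.
nested = "<span>" * 400 + "text" + "</span>" * 400
result = clean_html(nested)
# On affected setups the text at the bottom of the nesting is silently
# dropped and only a stack of empty <span> elements remains.
```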
This might be due to the safety limits that libxml2's default parser applies in order to defend against DoS attacks from overly large document content. You could try parsing the document with your own "lxml.html.HTMLParser" configured with the "huge_tree=True" option.
Obviously, disabling the parser's limits opens your code up to DoS attacks, but it's worth a try to see whether that is the issue here.
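That suggestion looks roughly like this (the huge_tree option is real; whether it actually lifts the HTML parser's depth cap depends on your libxml2 version, so treat this as something to try, not a guaranteed fix):

```python
from lxml.html import HTMLParser, fromstring, tostring

# Hypothetical deeply nested input, like the 657-level email from the report:
deep = "<div>" + "<span>" * 300 + "deep text" + "</span>" * 300 + "</div>"

# huge_tree=True asks libxml2 to lift its built-in safety limits
# (text size, nesting depth, ...). This also removes the DoS protection,
# so only use it on input whose size you bound yourself.
parser = HTMLParser(huge_tree=True)
doc = fromstring(deep, parser=parser)
out = tostring(doc, encoding="unicode")
```

You would then run the cleaner over the tree parsed this way instead of letting clean_html() re-parse the string with the default parser.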