Comment 8 for bug 1958539

frenzy (frenzy-madness) wrote :

I'm looking into this now and have some more info.

Bleach might be deprecated, but its developers have promised that it will keep receiving security fixes and stay compatible with new Python versions, so it should still be safe to use.

Another alternative I see among projects moving away from bleach/lxml is nh3, a Python binding for the ammonia project, which is written in Rust. Ammonia seems to be security-focused, whitelist-based, and fast. nh3 uses the latest pyo3 version (compatible with Python 3.12) and provides wheels built for the stable ABI, so it should work for everybody. Porting code from bleach to nh3 also seems straightforward, and nh3 is much faster.

https://github.com/rust-ammonia/ammonia
https://github.com/messense/nh3
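
To give an idea of what a port could look like, here is a minimal sketch. It assumes nh3 is installed; the `sanitize` helper and the `ALLOWED_TAGS` set are hypothetical names for illustration, and I haven't verified that the output matches bleach in every case.

```python
try:
    import nh3  # third-party binding for ammonia: pip install nh3
except ImportError:
    nh3 = None

# Hypothetical allow-list for illustration; pick the tags your project needs.
ALLOWED_TAGS = {"p", "b", "i"}


def sanitize(html: str) -> str:
    """Return html with tags outside ALLOWED_TAGS stripped."""
    if nh3 is None:
        raise RuntimeError("nh3 is not installed")
    # Roughly comparable to bleach.clean(html, tags=ALLOWED_TAGS, strip=True),
    # although the two libraries may differ in details.
    return nh3.clean(html, tags=ALLOWED_TAGS)
```

Calling `sanitize('<p onclick="x()">hi</p>')` should drop the `onclick` attribute and keep the paragraph.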

I've also looked for projects using clean_html or Cleaner from lxml. I searched the top 5000 PyPI projects, grep.app (https://grep.app/), and the sources of RPM packages in Fedora Linux that depend on the python3-lxml package. I've omitted projects already mentioned above.
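
For anyone who wants to repeat the search on their own code base, a rough sketch of the kind of check involved follows. The pattern and the `uses_lxml_cleaner` helper are assumptions made for illustration, not the exact queries used for the results in this comment, and the pattern only looks at import lines.

```python
import re

# Matches the common ways of importing the cleaner from lxml.
# This is an illustrative pattern and will miss unusual import styles.
LXML_CLEAN_IMPORT = re.compile(
    r"^\s*(?:"
    r"from\s+lxml\.html\.clean\s+import\s+.*\b(?:Cleaner|clean_html)\b"
    r"|from\s+lxml\.html\s+import\s+.*\bclean\b"
    r"|import\s+lxml\.html\.clean\b"
    r")",
    re.MULTILINE,
)


def uses_lxml_cleaner(source: str) -> bool:
    """Return True if the Python source text imports lxml's HTML cleaner."""
    return LXML_CLEAN_IMPORT.search(source) is not None
```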

In Fedora Linux, I found only two occurrences:

* python-readability-lxml
* calibre (this package bundles its own version of readability)

In the top 5000 PyPI projects, I found:

* requests-html: https://github.com/kennethreitz/requests-html/blob/master/requests_html.py#L30

And via grep.app, where support for regexes is limited:

* https://github.com/ysim/songtext/blob/master/libsongtext/lyricwiki.py
* https://github.com/lorien/weblib/blob/master/weblib/feed.py
* https://github.com/ColdHeat/pybluemonday/blob/master/benchmarks.py (just some benchmark)
* https://github.com/nopper/twittomatic/blob/master/helpers/hadoop/wikipedia/words/cleanup.py
* https://github.com/python-gsoc/python-blogs/blob/master/aldryn_newsblog/utils/utilities.py
* https://github.com/Linbreux/wikmd/blob/main/wiki.py
* https://github.com/divio/aldryn-search/blob/master/aldryn_search/utils.py
* https://github.com/PacktPublishing/PythonDataAnalysisCookbook/blob/master/Chapter%205/processing_html.py
* https://github.com/divio/aldryn-newsblog/blob/master/aldryn_newsblog/utils/utilities.py (already archived)
* https://github.com/DMOJ/online-judge/blob/master/judge/migrations/0091_compiler_message_ansi2html.py
* https://github.com/janeczku/calibre-web/blob/master/cps/editbooks.py (handles ImportError)
* https://github.com/neuml/paperai/blob/master/examples/search.py
* https://github.com/khamidou/kite/blob/master/src/back/kite/maildir.py (already archived)
* https://github.com/NikolaiT/GoogleScraper/blob/master/GoogleScraper/caching.py
* https://github.com/kootenpv/sky/blob/master/sky/standalone/monitorPage.py
* https://github.com/anyant/rssant/blob/master/rssant_feedlib/processor.py

As next steps, we can:

* Add nh3 to the documentation as an alternative, and add a deprecation warning to Cleaner and clean_html.
* Open issues for identified projects. (I can do that)
* Finish the lxml-html-clean package (https://github.com/hrnciar/lxml-html-clean) (We can do it together with Tomáš)
* Add the new package as an extra dependency for lxml so one can install `lxml[clean_html]` and get both lxml and lxml-html-clean installed.
* Remove the deprecated part and make a new release.
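
The deprecation warning step could look something like the sketch below. It uses only the standard `warnings` module. The message wording and the `_warn_deprecated` helper are placeholders I made up, and the function body stands in for the existing cleaning logic, which would stay unchanged.

```python
import warnings


def _warn_deprecated(name: str) -> None:
    # stacklevel=3 points the warning at the code that called the
    # deprecated function, not at this helper or the function itself.
    warnings.warn(
        f"{name} is deprecated; consider the separate lxml-html-clean "
        "package or an alternative such as nh3.",
        DeprecationWarning,
        stacklevel=3,
    )


def clean_html(html):
    _warn_deprecated("clean_html")
    # Placeholder: the existing cleaning logic would run here unchanged.
    return html
```

Keep in mind that Python hides most DeprecationWarning messages by default, so many users would only see this when running tests or with `-W default`.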

What do you think about this plan?