I'm looking into this now and have some more info.
Bleach might be deprecated, but its developers have promised that it will still receive security fixes and stay compatible with new Python versions, so it should still be safe to use.
Another alternative I see among projects moving away from bleach/lxml is nh3, a Python binding for the ammonia project, which is written in Rust. Ammonia seems to be security-focused, whitelist-based, and fast. nh3 uses the latest pyo3 version (compatible with Python 3.12) and provides wheels built for the stable ABI, so it should work for everybody. It also seems straightforward to port code from bleach to nh3, and nh3 is also much faster.
https://github.com/rust-ammonia/ammonia
https://github.com/messense/nh3
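To give an idea of what such a port could look like, here is a minimal sketch. The allow-lists and the `sanitize` helper are made up for illustration; only `nh3.clean` with its `tags` and `attributes` arguments comes from nh3 itself. The output is not guaranteed to be identical to what bleach produced for the same input, so a real port would need to be checked against the project's own test data.

```python
try:
    import nh3  # pip install nh3
except ImportError:  # keeps this sketch importable where nh3 is missing
    nh3 = None

# Hypothetical allow-lists; a real project would carry over the tags and
# attributes it previously passed to bleach.clean().
ALLOWED_TAGS = {"a", "b", "em", "p"}
ALLOWED_ATTRIBUTES = {"a": {"href", "title"}}


def sanitize(html: str) -> str:
    """Clean untrusted HTML with nh3, roughly where bleach.clean() was used."""
    if nh3 is None:
        raise RuntimeError("nh3 is not installed")
    # nh3 expects a set of tag names and a dict mapping tag -> set of attributes.
    return nh3.clean(html, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRIBUTES)
```

Calling `sanitize('<p onclick="x()">hi <script>alert(1)</script></p>')` should drop both the `onclick` attribute and the script element while keeping the paragraph text.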
I've also looked at projects using clean_html or Cleaner from lxml. I searched the top 5000 PyPI projects and grep.app (https://grep.app/), and also looked into the sources of RPM packages in Fedora Linux which depend on the python3-lxml package. I've omitted projects already mentioned above.
In Fedora Linux, I found only two occurrences:
* python-readability-lxml
* calibre (this package bundles its own version of readability)
In the top 5000 PyPI projects, I found:
* requests-html: https://github.com/kennethreitz/requests-html/blob/master/requests_html.py#L30
And via grep.app, where support for regexes is limited:
* https://github.com/ysim/songtext/blob/master/libsongtext/lyricwiki.py
* https://github.com/lorien/weblib/blob/master/weblib/feed.py
* https://github.com/ColdHeat/pybluemonday/blob/master/benchmarks.py (just some benchmark)
* https://github.com/nopper/twittomatic/blob/master/helpers/hadoop/wikipedia/words/cleanup.py
* https://github.com/python-gsoc/python-blogs/blob/master/aldryn_newsblog/utils/utilities.py
* https://github.com/Linbreux/wikmd/blob/main/wiki.py
* https://github.com/divio/aldryn-search/blob/master/aldryn_search/utils.py
* https://github.com/PacktPublishing/PythonDataAnalysisCookbook/blob/master/Chapter%205/processing_html.py
* https://github.com/divio/aldryn-newsblog/blob/master/aldryn_newsblog/utils/utilities.py (already archived)
* https://github.com/DMOJ/online-judge/blob/master/judge/migrations/0091_compiler_message_ansi2html.py
* https://github.com/janeczku/calibre-web/blob/master/cps/editbooks.py (handles ImportError)
* https://github.com/neuml/paperai/blob/master/examples/search.py
* https://github.com/khamidou/kite/blob/master/src/back/kite/maildir.py (already archived)
* https://github.com/NikolaiT/GoogleScraper/blob/master/GoogleScraper/caching.py
* https://github.com/kootenpv/sky/blob/master/sky/standalone/monitorPage.py
* https://github.com/anyant/rssant/blob/master/rssant_feedlib/processor.py
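For anyone who wants to repeat this kind of search on a local checkout or an unpacked source package, something like the following could work. It is only a rough sketch and not the exact method I used: the regular expression and the `find_cleaner_imports` helper are made up for illustration, and they only catch the common import styles, so indirect uses such as `lxml.html.clean.Cleaner(...)` after a plain `import lxml.html` will be missed.

```python
import re
from pathlib import Path

# Common ways of importing the cleaner module or its contents from lxml.
PATTERN = re.compile(
    r"\s*(?:from\s+lxml\.html\.clean\s+import\b"
    r"|import\s+lxml\.html\.clean\b"
    r"|from\s+lxml\.html\s+import\s+.*\bclean\b)"
)


def find_cleaner_imports(root):
    """Return (path, line number, stripped line) for each matching import."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        except OSError:
            continue  # unreadable file, skip it
        for number, line in enumerate(lines, start=1):
            if PATTERN.match(line):
                hits.append((path, number, line.strip()))
    return hits


if __name__ == "__main__":
    for path, number, line in find_cleaner_imports("."):
        print(f"{path}:{number}: {line}")
```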
As next steps, we can:
* Add nh3 as an alternative to the documentation and a deprecation warning to Cleaner and clean_html.
* Open issues for identified projects. (I can do that)
* Finish the lxml-html-clean package (https://github.com/hrnciar/lxml-html-clean) (We can do it together with Tomáš)
* Add the new package as an extra dependency for lxml so one can install `lxml[clean_html]` and get both lxml and lxml-html-clean installed.
* Remove the deprecated part and make a new release.
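For the deprecation warning in the first step, one possible shape is sketched below. This is not how lxml does or will implement it: the `deprecated` decorator, the message text, and the stand-in `clean_html` body are all made up for illustration, and the stand-in does not sanitize anything.

```python
import functools
import warnings


def deprecated(message):
    """Hypothetical helper: emit a DeprecationWarning on every call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated: {message}",
                DeprecationWarning,
                stacklevel=2,  # point the warning at the caller's line
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated("use the lxml-html-clean package or nh3 instead")
def clean_html(html):
    # Stand-in body for illustration only; it does NOT sanitize the input.
    return html
```

With `stacklevel=2` the warning is reported at the line that called `clean_html`, which should make it easier for downstream projects to find the code they need to change.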
What do you think about this plan?