Change html5 formatter to escape only ambiguous ampersands
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Confirmed | Undecided | Unassigned |
Bug Description
```
In [1]: import bs4
In [2]: bs4.__version__
Out[2]: '4.9.3'
In [3]: st = bs4.BeautifulSo
In [4]: st
Out[4]: <html><body><a href="https:/
```
It's impossible to create an `<a>` tag whose `href` attribute has a query string with multiple parameters without bs4 escaping the ampersands in the query string, thereby breaking the link.
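A minimal sketch of the behaviour, since the transcript above is cut off. The URL here is a hypothetical stand-in (not the one from my session), and I'm assuming the stock `html.parser` builder:

```python
from bs4 import BeautifulSoup

# Hypothetical URL with two query parameters (an assumption for illustration).
url = "https://example.com/search?a=1&b=2"

soup = BeautifulSoup("<p></p>", "html.parser")
link = soup.new_tag("a", href=url)
link.string = "link"

# The attribute value is stored unescaped in the tree...
stored = link["href"]

# ...but serialising with the default formatter escapes the ampersand.
rendered = str(link)

print(stored)
print(rendered)
```

So the value in the tree is intact; the `&` only turns into `&amp;` at output time.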
I think in general, automatically trying to escape any tag attributes is a deeply problematic behaviour. Escaping tag *contents* makes sense, but the attributes don't follow the same ruleset.
As a horrible hack, going into element.py and special-casing `href` attributes makes things work, though it's probably super buggy. I changed:
```
for key, val in attributes:
    if val is None:
        decoded = key
    else:
        if isinstance(val, list) or isinstance(val, tuple):
            val = ' '.join(val)
        elif not isinstance(val, str):
            val = str(val)
        elif (
            isinstance(val, AttributeValueWithCharsetSubstitution)
            and eventual_encoding is not None
        ):
            val = val.encode(eventual_encoding)
        text = formatter.attribute_value(val)
        decoded = (
            str(key) + '='
            + formatter.quoted_attribute_value(text))
    attrs.append(decoded)
close = ''
closeTag = ''
```
to
```
for key, val in attributes:
    if val is None:
        decoded = key
    else:
        if isinstance(val, list) or isinstance(val, tuple):
            val = ' '.join(val)
        elif not isinstance(val, str):
            val = str(val)
        elif (
            isinstance(val, AttributeValueWithCharsetSubstitution)
            and eventual_encoding is not None
        ):
            val = val.encode(eventual_encoding)
        # Hack: emit href values verbatim, skipping entity substitution.
        if key == 'href':
            text = str(val)
        else:
            text = formatter.attribute_value(val)
        decoded = (
            str(key) + '='
            + formatter.quoted_attribute_value(text))
    attrs.append(decoded)
close = ''
closeTag = ''
```
The core of the issue appears to be that the "minimal" `EntitySubstitution` formatter still replaces "&".
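To illustrate, here is a small sketch under the assumption that the "minimal" formatter is backed by `EntitySubstitution.substitute_xml` (that's my reading of the source, not something I've confirmed with the maintainers). The query string and URL are made up for the example. It also shows that passing `formatter=None` to `decode()` turns substitution off entirely:

```python
from bs4 import BeautifulSoup
from bs4.dammit import EntitySubstitution

# Hypothetical query string with two parameters.
value = "a=1&b=2"

# The substitution I believe the "minimal" formatter uses: it replaces
# "&", "<" and ">" unconditionally, including ampersands in URLs.
escaped = EntitySubstitution.substitute_xml(value)
print(escaped)

# Parsing decodes "&amp;" in the attribute back to a bare "&".
soup = BeautifulSoup(
    '<a href="https://example.com/?a=1&amp;b=2">x</a>', "html.parser"
)

# formatter=None disables entity substitution on output, for text
# nodes and attribute values alike.
raw = soup.a.decode(formatter=None)
print(raw)
```

`formatter=None` is a blunt instrument, though: it also stops escaping `<` and `&` in text content, so it isn't a real substitute for a formatter that treats attribute values differently.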
I have no idea what's "correct" from a spec perspective here, but I can say that BS4 seems unable to generate the valid HTML output I need, and I don't *think* having multi-parameter query strings in an anchor tag is invalid in any variant of HTML.