Python2.5 Unicode-bug when using sgmllib.py: UnicodeDecodeError
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Python | Fix Committed | Unknown | |
python2.5 (Ubuntu) | Fix Released | Medium | Unassigned |
python2.6 (Ubuntu) | Fix Released | Medium | Unassigned |
Bug Description
I had to select "I don't know" for the package, because the form couldn't find python2.5. Strange.
The bug is described here:
http://
John Nagle has explained and solved the bug:
Found the problem. In sgmllib.py for Python 2.5, in convert_charref, the
code for handling character escapes assumes that ASCII characters have
values up to 255.
But the correct limit is 127, of course.
If a Unicode string is run through SGMLParser, and that string has a
character in an attribute with a value between 128 and 255, which is valid
in Unicode, the value is passed through as a character with "chr", creating a
one-character invalid ASCII string.
Then, when the bad string is later converted to Unicode as the output is
assembled, the UnicodeDecodeError exception is raised.
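The failure mode described above can be reproduced in miniature. The sketch below does not use sgmllib (which is not available in Python 3); it only models, under that assumption, how Python 2 behaved: `chr(n)` returned a one-byte string, and mixing a byte string with a Unicode string triggered an implicit ASCII decode. The helper names `py2_chr` and `join_with_unicode` are hypothetical and exist only for this illustration.

```python
def py2_chr(n):
    """Model of Python 2's chr(): a one-byte string for 0 <= n <= 255."""
    return bytes([n])


def join_with_unicode(text, raw):
    """Model of Python 2's implicit coercion: the byte string is decoded
    as ASCII before being joined to the Unicode text."""
    return text + raw.decode("ascii")


# A value below 128 decodes cleanly.
ok = join_with_unicode("caf", py2_chr(101))

# A value between 128 and 255 is a valid byte but not valid ASCII,
# so the decode step raises UnicodeDecodeError.
try:
    join_with_unicode("caf", py2_chr(233))
    failed = False
except UnicodeDecodeError:
    failed = True
```

In this model the error does not appear where the byte is created, only later when the output is assembled, which matches the behaviour reported here.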
So the fix is to change 255 to 127 in convert_charref in sgmllib.py,
as shown below. This forces characters above 127 to be expressed with
escape sequences. Please patch accordingly. Thanks.
def convert_
    """Convert character reference, may be overridden."""
    try:
        n = int(name)
    except ValueError:
        return
    if not 0 <= n <= 127:  # ASCII ends at 127, not 255
        return
    return self.convert_
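As a rough illustration of what the changed bound does, here is a standalone sketch written for Python 3. It is not the sgmllib code: `convert_charref_sketch` and `resolve_reference` are hypothetical names, and the fallback of re-emitting the escape sequence is an assumption based on the statement above that characters above 127 are then "expressed with escape sequences".

```python
ASCII_LIMIT = 127  # corrected bound from the report (previously 255)


def convert_charref_sketch(name, limit=ASCII_LIMIT):
    """Return the character for a decimal reference, or None if it is
    not a number or falls outside 0..limit."""
    try:
        n = int(name)
    except ValueError:
        return None
    if not 0 <= n <= limit:
        return None
    return chr(n)


def resolve_reference(name):
    """Hypothetical caller: keep the escape sequence when the
    conversion declines to produce a character."""
    converted = convert_charref_sketch(name)
    if converted is None:
        return "&#%s;" % name
    return converted
```

With the limit at 127, `resolve_reference("65")` yields a plain character while `resolve_reference("233")` leaves the reference as `&#233;`, so no non-ASCII byte is ever produced at this step.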
Changed in python2.5:
importance: Undecided → Medium
status: New → Triaged
Changed in python:
status: Unknown → New
Changed in python:
status: New → Fix Committed
Changed in python2.6 (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
status: Triaged → In Progress
Changed in python2.5 (Ubuntu):
status: Triaged → In Progress
On Wednesday 18 June 2008, Flemming Bjerke wrote:
> Public bug reported:
>
> I HAD TO CHECK "I don't know" the package. It couldn't find python2.5.
> Strange.
The problem turned up after upgrading to Python 2.5. The HTML parser module
BeautifulSoup (not included in Python 2.5, but reliant on sgmllib.py)
stopped working.
--
Flemming Bjerke