Ticket #542 (new defect)
Genshi UnicodeDecodeError due to non-ascii characters in element attribute entity replacement
| Reported by: | jholg@… | Owned by: | cmlenz |
|---|---|---|---|
| Priority: | major | Milestone: | 0.7 |
| Component: | General | Version: | 0.6 |
| Keywords: | | Cc: | |
Description
I ran into a situation where genshi.input.HTMLParser fails on an (X)HTML page in which an element attribute contains both non-ASCII characters and character entities.
A minimal example HTML like
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Test Genshi UnicodeDecodeError</title>
</head>
<body>
<h1>Heäder <a href="http://genshi.edgewall.org"
title="&lt;strong&gt;Germän Ümläüts äre nö fün&lt;/strong&gt;"> Germän Ümläüts äre
nö fün </a></h1>
</body>
</html>
will produce this error:
>>> list(iter(HTMLParser(urllib2.urlopen("file:///data/tmp/genshi_unicodedecodeerror.xhtml"), encoding='utf-8').parse()))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/export/python/virtualenvs/trac/lib/python2.7/site-packages/genshi/core.py", line 272, in _ensure
event = stream.next()
File "/export/python/virtualenvs/trac/lib/python2.7/site-packages/genshi/input.py", line 432, in _coalesce
for kind, data, pos in chain(stream, [(None, None, None)]):
File "/export/python/virtualenvs/trac/lib/python2.7/site-packages/genshi/input.py", line 327, in _generate
self.feed(data)
File "/usr/lib64/python2.7/HTMLParser.py", line 114, in feed
self.goahead(0)
File "/usr/lib64/python2.7/HTMLParser.py", line 158, in goahead
k = self.parse_starttag(i)
File "/usr/lib64/python2.7/HTMLParser.py", line 305, in parse_starttag
attrvalue = self.unescape(attrvalue)
File "/usr/lib64/python2.7/HTMLParser.py", line 472, in unescape
return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
File "/export/python/virtualenvs/trac/lib64/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
The reason for this seems to be that the data read from the file gets fed to the base Python HTMLParser's feed() method as an encoded string, while the docs recommend feeding it with unicode.
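To illustrate the idea, here is a minimal sketch that decodes the raw bytes before they reach the base parser's feed(). It uses only the standard library parser, not Genshi, so it is not a proposed patch. The AttrCollector class and the parse_attrs() helper are hypothetical names made up for this example, and the sample markup is a shortened variant of the page above.

```python
import io

try:                                    # Python 2, as in the traceback above
    from HTMLParser import HTMLParser
except ImportError:                     # Python 3
    from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)


def parse_attrs(fileobj, encoding='utf-8'):
    """Decode the raw bytes first, then feed text (not bytes) to the parser."""
    parser = AttrCollector()
    parser.feed(fileobj.read().decode(encoding))
    parser.close()
    return parser.attrs


# Shortened variant of the failing page: an attribute value that mixes
# character entities with a non-ASCII character, stored as UTF-8 bytes.
raw = (u'<a href="http://genshi.edgewall.org" '
       u'title="&lt;strong&gt;Germ\xe4n&lt;/strong&gt;">x</a>').encode('utf-8')
attrs = dict(parse_attrs(io.BytesIO(raw)))
```

With the input decoded up front, the entity replacement in the attribute value only ever combines text with text, so the implicit ASCII decode that raises the error above should no longer be triggered.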
There are several places where a string decode() is performed in the event handler methods of genshi.input.HTMLParser so fixing this will probably mean a slight redesign of where & when decoding/encoding takes place.
I'm currently running 0.6. From looking at the trunk sources this may have been fixed there; unfortunately I can't test this, as the proxy I'm behind denies svn checkout.
In the 0.6.1 tag directory this doesn't look fixed, though.

