id,summary,reporter,owner,description,type,status,priority,milestone,component,version,resolution,keywords,cc
542,Genshi UnicodeDecodeError due to non-ascii characters in element attribute entity replacement,jholg@…,cmlenz,"genshi.input.HTMLParser fails on an (X)HTML page in which an element attribute contains both non-ASCII characters and character entities.
A minimal example HTML like
{{{
Test Genshi UnicodeDecodeError
}}}
will produce this error:
{{{
>>> list(iter(HTMLParser(urllib2.urlopen(""file:///data/tmp/genshi_unicodedecodeerror.xhtml""), encoding='utf-8').parse()))
Traceback (most recent call last):
File """", line 1, in
File ""/export/python/virtualenvs/trac/lib/python2.7/site-packages/genshi/core.py"", line 272, in _ensure
event = stream.next()
File ""/export/python/virtualenvs/trac/lib/python2.7/site-packages/genshi/input.py"", line 432, in _coalesce
for kind, data, pos in chain(stream, [(None, None, None)]):
File ""/export/python/virtualenvs/trac/lib/python2.7/site-packages/genshi/input.py"", line 327, in _generate
self.feed(data)
File ""/usr/lib64/python2.7/HTMLParser.py"", line 114, in feed
self.goahead(0)
File ""/usr/lib64/python2.7/HTMLParser.py"", line 158, in goahead
k = self.parse_starttag(i)
File ""/usr/lib64/python2.7/HTMLParser.py"", line 305, in parse_starttag
attrvalue = self.unescape(attrvalue)
File ""/usr/lib64/python2.7/HTMLParser.py"", line 472, in unescape
return re.sub(r""&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));"", replaceEntities, s)
File ""/export/python/virtualenvs/trac/lib64/python2.7/re.py"", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
}}}
The reason seems to be that the data read from the file is passed to the feed() method of Python's base HTMLParser as an encoded byte string, while [http://docs.python.org/2/library/htmlparser.html#HTMLParser.HTMLParser.feed the docs recommend feeding it unicode]. Judging from the traceback, unescape() then substitutes a unicode replacement for the entity into the byte string, which presumably forces an implicit ASCII decode that fails on the first non-ASCII byte.
genshi.input.HTMLParser calls decode() on strings in several of its event handler methods, so fixing this will probably require a slight redesign of where and when decoding and encoding take place.
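As a rough sketch of what feeding unicode would look like: the snippet below uses only the standard-library parser, not Genshi, and decodes the bytes once before calling feed(). AttrCollector and the sample markup are made up for illustration and are not part of Genshi or of the failing page.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class AttrCollector(HTMLParser):
    # Hypothetical helper: records the attributes of every start tag.
    def __init__(self):
        HTMLParser.__init__(self)
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)


# UTF-8 encoded bytes: one attribute holds both a non-ASCII character
# (e with acute accent) and a character entity (&amp;).
raw = b'<a title=\'caf\xc3\xa9 &amp; bar\'>x</a>'

parser = AttrCollector()
# Decode at the input boundary so the parser only ever sees text.
parser.feed(raw.decode('utf-8'))
parser.close()
print(parser.attrs)
```

Whether Genshi can decode this early depends on how it detects the encoding, so this only illustrates the direction of a fix.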
I'm currently running 0.6. Judging from the trunk sources, this may already be fixed there; unfortunately I can't test that, because the proxy I'm behind blocks svn checkout.
It does not look fixed in the 0.6.1 tag directory, though.
",defect,new,major,0.9,General,0.6,,,