id,summary,reporter,owner,description,type,status,priority,milestone,component,version,resolution,keywords,cc
542,Genshi UnicodeDecodeError due to non-ascii characters in element attribute entity replacement,jholg@…,cmlenz,"I ran into a situation where using the genshi.input.HTMLParser fails due to an (X)HTML page that contains non-ascii characters in element tag attributes which also happen to contain character entities. A minimal example HTML like {{{ Test Genshi UnicodeDecodeError

Heäder Germän Ümläüts äre nö fün

}}}
will produce this error:
{{{
>>> list(iter(HTMLParser(urllib2.urlopen(""file:///data/tmp/genshi_unicodedecodeerror.xhtml""), encoding='utf-8').parse()))
Traceback (most recent call last):
  File ""<stdin>"", line 1, in <module>
  File ""/export/python/virtualenvs/trac/lib/python2.7/site-packages/genshi/core.py"", line 272, in _ensure
    event = stream.next()
  File ""/export/python/virtualenvs/trac/lib/python2.7/site-packages/genshi/input.py"", line 432, in _coalesce
    for kind, data, pos in chain(stream, [(None, None, None)]):
  File ""/export/python/virtualenvs/trac/lib/python2.7/site-packages/genshi/input.py"", line 327, in _generate
    self.feed(data)
  File ""/usr/lib64/python2.7/HTMLParser.py"", line 114, in feed
    self.goahead(0)
  File ""/usr/lib64/python2.7/HTMLParser.py"", line 158, in goahead
    k = self.parse_starttag(i)
  File ""/usr/lib64/python2.7/HTMLParser.py"", line 305, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File ""/usr/lib64/python2.7/HTMLParser.py"", line 472, in unescape
    return re.sub(r""&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));"", replaceEntities, s)
  File ""/export/python/virtualenvs/trac/lib64/python2.7/re.py"", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
}}}
The reason for this seems to be that the data read from the file is fed to the base Python HTMLParser's feed() method as an encoded byte string, while [http://docs.python.org/2/library/htmlparser.html#HTMLParser.HTMLParser.feed the docs recommend feeding it with unicode]. genshi.input.HTMLParser performs a string decode() in several of its event handler methods, so fixing this will probably mean a slight redesign of where and when decoding/encoding takes place. I'm currently running 0.6. From looking at the trunk sources this may have been fixed there; unfortunately I can't test this, as the proxy I'm behind denies svn checkout.
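The decode-first approach that the Python docs point to can be sketched roughly as follows. This is only an illustration against the standard-library parser, not Genshi's actual code or its eventual fix; AttrCollector and parse_attrs are hypothetical names introduced here, and the sample markup is made up to combine non-ascii characters with a character entity in one attribute value.

```python
# Sketch: decode the raw bytes to text *before* handing them to the
# standard-library HTMLParser, so that entity replacement in attribute
# values never mixes encoded byte strings with unicode.
try:
    from HTMLParser import HTMLParser  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class AttrCollector(HTMLParser):
    # Hypothetical helper (not part of Genshi): records start-tag attributes.
    def __init__(self):
        HTMLParser.__init__(self)
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)


def parse_attrs(raw_bytes, encoding='utf-8'):
    # Hypothetical helper: decode once, up front, then feed text only.
    parser = AttrCollector()
    parser.feed(raw_bytes.decode(encoding))
    parser.close()
    return parser.attrs


# Non-ascii characters plus a character entity inside one attribute value,
# encoded to bytes the way they would arrive from a file or socket.
sample = u'<p title=\'He\xe4der &amp; \xdcml\xe4\xfcts\'>text</p>'.encode('utf-8')
attrs = parse_attrs(sample)
```

With the input decoded up front, the handler methods receive text throughout, so any per-handler decode() calls would have to be dropped rather than applied a second time.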
In the 0.6.1 tag directory this doesn't look fixed, though. ",defect,new,major,0.9,General,0.6,,,