﻿id,summary,reporter,owner,description,type,status,priority,milestone,component,version,resolution,keywords,cc
542,Genshi UnicodeDecodeError due to non-ascii characters in element attribute entity replacement,jholg@…,cmlenz,"I ran into a case where genshi.input.HTMLParser fails on an (X)HTML page whose element attributes contain non-ASCII characters alongside character entities.

A minimal example document like

{{{
<?xml version=""1.0"" encoding=""UTF-8""?>
<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Strict//EN""
                      ""http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"">
<html xmlns=""http://www.w3.org/1999/xhtml"">
    <head>
        <title>Test Genshi UnicodeDecodeError</title>
    </head>
    <body>
        <h1>Heäder <a href=""http://genshi.edgewall.org""
                title=""&lt;strong&gt;Germän Ümläüts äre nö fün&lt;/strong&gt;""> Germän Ümläüts äre
                nö fün </a></h1>
    </body>
</html>
}}}

will produce this error:

{{{
>>> list(iter(HTMLParser(urllib2.urlopen(""file:///data/tmp/genshi_unicodedecodeerror.xhtml""), encoding='utf-8').parse()))
Traceback (most recent call last):
  File ""<stdin>"", line 1, in <module>
  File ""/export/python/virtualenvs/trac/lib/python2.7/site-packages/genshi/core.py"", line 272, in _ensure
    event = stream.next()
  File ""/export/python/virtualenvs/trac/lib/python2.7/site-packages/genshi/input.py"", line 432, in _coalesce
    for kind, data, pos in chain(stream, [(None, None, None)]):
  File ""/export/python/virtualenvs/trac/lib/python2.7/site-packages/genshi/input.py"", line 327, in _generate
    self.feed(data)
  File ""/usr/lib64/python2.7/HTMLParser.py"", line 114, in feed
    self.goahead(0)
  File ""/usr/lib64/python2.7/HTMLParser.py"", line 158, in goahead
    k = self.parse_starttag(i)
  File ""/usr/lib64/python2.7/HTMLParser.py"", line 305, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File ""/usr/lib64/python2.7/HTMLParser.py"", line 472, in unescape
    return re.sub(r""&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));"", replaceEntities, s)
  File ""/export/python/virtualenvs/trac/lib64/python2.7/re.py"", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

}}}

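The reason seems to be that the data read from the file is fed to the feed() method of Python's base HTMLParser as an encoded byte string, while [http://docs.python.org/2/library/htmlparser.html#HTMLParser.HTMLParser.feed the docs recommend feeding it unicode]. The entity replacement in unescape() then apparently mixes unicode replacement characters into that byte string, which triggers an implicit ASCII decode that fails on the first non-ASCII byte (0xc3).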

Several event handler methods of genshi.input.HTMLParser perform a string decode(), so fixing this will probably mean a slight redesign of where and when decoding/encoding takes place.
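
A minimal sketch of the decode-first approach, assuming UTF-8 input. AttrCollector and parse_attrs are hypothetical names used only for illustration and are not part of Genshi. The raw bytes are decoded once before feed() is called, so entity replacement only ever sees unicode:

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as in the traceback above


class AttrCollector(HTMLParser):
    # Hypothetical helper: records every attribute seen on a start tag.
    def __init__(self):
        HTMLParser.__init__(self)
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)


def parse_attrs(raw_bytes, encoding='utf-8'):
    # Decode first, then feed: the base parser never sees encoded bytes.
    parser = AttrCollector()
    parser.feed(raw_bytes.decode(encoding))
    parser.close()
    return parser.attrs


# Attribute value with both entities and a non-ASCII character (a-umlaut).
raw = u'<a title=\'&lt;strong&gt;Germ\xe4n&lt;/strong&gt;\'>x</a>'.encode('utf-8')
attrs = parse_attrs(raw)
```

In Genshi itself the decode would presumably have to move to the point where _generate reads the data and calls feed(), with the decode() calls in the handlers adjusted to match. That part is not shown here.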

I'm currently running 0.6. From looking at the trunk sources, this may already be fixed there. Unfortunately I can't test that, as the proxy I'm behind denies svn checkout.

In the 0.6.1 tag directory this doesn't look fixed, though.



",defect,new,major,0.9,General,0.6,,,
