Opened 12 years ago
Last modified 8 years ago
#542 new defect
Genshi UnicodeDecodeError due to non-ascii characters in element attribute entity replacement
Reported by: | jholg@… | Owned by: | cmlenz |
---|---|---|---|
Priority: | major | Milestone: | 0.9 |
Component: | General | Version: | 0.6 |
Keywords: | Cc: |
Description
I ran into a situation where using the genshi.input.HTMLParser fails due to an (X)HTML page that contains non-ascii characters in element tag attributes which also happen to contain character entities.
A minimal example HTML like
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
        <head>
            <title>Test Genshi UnicodeDecodeError</title>
        </head>
        <body>
            <h1>Heäder <a href="http://genshi.edgewall.org"
                    title="&lt;strong&gt;Germän Ümläüts äre nö fün&lt;/strong&gt;"> Germän Ümläüts äre
                    nö fün </a></h1>
        </body>
    </html>
will produce this error:
    >>> list(iter(HTMLParser(urllib2.urlopen("file:///data/tmp/genshi_unicodedecodeerror.xhtml"), encoding='utf-8').parse()))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/export/python/virtualenvs/trac/lib/python2.7/site-packages/genshi/core.py", line 272, in _ensure
        event = stream.next()
      File "/export/python/virtualenvs/trac/lib/python2.7/site-packages/genshi/input.py", line 432, in _coalesce
        for kind, data, pos in chain(stream, [(None, None, None)]):
      File "/export/python/virtualenvs/trac/lib/python2.7/site-packages/genshi/input.py", line 327, in _generate
        self.feed(data)
      File "/usr/lib64/python2.7/HTMLParser.py", line 114, in feed
        self.goahead(0)
      File "/usr/lib64/python2.7/HTMLParser.py", line 158, in goahead
        k = self.parse_starttag(i)
      File "/usr/lib64/python2.7/HTMLParser.py", line 305, in parse_starttag
        attrvalue = self.unescape(attrvalue)
      File "/usr/lib64/python2.7/HTMLParser.py", line 472, in unescape
        return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
      File "/export/python/virtualenvs/trac/lib64/python2.7/re.py", line 151, in sub
        return _compile(pattern, flags).sub(repl, string, count)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
The reason for this seems to be that the data read from the file gets fed to the base Python HTMLParser's feed() method as an encoded string, while the docs recommend feeding it with unicode.
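A rough illustration of that failure mode, not taken from the Genshi or stdlib sources: on Python 2, combining an encoded byte string with a unicode string triggers an implicit ASCII decode of the bytes. The sketch below spells that decode out explicitly so it also runs on Python 3. The variable names are made up for the example, and the fragment "Germän" is only assumed to be the piece that trips the decoder; that assumption is at least consistent with the "byte 0xc3 in position 4" in the traceback.

```python
# Bytes as they would arrive from the UTF-8 encoded file:
# "Germän", with the "ä" stored as the two bytes 0xC3 0xA4.
encoded_fragment = u"Germ\xe4n".encode("utf-8")

# Python 2 runs this ASCII decode implicitly when a byte string is
# combined with a unicode string; here it is written out explicitly.
try:
    encoded_fragment.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc  # the first non-ASCII byte is 0xC3, at index 4

# Decoding with the document's real encoding succeeds.
decoded_fragment = encoded_fragment.decode("utf-8")
```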
There are several places where a string decode() is performed in the event handler methods of genshi.input.HTMLParser so fixing this will probably mean a slight redesign of where & when decoding/encoding takes place.
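As a minimal sketch of the "decode once, up front" idea, assuming nothing about how Genshi actually restructured its parser: the input bytes are decoded before they reach the stdlib parser, so the handlers only ever see text. `AttrCollector` and `parse_bytes` are hypothetical names for this example, and the import uses the Python 3 module name (on Python 2.7 the class lives in the `HTMLParser` module).

```python
from html.parser import HTMLParser  # `from HTMLParser import HTMLParser` on Python 2


class AttrCollector(HTMLParser):
    """Hypothetical helper: records (tag, attrs) for every start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs with entities
        # in the values already replaced by the base parser.
        self.seen.append((tag, dict(attrs)))


def parse_bytes(raw, encoding="utf-8"):
    """Decode the raw bytes once, then feed text to the parser."""
    parser = AttrCollector()
    parser.feed(raw.decode(encoding))
    parser.close()
    return parser.seen


# An attribute that mixes non-ASCII characters with character entities.
raw = u'<a title="&lt;b&gt;Germ\xe4n&lt;/b&gt;">x</a>'.encode("utf-8")
result = parse_bytes(raw)
```

With the decode done at the boundary, the event handlers would no longer need their own `decode()` calls, which is the kind of redesign the paragraph above alludes to.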
I'm currently running 0.6. From looking at the trunk sources, this may have been fixed there; unfortunately I can't test this, as the proxy I'm behind denies svn checkout.
In the 0.6.1 tag directory this doesn't look fixed, though.
Attachments (1)
Change History (5)
comment:1 follow-up: ↓ 2 Changed 12 years ago by cboos
Try the "Zip Archive" link at the bottom of the trunk page.
comment:2 in reply to: ↑ 1 Changed 12 years ago by jholg@…
Replying to cboos:
Try the "Zip Archive" link at the bottom of the trunk page.
Thanks for the hint, I overlooked that (for years, obviously). Just did so and can confirm that trunk (r1236) does not seem to suffer from this problem, indeed:
    >>> list(iter(HTMLParser(urllib2.urlopen("file:///data/tmp/genshi_unicodedecodeerror.xhtml"), encoding='utf-8').parse()))
    [('PI', (u'xml', u'version="1.0" encoding="UTF-8"'), (None, 1, 0)),
     ('TEXT', u'\n\n', (None, 1, 38)),
     ('START', (QName('html'), Attrs([(QName('xmlns'), u'http://www.w3.org/1999/xhtml')])), (None, 4, 0)),
     ('TEXT', u'\n ', (None, 4, 43)),
     ('START', (QName('head'), Attrs()), (None, 5, 4)),
     ('TEXT', u'\n ', (None, 5, 10)),
     ('START', (QName('title'), Attrs()), (None, 6, 8)),
     ('TEXT', u'Test Genshi UnicodeDecodeError', (None, 6, 15)),
     ('END', QName('title'), (None, 6, 45)),
     ('TEXT', u'\n ', (None, 6, 53)),
     ('END', QName('head'), (None, 7, 4)),
     ('TEXT', u'\n ', (None, 7, 11)),
     ('START', (QName('body'), Attrs()), (None, 8, 4)),
     ('TEXT', u'\n ', (None, 8, 10)),
     ('START', (QName('h1'), Attrs()), (None, 9, 8)),
     ('TEXT', u'He\xe4der ', (None, 9, 12)),
     ('START', (QName('a'), Attrs([(QName('href'), u'http://genshi.edgewall.org'), (QName('title'), u'<strong>Germ\xe4n \xdcml\xe4\xfcts \xe4re n\xf6 f\xfcn</strong>')])), (None, 9, 19)),
     ('TEXT', u' Germ\xe4n \xdcml\xe4\xfcts \xe4re\n n\xf6 f\xfcn ', (None, 10, 79)),
     ('END', QName('a'), (None, 11, 23)),
     ('END', QName('h1'), (None, 11, 27)),
     ('TEXT', u'\n ', (None, 11, 32)),
     ('END', QName('body'), (None, 12, 4)),
     ('TEXT', u'\n', (None, 12, 11)),
     ('END', QName('html'), (None, 13, 0))]
Tested as a standalone test in a virtualenv with only genshi trunk, not in combination with a Trac instance.
comment:3 Changed 12 years ago by jholg@…
Added a test html file for convenience, contents as above.
Command line test:
    0 $ /var/tmp/testenv/bin/python -c 'from genshi.input import HTMLParser; import urllib2; list(iter(HTMLParser(urllib2.urlopen("file:genshi_unicodedecodeerror.xhtml"), encoding="utf-8").parse()))'
    0 $
comment:4 Changed 8 years ago by hodgestar
- Milestone changed from 0.7 to 0.9
Moved to milestone 0.9.