Edgewall Software

Opened 11 years ago

Last modified 7 years ago

#542 new defect

Genshi UnicodeDecodeError due to non-ascii characters in element attribute entity replacement

Reported by: jholg@… Owned by: cmlenz
Priority: major Milestone: 0.9
Component: General Version: 0.6
Keywords: Cc:

Description

I ran into a case where genshi.input.HTMLParser fails on an (X)HTML page in which an element attribute value contains both non-ASCII characters and character entities.

A minimal example HTML like

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
                      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>Test Genshi UnicodeDecodeError</title>
    </head>
    <body>
        <h1>Heäder <a href="http://genshi.edgewall.org"
                title="&lt;strong&gt;Germän Ümläüts äre nö fün&lt;/strong&gt;"> Germän Ümläüts äre
                nö fün </a></h1>
    </body>
</html>

will produce this error:

>>> import urllib2
>>> from genshi.input import HTMLParser
>>> list(iter(HTMLParser(urllib2.urlopen("file:///data/tmp/genshi_unicodedecodeerror.xhtml"), encoding='utf-8').parse()))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/export/python/virtualenvs/trac/lib/python2.7/site-packages/genshi/core.py", line 272, in _ensure
    event = stream.next()
  File "/export/python/virtualenvs/trac/lib/python2.7/site-packages/genshi/input.py", line 432, in _coalesce
    for kind, data, pos in chain(stream, [(None, None, None)]):
  File "/export/python/virtualenvs/trac/lib/python2.7/site-packages/genshi/input.py", line 327, in _generate
    self.feed(data)
  File "/usr/lib64/python2.7/HTMLParser.py", line 114, in feed
    self.goahead(0)
  File "/usr/lib64/python2.7/HTMLParser.py", line 158, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib64/python2.7/HTMLParser.py", line 305, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib64/python2.7/HTMLParser.py", line 472, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/export/python/virtualenvs/trac/lib64/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

The reason for this seems to be that the data read from the file is fed to the feed() method of Python's base HTMLParser as an encoded byte string, whereas the Python docs recommend feeding it unicode.
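
The failure mode can be looked at in isolation. What follows is a minimal sketch in Python 3 (an assumption; this ticket is about Python 2.7) that uses only the standard library, not Genshi. It emulates the implicit ASCII decode that Python 2 appears to attempt when the replaced entities (unicode) are combined with the still-encoded rest of the attribute value, and it shows that decoding to text first avoids the problem. All variable names are illustrative.

```python
import html

# Attribute value as it would arrive from a UTF-8 encoded file (bytes).
raw_attr = '&lt;strong&gt;Germän&lt;/strong&gt;'.encode('utf-8')

# Decoding before entity replacement keeps everything in one text type.
text_attr = raw_attr.decode('utf-8')
unescaped = html.unescape(text_attr)

# Emulate the implicit ASCII decode of the non-entity fragment that sits
# between the replaced entities (assumed to be what Python 2 does here).
fragment = 'Germän'.encode('utf-8')
try:
    fragment.decode('ascii')
    failure = None
except UnicodeDecodeError as exc:
    # Record the offset and the offending byte value.
    failure = (exc.start, fragment[exc.start])
```

In this sketch `failure` records byte 0xc3 at offset 4, the same byte and position reported in the traceback above, while `unescaped` holds the fully decoded attribute value.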

The event handler methods of genshi.input.HTMLParser call decode() on strings in several places, so fixing this will probably mean a slight redesign of where and when decoding and encoding take place.
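
One possible arrangement, sketched here under stated assumptions, is to decode once at the input boundary and hand only text to the parser, so the handlers never see encoded bytes. The sketch uses Python 3's html.parser.HTMLParser and an incremental decoder from codecs. AttrCollector and feed_decoded are hypothetical names, and this is not meant to describe how Genshi itself does or should do it.

```python
import codecs
import io
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical handler that records every attribute it sees."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)


def feed_decoded(parser, source, encoding='utf-8', bufsize=4096):
    """Read bytes from source, decode incrementally, feed text only."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = source.read(bufsize)
        if not chunk:
            break
        # The incremental decoder buffers multi-byte sequences that are
        # split across chunk boundaries.
        parser.feed(decoder.decode(chunk))
    # Flush anything the decoder is still holding, then finish parsing.
    parser.feed(decoder.decode(b'', final=True))
    parser.close()


markup = '<a title="&lt;strong&gt;Germän Ümläüts&lt;/strong&gt;">x</a>'
collector = AttrCollector()
# A deliberately small buffer so that chunks split tags and characters.
feed_decoded(collector, io.BytesIO(markup.encode('utf-8')), bufsize=7)
```

The incremental decoder matters because a fixed-size read can end in the middle of a multi-byte UTF-8 sequence; decoding each chunk independently with bytes.decode() could fail there.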

I'm currently running 0.6. From looking at the trunk sources this may have been fixed there. Unfortunately I can't test this, as the proxy I'm behind denies svn checkout.

In the 0.6.1 tag directory this doesn't look fixed, though.

Attachments (1)

genshi_unicodedecodeerror.xhtml (525 bytes) - added by jholg@… 11 years ago.
Genshi 0.6 UnicodeDecodeError test file

Change History (5)

comment:1 follow-up: Changed 11 years ago by cboos

Try the "Zip Archive" link at the bottom of the trunk page.

comment:2 in reply to: ↑ 1 Changed 11 years ago by jholg@…

Replying to cboos:

Try the "Zip Archive" link at the bottom of the trunk page.

Thanks for the hint; I had overlooked that (for years, obviously). I just did so and can confirm that trunk (r1236) indeed does not seem to suffer from this problem:

>>> list(iter(HTMLParser(urllib2.urlopen("file:///data/tmp/genshi_unicodedecodeerror.xhtml"),
encoding='utf-8').parse()))
[('PI', (u'xml', u'version="1.0" encoding="UTF-8"'), (None, 1, 0)), ('TEXT', 
u'\n\n', (None, 1, 38)), ('START', (QName('html'), Attrs([(QName('xmlns'), 
u'http://www.w3.org/1999/xhtml')])), (None, 4, 0)), ('TEXT', u'\n    ', (None, 4, 
43)), ('START', (QName('head'), Attrs()), (None, 5, 4)), ('TEXT', u'\n        ', 
(None, 5, 10)), ('START', (QName('title'), Attrs()), (None, 6, 8)), ('TEXT', u'Test 
Genshi UnicodeDecodeError', (None, 6, 15)), ('END', QName('title'), (None, 6, 45)), 
('TEXT', u'\n    ', (None, 6, 53)), ('END', QName('head'), (None, 7, 4)), ('TEXT', 
u'\n    ', (None, 7, 11)), ('START', (QName('body'), Attrs()), (None, 8, 4)), 
('TEXT', u'\n        ', (None, 8, 10)), ('START', (QName('h1'), Attrs()), (None, 9, 
8)), ('TEXT', u'He\xe4der ', (None, 9, 12)), ('START', (QName('a'), 
Attrs([(QName('href'), u'http://genshi.edgewall.org'), (QName('title'), 
u'<strong>Germ\xe4n \xdcml\xe4\xfcts \xe4re n\xf6 f\xfcn</strong>')])), (None, 9, 
19)), ('TEXT', u' Germ\xe4n \xdcml\xe4\xfcts \xe4re\n                n\xf6 f\xfcn ', 
(None, 10, 79)), ('END', QName('a'), (None, 11, 23)), ('END', QName('h1'), (None, 
11, 27)), ('TEXT', u'\n    ', (None, 11, 32)), ('END', QName('body'), (None, 12, 
4)), ('TEXT', u'\n', (None, 12, 11)), ('END', QName('html'), (None, 13, 0))]

Tested as a standalone test in a virtualenv with only genshi trunk, not in combination with a Trac instance.

comment:3 Changed 11 years ago by jholg@…

Added a test html file for convenience, contents as above.

Command line test:

0 $ /var/tmp/testenv/bin/python -c 'from genshi.input import HTMLParser; import urllib2; list(iter(HTMLParser(urllib2.urlopen("file:genshi_unicodedecodeerror.xhtml"), encoding="utf-8").parse()))'
0 $

comment:4 Changed 7 years ago by hodgestar

  • Milestone changed from 0.7 to 0.9

Moved to milestone 0.9.
