Edgewall Software

Ticket #542 (new defect)

Opened 18 months ago

Last modified 18 months ago

Genshi UnicodeDecodeError due to non-ascii characters in element attribute entity replacement

Reported by: jholg@…
Owned by: cmlenz
Priority: major
Milestone: 0.7
Component: General
Version: 0.6
Keywords:
Cc:

Description

I ran into a case where genshi.input.HTMLParser fails on an (X)HTML page whose element attributes contain both non-ASCII characters and character entities.

A minimal example HTML like

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
                      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>Test Genshi UnicodeDecodeError</title>
    </head>
    <body>
        <h1>Heäder <a href="http://genshi.edgewall.org"
                title="&lt;strong&gt;Germän Ümläüts äre nö fün&lt;/strong&gt;"> Germän Ümläüts äre
                nö fün </a></h1>
    </body>
</html>

will produce this error:

>>> list(iter(HTMLParser(urllib2.urlopen("file:///data/tmp/genshi_unicodedecodeerror.xhtml"), encoding='utf-8').parse()))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/export/python/virtualenvs/trac/lib/python2.7/site-packages/genshi/core.py", line 272, in _ensure
    event = stream.next()
  File "/export/python/virtualenvs/trac/lib/python2.7/site-packages/genshi/input.py", line 432, in _coalesce
    for kind, data, pos in chain(stream, [(None, None, None)]):
  File "/export/python/virtualenvs/trac/lib/python2.7/site-packages/genshi/input.py", line 327, in _generate
    self.feed(data)
  File "/usr/lib64/python2.7/HTMLParser.py", line 114, in feed
    self.goahead(0)
  File "/usr/lib64/python2.7/HTMLParser.py", line 158, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib64/python2.7/HTMLParser.py", line 305, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib64/python2.7/HTMLParser.py", line 472, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/export/python/virtualenvs/trac/lib64/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

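The reason seems to be that the data read from the file is passed to the feed() method of Python's base HTMLParser as an encoded byte string, while the docs recommend feeding it unicode. The entity replacement in unescape() then presumably mixes unicode replacement text with the undecoded bytes, which triggers an implicit ASCII decode.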
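To illustrate the point about feeding unicode: a minimal sketch, not Genshi's actual code, that decodes the bytes before handing them to the standard-library parser. AttrCollector and the sample markup are made up for illustration, and the import fallback is only there so the snippet also runs on Python 3.

```python
# -*- coding: utf-8 -*-
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as in the traceback above


class AttrCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with entities already replaced
        self.attrs.extend(attrs)


# UTF-8 encoded bytes, as they would arrive from a file or from urlopen()
raw = u'<a title="&lt;strong&gt;Germ\xe4n&lt;/strong&gt;">x</a>'.encode('utf-8')

parser = AttrCollector()
# Decode first, then feed text; the parser never sees raw bytes
parser.feed(raw.decode('utf-8'))
parser.close()

title = dict(parser.attrs)['title']
```

With the decode in place, the title attribute comes back as unicode text with the entities replaced, rather than raising during entity replacement.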

There are several places where a string decode() is performed in the event handler methods of genshi.input.HTMLParser, so fixing this will probably mean a slight redesign of where and when decoding/encoding takes place.
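One possible shape for such a redesign, as an assumption rather than what trunk actually does: decode once at the input boundary, so the event handlers only ever see unicode. The sketch below uses the standard library's incremental decoder so that a multi-byte character split across two reads is still decoded correctly. iter_decoded and the bufsize value are hypothetical names chosen for illustration.

```python
import codecs
import io


def iter_decoded(fileobj, encoding='utf-8', bufsize=4096):
    """Hypothetical helper: yield unicode chunks from a byte stream."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = fileobj.read(bufsize)
        if not chunk:
            # Flush the decoder; raises if the input ended mid-character
            tail = decoder.decode(b'', final=True)
            if tail:
                yield tail
            break
        # Incomplete trailing bytes are buffered until the next chunk
        text = decoder.decode(chunk)
        if text:
            yield text


# A tiny buffer size forces the two-byte UTF-8 sequences to be split
source = io.BytesIO(u'He\xe4der \xdcml\xe4\xfcts'.encode('utf-8'))
decoded = u''.join(iter_decoded(source, bufsize=3))
```

Each chunk yielded this way could be passed straight to feed(), which would make the per-handler decode() calls unnecessary.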

I'm currently running 0.6. From looking at the trunk sources, this may have been fixed there; unfortunately I can't test this, as the proxy I'm behind denies svn checkout.

In the 0.6.1 tag directory this doesn't look fixed, though.

Attachments

genshi_unicodedecodeerror.xhtml (0.5 KB) - added by jholg@… 18 months ago.
Genshi 0.6 UnicodeDecodeError test file

Change History

Changed 18 months ago by cboos

Try the "Zip Archive" link at the bottom of the trunk page.

Changed 18 months ago by jholg@…

Replying to cboos:

Try the "Zip Archive" link at the bottom of the trunk page.

Thanks for the hint; I had overlooked that (for years, obviously). I just did so and can confirm that trunk (r1236) indeed does not seem to suffer from this problem:

>>> list(iter(HTMLParser(urllib2.urlopen("file:///data/tmp/genshi_unicodedecodeerror.xhtml"),
encoding='utf-8').parse()))
[('PI', (u'xml', u'version="1.0" encoding="UTF-8"'), (None, 1, 0)), ('TEXT', 
u'\n\n', (None, 1, 38)), ('START', (QName('html'), Attrs([(QName('xmlns'), 
u'http://www.w3.org/1999/xhtml')])), (None, 4, 0)), ('TEXT', u'\n    ', (None, 4, 
43)), ('START', (QName('head'), Attrs()), (None, 5, 4)), ('TEXT', u'\n        ', 
(None, 5, 10)), ('START', (QName('title'), Attrs()), (None, 6, 8)), ('TEXT', u'Test 
Genshi UnicodeDecodeError', (None, 6, 15)), ('END', QName('title'), (None, 6, 45)), 
('TEXT', u'\n    ', (None, 6, 53)), ('END', QName('head'), (None, 7, 4)), ('TEXT', 
u'\n    ', (None, 7, 11)), ('START', (QName('body'), Attrs()), (None, 8, 4)), 
('TEXT', u'\n        ', (None, 8, 10)), ('START', (QName('h1'), Attrs()), (None, 9, 
8)), ('TEXT', u'He\xe4der ', (None, 9, 12)), ('START', (QName('a'), 
Attrs([(QName('href'), u'http://genshi.edgewall.org'), (QName('title'), 
u'<strong>Germ\xe4n \xdcml\xe4\xfcts \xe4re n\xf6 f\xfcn</strong>')])), (None, 9, 
19)), ('TEXT', u' Germ\xe4n \xdcml\xe4\xfcts \xe4re\n                n\xf6 f\xfcn ', 
(None, 10, 79)), ('END', QName('a'), (None, 11, 23)), ('END', QName('h1'), (None, 
11, 27)), ('TEXT', u'\n    ', (None, 11, 32)), ('END', QName('body'), (None, 12, 
4)), ('TEXT', u'\n', (None, 12, 11)), ('END', QName('html'), (None, 13, 0))]

Tested as a standalone test in a virtualenv with only genshi trunk, not in combination with a Trac instance.

Changed 18 months ago by jholg@…

Added a test HTML file for convenience, contents as above.

Command line test:

0 $ /var/tmp/testenv/bin/python -c 'from genshi.input import HTMLParser; import urllib2; list(iter(HTMLParser(urllib2.urlopen("file:genshi_unicodedecodeerror.xhtml"), encoding="utf-8").parse()))'
0 $
