Edgewall Software

Opened 4 years ago

Last modified 6 days ago

#542 new defect

Genshi UnicodeDecodeError due to non-ascii characters in element attribute entity replacement

Reported by: jholg@…
Owned by: cmlenz
Priority: major
Milestone: 0.9
Component: General
Version: 0.6
Keywords:
Cc:


I ran into a situation where using the genshi.input.HTMLParser fails due to an (X)HTML page that contains non-ascii characters in element tag attributes which also happen to contain character entities.

A minimal example HTML like

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>Test Genshi UnicodeDecodeError</title>
    </head>
    <body>
        <h1>Heäder <a href="http://genshi.edgewall.org"
                title="&lt;strong&gt;Germän Ümläüts äre nö fün&lt;/strong&gt;"> Germän Ümläüts äre
                nö fün </a></h1>
    </body>
</html>

will produce this error:

>>> list(iter(HTMLParser(urllib2.urlopen("file:///data/tmp/genshi_unicodedecodeerror.xhtml"), encoding='utf-8').parse()))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/export/python/virtualenvs/trac/lib/python2.7/site-packages/genshi/core.py", line 272, in _ensure
    event = stream.next()
  File "/export/python/virtualenvs/trac/lib/python2.7/site-packages/genshi/input.py", line 432, in _coalesce
    for kind, data, pos in chain(stream, [(None, None, None)]):
  File "/export/python/virtualenvs/trac/lib/python2.7/site-packages/genshi/input.py", line 327, in _generate
  File "/usr/lib64/python2.7/HTMLParser.py", line 114, in feed
  File "/usr/lib64/python2.7/HTMLParser.py", line 158, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib64/python2.7/HTMLParser.py", line 305, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib64/python2.7/HTMLParser.py", line 472, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/export/python/virtualenvs/trac/lib64/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

The reason for this seems to be that the data read from the file gets fed to the base Python HTMLParser's feed() method as an encoded string, while the docs recommend feeding it with unicode.

There are several places where a string decode() is performed in the event handler methods of genshi.input.HTMLParser so fixing this will probably mean a slight redesign of where & when decoding/encoding takes place.
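
To illustrate the "decode first, then feed" idea outside of Genshi, here is a minimal sketch. It is written for Python 3, where the standard-library module is called html.parser, so it does not reproduce the Python 2.7 error above; it only shows the parser being fed decoded text rather than encoded bytes. The AttrCollector class and the shortened sample markup are hypothetical names made up for this illustration and are not part of Genshi.

```python
from html.parser import HTMLParser  # Python 3 name of the stdlib HTMLParser module


class AttrCollector(HTMLParser):
    """Hypothetical helper: records (tag, attrs) for every start tag seen."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs with entities
        # in the values already replaced by the base parser.
        self.start_tags.append((tag, dict(attrs)))


# Encoded input, as it would come from a file or a network response.
# The umlauts are written as escapes so the snippet is encoding-agnostic.
raw = (
    '<a href="http://genshi.edgewall.org" '
    'title="&lt;strong&gt;Germ\u00e4n \u00dcml\u00e4\u00fcts&lt;/strong&gt;">x</a>'
).encode("utf-8")

parser = AttrCollector()
# Decode once, up front, and feed only text to the parser.
parser.feed(raw.decode("utf-8"))
parser.close()

tag, attrs = parser.start_tags[0]
title = attrs["title"]
```

With the input decoded before feed(), the entity replacement in the attribute value operates on text only, so the non-ascii characters and the replaced entities end up in the same string type.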

I'm currently running 0.6. From looking at the trunk sources this may have been fixed there. Unfortunately I can't test this, as the proxy I'm behind denies svn checkout.

In the 0.6.1 tag directory this doesn't look fixed, though.

Attachments (1)

genshi_unicodedecodeerror.xhtml (525 bytes) - added by jholg@… 4 years ago.
Genshi 0.6 UnicodeDecodeError test file

Change History (5)

comment:1 follow-up: Changed 4 years ago by cboos

Try the "Zip Archive" link at the bottom of the trunk page.

comment:2 in reply to: ↑ 1 Changed 4 years ago by jholg@…

Replying to cboos:

> Try the "Zip Archive" link at the bottom of the trunk page.

Thanks for the hint, I overlooked that (for years, obviously). Just did so and can confirm that trunk (r1236) does not seem to suffer from this problem, indeed:

>>> list(iter(HTMLParser(urllib2.urlopen("file:///data/tmp/genshi_unicodedecodeerror.xhtml"), encoding='utf-8').parse()))
[('PI', (u'xml', u'version="1.0" encoding="UTF-8"'), (None, 1, 0)), ('TEXT', 
u'\n\n', (None, 1, 38)), ('START', (QName('html'), Attrs([(QName('xmlns'), 
u'http://www.w3.org/1999/xhtml')])), (None, 4, 0)), ('TEXT', u'\n    ', (None, 4, 
43)), ('START', (QName('head'), Attrs()), (None, 5, 4)), ('TEXT', u'\n        ', 
(None, 5, 10)), ('START', (QName('title'), Attrs()), (None, 6, 8)), ('TEXT', u'Test 
Genshi UnicodeDecodeError', (None, 6, 15)), ('END', QName('title'), (None, 6, 45)), 
('TEXT', u'\n    ', (None, 6, 53)), ('END', QName('head'), (None, 7, 4)), ('TEXT', 
u'\n    ', (None, 7, 11)), ('START', (QName('body'), Attrs()), (None, 8, 4)), 
('TEXT', u'\n        ', (None, 8, 10)), ('START', (QName('h1'), Attrs()), (None, 9, 
8)), ('TEXT', u'He\xe4der ', (None, 9, 12)), ('START', (QName('a'), 
Attrs([(QName('href'), u'http://genshi.edgewall.org'), (QName('title'), 
u'<strong>Germ\xe4n \xdcml\xe4\xfcts \xe4re n\xf6 f\xfcn</strong>')])), (None, 9, 
19)), ('TEXT', u' Germ\xe4n \xdcml\xe4\xfcts \xe4re\n                n\xf6 f\xfcn ', 
(None, 10, 79)), ('END', QName('a'), (None, 11, 23)), ('END', QName('h1'), (None, 
11, 27)), ('TEXT', u'\n    ', (None, 11, 32)), ('END', QName('body'), (None, 12, 
4)), ('TEXT', u'\n', (None, 12, 11)), ('END', QName('html'), (None, 13, 0))]

Tested as a standalone test in a virtualenv with only genshi trunk, not in combination with a Trac instance.

comment:3 Changed 4 years ago by jholg@…

Added a test html file for convenience, contents as above.

Command line test:

0 $ /var/tmp/testenv/bin/python -c 'from genshi.input import HTMLParser; import urllib2; list(iter(HTMLParser(urllib2.urlopen("file:genshi_unicodedecodeerror.xhtml"), encoding="utf-8").parse()))'
0 $

comment:4 Changed 6 days ago by hodgestar

  • Milestone changed from 0.7 to 0.9

Moved to milestone 0.9.
