Opened 10 years ago

Last modified 15 months ago

#184 new defect

str encoding in input — at Version 4

Reported by: brickenstein@… Owned by: cmlenz
Priority: major Milestone: 0.9
Component: General Version: 0.4.4
Keywords: encoding Cc:

Description (last modified by cmlenz)

Hi!

I am experiencing problems with input strings that contain non-ASCII characters.

 --> parse stage: 20.0000 ms
Traceback (most recent call last):
  File "run.py", line 46, in <module>
    test()
  File "run.py", line 22, in test
    print tmpl.generate(**data).render(method='html')
  File "/home/michael/Genshi-0.4.4/genshi/core.py", line 154, in render
    return encode(generator, method=method, encoding=encoding)
  File "/home/michael/Genshi-0.4.4/genshi/output.py", line 45, in encode
    output = u''.join(list(iterator))
  File "/home/michael/Genshi-0.4.4/genshi/output.py", line 369, in __call__
    for kind, data, pos in stream:
  File "/home/michael/Genshi-0.4.4/genshi/output.py", line 618, in __call__
    for kind, data, pos in stream:
  File "/home/michael/Genshi-0.4.4/genshi/output.py", line 688, in __call__
    text = mjoin(textbuf, escape_quotes=False)
  File "/home/michael/Genshi-0.4.4/genshi/core.py", line 379, in join
    for item in seq]))
  File "/home/michael/Genshi-0.4.4/genshi/core.py", line 405, in escape
    text = unicode(text).replace('&', '&amp;') \
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

I attach a patch, which solves the problem for me, but fixes the assumed encoding to 'utf-8'. A better solution would be to have a variable assume_encoding, as in kid.
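A minimal sketch of what such an assume_encoding option could look like. It is written for Python 3, where bytes plays the role of the Python 2 str seen in the traceback. The names to_text and escape_text and the assume_encoding parameter are hypothetical; this is neither the attached patch nor the API of Genshi or Kid:

```python
def to_text(value, assume_encoding="utf-8"):
    """Return value as text, decoding bytestrings with assume_encoding.

    Hypothetical helper: the encoding is an assumption supplied by the
    caller and is not detected from the data.
    """
    if isinstance(value, bytes):
        # Strict decoding: input in a different encoding raises
        # UnicodeDecodeError instead of passing through silently.
        return value.decode(assume_encoding)
    return str(value)


def escape_text(value, assume_encoding="utf-8"):
    """Escape &, < and > after converting value to text."""
    text = to_text(value, assume_encoding)
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;"))
```

With this sketch, escape_text("Wörld".encode("utf-8")) yields the text Wörld, while Latin-1 input needs assume_encoding="latin-1" to be passed explicitly.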

As an example, I attach a modified run.py from examples/basic, in which I replaced <world> with Wörld.

This ticket is related to http://code.google.com/p/dbsprockets/issues/detail?id=54

Thank you very much in advance. Best regards, Michael

Change History (6)

Changed 10 years ago by brickenstein@…

Changed 10 years ago by brickenstein@…

comment:1 Changed 10 years ago by cboos

... fixes the assumed encoding to 'utf-8'. A better solution would be to have a variable assume_encoding, as in kid.

Or rather, make clear in the documentation that Markup has to be given exclusively unicode objects or base types that convert cleanly to unicode (ascii strings, numbers).

In the general case, when you have no clue about the encoding of a str, you need a more robust approach, like the to_unicode hammer which is used in Trac (see Trac:source:/trunk/trac/util/text.py).
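To illustrate the idea of a more robust conversion, here is a rough sketch in Python 3 terms (bytes standing in for the Python 2 str). It is not Trac's actual to_unicode implementation. The function name to_unicode_sketch and the Latin-1 fallback are assumptions chosen for the example:

```python
def to_unicode_sketch(value, fallback="latin-1"):
    """Convert value to text without raising on undecodable bytes.

    Hypothetical illustration: try UTF-8 first, then fall back to
    another charset, replacing anything that still cannot be decoded.
    """
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        try:
            return value.decode("utf-8")
        except UnicodeDecodeError:
            # The fallback charset is a guess, so the result may
            # contain wrong characters, but the call never fails.
            return value.decode(fallback, errors="replace")
    return str(value)
```

The trade-off is that this kind of function never raises, so a wrong guess about the encoding shows up as wrong characters in the output rather than as an error.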

comment:2 Changed 10 years ago by cmlenz

Genshi does not support bytestrings: input either has to be unicode or be encoded using the default encoding. I'd need to check whether this is actually properly documented.

Explicitly defaulting to UTF-8 for bytestrings might make sense. At least if it's not UTF-8, you then get an immediate error instead of random wrong characters.
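The difference between an immediate error and wrong characters can be sketched as follows, in Python 3 terms (bytes standing in for the Python 2 str). The helper name strict_utf8_fails and the sample data are made up for the illustration:

```python
utf8_bytes = "Wörld".encode("utf-8")      # b"W\xc3\xb6rld"
latin1_bytes = "Wörld".encode("latin-1")  # b"W\xf6rld"


def strict_utf8_fails(data):
    """Return True if data is rejected by a strict UTF-8 decode."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# Latin-1 bytes are rejected by a strict UTF-8 decode: the mismatch
# surfaces as an exception.
latin1_rejected = strict_utf8_fails(latin1_bytes)

# UTF-8 bytes decoded with the wrong single-byte charset never fail,
# they just produce different (wrong) characters.
mojibake = utf8_bytes.decode("latin-1")
```

Here latin1_rejected is True, while mojibake is a six-character string that no longer equals "Wörld".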

comment:3 Changed 10 years ago by anonymous

I am having a very similar error.

System Information
Trac: 0.11b1 
Python: 2.4.2 (#1, Jan 10 2008, 17:43:47) [GCC 4.1.2 20070115 (prerelease) (SUSE Linux)] 
setuptools: 0.6c8 
SQLite: 3.2.8 
pysqlite: 2.4.1 
Genshi: 0.5dev-r801 
Pygments: 0.9 
Subversion: 1.3.1 (r19032) 
jQuery: 1.2.1

When I perform a simple search of tickets for the word "character", I get the following error:

 Traceback (most recent call last):
  File "/usr/local/lib64/python2.4/site-packages/Trac-0.11b1-py2.4.egg/trac/web/api.py", line 339, in send_error
    'text/html')
  File "/usr/local/lib64/python2.4/site-packages/Trac-0.11b1-py2.4.egg/trac/web/chrome.py", line 683, in render_template
    return stream.render(method, doctype=doctype)
  File "build/bdist.linux-x86_64/egg/genshi/core.py", line 172, in render
  File "build/bdist.linux-x86_64/egg/genshi/output.py", line 45, in encode
  File "build/bdist.linux-x86_64/egg/genshi/output.py", line 291, in __call__
  File "build/bdist.linux-x86_64/egg/genshi/output.py", line 714, in __call__
  File "build/bdist.linux-x86_64/egg/genshi/output.py", line 553, in __call__
  File "build/bdist.linux-x86_64/egg/genshi/output.py", line 668, in __call__
UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 812: ordinal not in range(128)

comment:4 Changed 10 years ago by cmlenz

  • Description modified (diff)
  • Milestone changed from 0.5 to 0.6

I'm still thinking about whether to make UTF-8 the default, but this will have to wait for the next release.
