Edgewall Software

Ticket #184 (new defect)

Opened 6 years ago

Last modified 3 years ago

Assume UTF-8 as default encoding of template data (was: str encoding in input)

Reported by: brickenstein@… Owned by: cmlenz
Priority: major Milestone: 0.7
Component: General Version: 0.4.4
Keywords: encoding Cc:

Description (last modified by cmlenz)

Hi!

I am experiencing problems with strings containing non-ascii characters in the input.

 --> parse stage: 20.0000 ms
Traceback (most recent call last):
  File "run.py", line 46, in <module>
    test()
  File "run.py", line 22, in test
    print tmpl.generate(**data).render(method='html')
  File "/home/michael/Genshi-0.4.4/genshi/core.py", line 154, in render
    return encode(generator, method=method, encoding=encoding)
  File "/home/michael/Genshi-0.4.4/genshi/output.py", line 45, in encode
    output = u''.join(list(iterator))
  File "/home/michael/Genshi-0.4.4/genshi/output.py", line 369, in __call__
    for kind, data, pos in stream:
  File "/home/michael/Genshi-0.4.4/genshi/output.py", line 618, in __call__
    for kind, data, pos in stream:
  File "/home/michael/Genshi-0.4.4/genshi/output.py", line 688, in __call__
    text = mjoin(textbuf, escape_quotes=False)
  File "/home/michael/Genshi-0.4.4/genshi/core.py", line 379, in join
    for item in seq]))
  File "/home/michael/Genshi-0.4.4/genshi/core.py", line 405, in escape
    text = unicode(text).replace('&', '&amp;') \
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

I attach a patch, which solves the problem for me, but fixes the assumed encoding to 'utf-8'. A better solution would be to have a variable assume_encoding, as in kid.

As an example, I attach a modified run.py from examples/basic, where I replaced <world> with Wörld.
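The failure can be reproduced without Genshi. The following is a minimal sketch, assuming the template data arrives as a UTF-8 encoded bytestring; the helper name implicit_decode is made up for illustration and only mimics the implicit ASCII conversion that unicode(text) performs on a Python 2 str.

```python
# -*- coding: utf-8 -*-
# Sketch only: reproduces the codec error, not Genshi's actual code path.

# "Wörld" encoded as UTF-8; the "ö" becomes the two bytes 0xc3 0xb6.
raw = u"W\u00f6rld".encode("utf-8")


def implicit_decode(data):
    """Hypothetical helper: decode with the ASCII codec, as Python 2
    does when unicode() is called on a plain str."""
    return data.decode("ascii")


try:
    implicit_decode(raw)
    failure = None
except UnicodeDecodeError as exc:
    # The first non-ASCII byte (0xc3) sits at index 1 of the bytestring.
    failure = exc

# Decoding explicitly with the right codec recovers the text.
explicit = raw.decode("utf-8")
```

Here failure.start is the index of the offending byte, which corresponds to the "position 1" reported in the traceback above.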

This ticket is related to http://code.google.com/p/dbsprockets/issues/detail?id=54

Thank you very much in advance. Best regards, Michael

Attachments

core.py.patch (0.5 KB) - added by brickenstein@… 6 years ago.
run.py (1.2 KB) - added by brickenstein@… 6 years ago.

Change History

Changed 6 years ago by brickenstein@…

Changed 6 years ago by brickenstein@…

Changed 6 years ago by cboos

... fixes the assumed encoding to 'utf-8'. A better solution would be to have a variable assume_encoding, as in kid.

Or rather, make clear in the documentation that Markup has to be given exclusively unicode objects or base types that convert cleanly to unicode (ascii strings, numbers).

In the general case, when you have no clue about the encoding of a str, you need a more robust approach, like the to_unicode hammer which is used in Trac (see Trac:source:/trunk/trac/util/text.py).
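A best-effort conversion along those lines might look like the sketch below. This is not Trac's to_unicode; the name to_unicode_sketch, the charset parameter and the Latin-1 fallback are assumptions made for illustration.

```python
def to_unicode_sketch(value, charset=None):
    """Hypothetical best-effort conversion of arbitrary input to text.

    Bytestrings are tried as UTF-8 first; if that fails, they are decoded
    with `charset` (Latin-1 when none is given), replacing undecodable bytes.
    """
    text_type = type(u"")  # unicode on Python 2, str on Python 3
    if isinstance(value, bytes):
        try:
            return value.decode("utf-8")
        except UnicodeDecodeError:
            return value.decode(charset or "latin-1", "replace")
    if isinstance(value, text_type):
        return value
    # Numbers and other base types convert cleanly.
    return text_type(value)


# Usage: UTF-8 input, a legacy cp1252 byte, and a number.
world = to_unicode_sketch(b"W\xc3\xb6rld")
dash = to_unicode_sketch(b"\x96", "cp1252")
number = to_unicode_sketch(42)
```

The fallback never raises, which is what makes it a "hammer": the price is that input in an unexpected encoding can come out as wrong characters instead of an error.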

Changed 6 years ago by cmlenz

Genshi does not support arbitrary bytestrings: input has to be either unicode or a bytestring encoded using the default encoding. I'd need to check whether this is actually properly documented.

Explicitly defaulting to UTF-8 for bytestrings might make sense. At least if it's not UTF-8, you then get an immediate error instead of random wrong chars.
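The difference between an immediate error and silently wrong characters can be seen in a small sketch. The sample bytes are assumptions: 0x96 is taken from the second traceback in this ticket, and Latin-1 stands in for any permissive single-byte codec.

```python
# Sketch only: compares strict UTF-8 decoding with a permissive codec.

# A lone 0x96 byte is not valid UTF-8, so strict decoding fails at once.
legacy = b"\x96"
try:
    legacy.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# Latin-1 accepts every byte, so UTF-8 data decoded with it never raises;
# each byte of the two-byte "ö" sequence becomes a separate wrong character.
silent = b"W\xc3\xb6rld".decode("latin-1")
```

With a strict UTF-8 default, data in another encoding is rejected where it enters; with a permissive codec the mistake only shows up later as garbled output.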

Changed 6 years ago by anonymous

I am having a very similar error.

System Information
Trac: 0.11b1 
Python: 2.4.2 (#1, Jan 10 2008, 17:43:47) [GCC 4.1.2 20070115 (prerelease) (SUSE Linux)] 
setuptools: 0.6c8 
SQLite: 3.2.8 
pysqlite: 2.4.1 
Genshi: 0.5dev-r801 
Pygments: 0.9 
Subversion: 1.3.1 (r19032) 
jQuery: 1.2.1

When I perform a simple search of tickets for the word "character", I get the following error:

 Traceback (most recent call last):
  File "/usr/local/lib64/python2.4/site-packages/Trac-0.11b1-py2.4.egg/trac/web/api.py", line 339, in send_error
    'text/html')
  File "/usr/local/lib64/python2.4/site-packages/Trac-0.11b1-py2.4.egg/trac/web/chrome.py", line 683, in render_template
    return stream.render(method, doctype=doctype)
  File "build/bdist.linux-x86_64/egg/genshi/core.py", line 172, in render
  File "build/bdist.linux-x86_64/egg/genshi/output.py", line 45, in encode
  File "build/bdist.linux-x86_64/egg/genshi/output.py", line 291, in __call__
  File "build/bdist.linux-x86_64/egg/genshi/output.py", line 714, in __call__
  File "build/bdist.linux-x86_64/egg/genshi/output.py", line 553, in __call__
  File "build/bdist.linux-x86_64/egg/genshi/output.py", line 668, in __call__
UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 812: ordinal not in range(128)

Changed 6 years ago by cmlenz

  • description modified
  • milestone changed from 0.5 to 0.6

I'm still thinking about whether to make UTF-8 the default, but this will have to wait for the next release.

Changed 6 years ago by cmlenz

See also #224.

Changed 5 years ago by cmlenz

  • summary changed from str encoding in input to Assume UTF-8 as default encoding of template data (was: str encoding in input)

Changed 4 years ago by cmlenz

  • milestone changed from 0.6 to 0.7
