Edgewall Software

Genshi Recipes: Localization

Note: this recipe is now obsolete, as internationalization support has been integrated into Genshi. See Internationalization and Localization.

This is code to aid in localization of Genshi templates, without altering the underlying templates. It was originally written by Matt Good, then updated and fixed up by David Fraser.

How it works

First, a word on streams. This code operates on template streams, which are lists of events fairly similar to normal Genshi XML streams but which also contain template-specific events (such as EXPR expressions and SUB substreams). It needs to operate at this level so that it can take advantage of the template parsing, yet still run before the context data is merged with the template.

In order to do this, we need three parts:

  • Extraction of the localization text from templates
  • Construction of the localized template stream using the translations and the original template stream
  • Use of the localized template stream to generate the resulting page
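
To make the idea of a template stream concrete, here is a small sketch that uses plain tuples as stand-ins for Genshi events. The string constants, the translate_text helper and the catalog dictionary are illustrative assumptions, not part of Genshi; the real event kinds live in genshi.core and genshi.template.

```python
# Toy stand-ins for Genshi's event kinds (assumption: plain strings here,
# whereas Genshi uses its own constants such as genshi.core.TEXT).
START, END, TEXT, EXPR = 'START', 'END', 'TEXT', 'EXPR'

# A template stream for roughly: <p>Hello ${name}</p>
# Each event is a (kind, data, pos) tuple; pos is (filename, line, column).
template_stream = [
    (START, ('p', []), ('page.html', 1, 0)),
    (TEXT, 'Hello ', ('page.html', 1, 3)),
    (EXPR, 'name', ('page.html', 1, 9)),
    (END, 'p', ('page.html', 1, 16)),
]

def translate_text(stream, translate):
    """Yield a new stream with TEXT events translated; others pass through."""
    for kind, data, pos in stream:
        if kind == TEXT:
            key = data.strip()
            if key:
                # Replace only the stripped key so surrounding whitespace survives.
                data = data.replace(key, translate(key))
        yield kind, data, pos

# Hypothetical one-entry catalog standing in for a real translation function.
catalog = {'Hello': 'Hallo'}
localized = list(translate_text(template_stream, lambda s: catalog.get(s, s)))
```

The EXPR event is still unevaluated at this point, which is why the translation has to happen on the template stream rather than on the rendered output.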

The code on this page uses gettext - it generates POT files (although they don't currently contain the required header) and uses the ugettext function to translate the template pages (which will use MO files compiled from PO files containing the translations).

Extraction of Localized Text

Here is a module that can be used to extract text from Genshi template streams into POT files, for translation into different languages:

import fnmatch
import os
import re
import logging
import copy

import genshi.core
import genshi.input
import genshi.eval
import genshi.template

ignore_tags = ['script', 'style']
include_attribs = ['title', 'alt']
exclude_dirs = ('.AppleDouble', '.svn', 'CVS', '_darcs')
gettext_re = re.compile(r"_\(((?:'[^']*')|(?:\"[^\"]*\"))\)")

# calculate escapes
escapes = []

def make_escapes(pass_iso8859):
    global escapes
    if pass_iso8859:
        # Allow iso-8859 characters to pass through so that e.g. 'msgid
        # "Höhe"' would not result in 'msgid "H\366he"'.  Otherwise we
        # escape any character outside the 32..126 range.
        mod = 128
    else:
        mod = 256
    for i in range(256):
        if 32 <= (i % mod) <= 126:
            escapes.append(chr(i))
        else:
            escapes.append("\\%03o" % i)
    escapes[ord('\\')] = '\\\\'
    escapes[ord('\t')] = '\\t'
    escapes[ord('\r')] = '\\r'
    escapes[ord('\n')] = '\\n'
    escapes[ord('\"')] = '\\"'

make_escapes(False)

def escape(s):
    global escapes
    s = list(s)
    for i in range(len(s)):
        s[i] = escapes[ord(s[i])]
    return ''.join(s)

def normalize(s):
    """This converts the various Python string types into a format that is
    appropriate for .po files, namely much closer to C style."""
    lines = s.split('\n')
    if len(lines) == 1:
        s = '"' + escape(s) + '"'
    else:
        if not lines[-1]:
            del lines[-1]
            lines[-1] = lines[-1] + '\n'
        for i in range(len(lines)):
            lines[i] = escape(lines[i])
        lineterm = '\\n"\n"'
        s = '""\n"' + lineterm.join(lines) + '"'
    return s

def lang_extract(potfile, source_files, template_class=None):
    """extracts text strings from the given source files and outputs them at the end of the given pot file"""
    fd = open(potfile, 'at+')
    try:
        keys_found = {}
        key_order = []
        for fname, linenum, key in extract_keys(source_files, ['.'], template_class):
            if key in keys_found:
                keys_found[key].append((fname, linenum))
            else:
                keys_found[key] = [(fname, linenum)]
                key_order.append(key)
        for key in key_order:
            for fname, linenum in keys_found[key]:
                fd.write('#: %s:%s\n' % (fname, linenum))
            fd.write('msgid %s\n' % normalize(key))
            fd.write('msgstr ""\n\n')
    finally:
        fd.close()

def _matching_files(dirname, fileglob):
    """searches for matching filenames in a directory"""
    for root, dirs, files in os.walk(dirname):
        for exclude in exclude_dirs:
            try:
                dirs.remove(exclude)
            except ValueError:
                pass
        for fname in fnmatch.filter(files, fileglob):
            yield os.path.join(root, fname)

def extract_keys(files, search_path=None, template_class=None):
    """finds all the text keys in the given files"""
    loader = genshi.template.TemplateLoader(search_path)
    for fname in files:
        logging.info('Scanning l10n keys from: %s' % fname)
        try:
            if template_class is None:
                template = loader.load(fname)
            else:
                template = loader.load(fname, cls=template_class)
        except genshi.input.ParseError, e:
            logging.warning('Skipping extracting l10n keys from %s: %s' % (fname, e))
            continue
        for linenum, key in extract_from_template(template):
            yield fname, linenum, key

def extract_from_template(template, search_text=True):
    """helper to extract linenumber and key pairs from a given template"""
    return extract_from_stream(template.stream, search_text)

def extract_from_stream(stream, search_text=True):
    """takes a MatchTemplate.stream (not a normal XML Stream) and searches for localizable text, yielding linenumber, text tuples"""
    # search_text is set to false when extracting from substreams (that are attribute values for an attribute which is not text)
    # in this case, only Python strings in expressions are extracted
    stream = iter(stream)
    tagname = None
    skip_level = 0
    for kind, data, pos in stream:
        linenum = pos[1]
        if skip_level:
            if kind is genshi.core.START:
                tag, attrs = data
                if tag.localname in ignore_tags:
                    skip_level += 1
            if kind is genshi.core.END:
                tag = data
                if tag.localname in ignore_tags:
                    skip_level -= 1
            continue
        if kind is genshi.core.START:
            tag, attrs = data
            tagname = tag.localname
            if tagname in ignore_tags:
                # skip the substream
                skip_level += 1
                continue
            for name, value in attrs:
                if isinstance(value, basestring):
                    if search_text and name in include_attribs:
                        yield linenum, value
                else:
                    for dummy, key in extract_from_stream(value,
                                                      name in include_attribs):
                        yield linenum, key
        elif kind is genshi.template.EXPR:
            if data.source != "?":
                # TODO: check if these expressions should be localized
                for key in gettext_re.findall(data.source):
                    key = key[1:-1]
                    if key:
                        yield linenum, key
        elif kind is genshi.core.TEXT and search_text:
            key = data.strip()
            if key:
                yield linenum, key
        elif kind is genshi.template.SUB:
            sub_kind, sub_stream = data
            for linenum, key in extract_from_stream(sub_stream, search_text):
                yield linenum, key
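
To see what the gettext_re pattern in the module above picks up from an expression, the following standalone sketch applies the same regular expression to a few sample sources. The keys_in_expression helper is a hypothetical wrapper added for illustration only.

```python
import re

# Same pattern as in the extraction module above.
gettext_re = re.compile(r"_\(((?:'[^']*')|(?:\"[^\"]*\"))\)")

def keys_in_expression(source):
    """Return the non-empty string literals wrapped in _() in an expression."""
    keys = []
    for quoted in gettext_re.findall(source):
        key = quoted[1:-1]  # drop the surrounding quote characters
        if key:
            keys.append(key)
    return keys

both_quotes = keys_in_expression("_('Save') + ' ' + _(\"Cancel\")")
variable_arg = keys_in_expression("_(label)")
empty_literal = keys_in_expression("_('')")
escaped_quote = keys_in_expression("_('it\\'s')")
```

Only plain single- or double-quoted literals directly inside _() are found. Calls with a variable argument, and literals containing an escaped quote, are not matched by this pattern.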

Localization of the Template Stream at Run Time

The following function can then be used to localize the template stream (see below for details on use). ugettext is passed in as a function because language selection needs to depend on the language of the user submitting the request, not on the machine serving the pages. You can thus pass in a specialized ugettext function that uses the appropriate language for the current user.

def localize_template(template_source_stream, ugettext, search_text=True):
    """localizes the given template source stream (i.e. genshi.XML(template_source), not the parsed template's stream)
    need to pass in the ugettext function you want to use"""
    # NOTE: this MUST NOT modify the underlying objects or template reuse will break
    # in addition, if it calls itself recursively it must convert the result to a list or it will break on repetition
    # search_text is set to false when extracting from substreams (that are attribute values for an attribute which is not text)
    # in this case, only Python strings in expressions are extracted
    stream = iter(template_source_stream)
    skip_level = 0
    for kind, data, pos in stream:
        # handle skipping whole chunks we don't want to localize (just yielding everything in them)
        if skip_level:
            if kind is genshi.core.START:
                tag, attrs = data
                tag = tag.localname
                if tag in ignore_tags:
                    skip_level += 1
            if kind is genshi.core.END:
                tag = data.localname
                if tag in ignore_tags:
                    skip_level -= 1
            yield kind, data, pos
            continue
        # handle different kinds of things we want to localize
        if kind is genshi.core.START:
            tag, attrs = data
            tagname = tag.localname
            if tagname in ignore_tags:
                skip_level += 1
                yield kind, data, pos
                continue
            new_attrs = genshi.core.Attrs(attrs[:])
            changed = False
            for name, value in attrs:
                if isinstance(value, basestring):
                    if search_text and name in include_attribs:
                        # translate the attribute's own text, not the search_text flag
                        new_value = ugettext(value)
                        new_attrs.set(name, new_value)
                        changed = True
                else:
                    # this seems to be handling substreams, so we should get back a localized substream
                    # note: passing search_text=False implies far fewer matches, this may be wasteful and the subcall could be skipped in some cases
                    new_value = list(localize_template(value, ugettext, search_text=(name in include_attribs)))
                    new_attrs.set(name, new_value)
                    changed = True
            if changed:
                # ensure we don't change the original string
                attrs = new_attrs
            yield kind, (tag, attrs), pos
        elif kind is genshi.template.EXPR:
            if data.source != "?":
                # TODO: check if these expressions should be localized
                for key in gettext_re.findall(data.source):
                    key = key[1:-1]
                    if key:
                        new_key = ugettext(key)
                        # TODO: if we do this, it needs to be fixed :-)
                        new_data = genshi.eval.Expression(data.source.replace(key, new_key))
                        # we lose the following data, but can't assign as its readonly
                        # new_data.code.co_filename = data.code.co_filename
                        # new_data.code.co_firstlineno = data.code.co_firstlineno
            yield kind, data, pos
        elif kind is genshi.core.TEXT and search_text:
            # we can adjust this as strings are immutable, so this won't change the original string
            key = data.strip()
            if key:
                new_key = ugettext(key)
                data = data.replace(key, new_key)
            yield kind, data, pos
        elif kind is genshi.template.SUB:
            sub_kind, sub_stream = data
            new_sub_stream = list(localize_template(sub_stream, ugettext, search_text=search_text))
            yield kind, (sub_kind, new_sub_stream), pos
        else:
            yield kind, data, pos

Page Generation with Localized Templates

In order to use the modified Template stream, we basically need to do some processing before the normal Genshi mechanism takes over...

This class allows inclusion of "prefilters" that operate before the Template stream's standard filters (you can't just use a filter to do this, it causes problems):

class PrefilterMarkupTemplate(genshi.template.MarkupTemplate):
    """Derived markup template that can receive prefilters in its generate method"""
    # TODO: try and upstream this into genshi
    def generate(self, prefilters, *args, **kwargs):
        """Apply the template to the given context data.
        
        Any keyword arguments are made available to the template as context
        data.
        
        Only one positional argument is accepted: if it is provided, it must be
        an instance of the `Context` class, and keyword arguments are ignored.
        This calling style is used for internal processing.
        
        @return: a markup event stream representing the result of applying
            the template to the context data.
        """
        if args:
            assert len(args) == 1
            ctxt = args[0]
            if ctxt is None:
                ctxt = genshi.template.Context(**kwargs)
            assert isinstance(ctxt, genshi.template.Context)
        else:
            ctxt = genshi.template.Context(**kwargs)

        stream = self.stream
        for prefilter in prefilters:
            # TODO: add support for context in prefilters
            stream = prefilter(iter(stream))
        for filter_ in self.filters:
            stream = filter_(iter(stream), ctxt)
        return genshi.core.Stream(stream)
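
The ordering that this generate method sets up (prefilters first, then the standard filters with the context) can be illustrated with plain generators. This is a toy sketch under stated assumptions: the two-element events, the localize and evaluate functions and the catalog are all hypothetical stand-ins, not Genshi's filters.

```python
TEXT, EXPR = 'TEXT', 'EXPR'

def localize(stream):
    """Prefilter: translate template text before any data is merged in."""
    catalog = {'Hello': 'Hallo'}  # hypothetical translations
    for kind, data in stream:
        if kind == TEXT:
            data = catalog.get(data, data)
        yield kind, data

def evaluate(stream, ctxt):
    """Standard filter: replace each expression with its context value."""
    for kind, data in stream:
        if kind == EXPR:
            kind, data = TEXT, ctxt[data]
        yield kind, data

def generate(stream, prefilters, filters, ctxt):
    """Chain prefilters (no context) and then filters (with context)."""
    for prefilter in prefilters:
        stream = prefilter(iter(stream))
    for filter_ in filters:
        stream = filter_(iter(stream), ctxt)
    return list(stream)

template = [(TEXT, 'Hello'), (EXPR, 'name')]
result = generate(template, [localize], [evaluate], {'name': 'Hello'})
```

Here the template's own "Hello" is translated, while the user-supplied value "Hello" passes through untouched, because it only enters the stream after the prefilter has run.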

This derived class then allows you to call the above localization function as a prefilter on templates. It takes domain_name as a parameter (this corresponds to which PO/MO file to use for translation), and assumes that you can construct or retrieve a translation object for the current user on the fly using a get_translation function (not described here):

class LocalizeMarkupTemplate(PrefilterMarkupTemplate):
    """Derived markup template that can handle localizing before stream generation"""
    def __init__(self, source, basedir=None, filename=None, loader=None,
                 encoding=None, domain_name=None):
        """Initialize a template from either a string or a file-like object."""
        super(LocalizeMarkupTemplate, self).__init__(source, basedir=basedir, filename=filename, loader=loader, encoding=encoding)
        self.domain_name = domain_name

    def localize_prefilter(self, stream):
        """prefilter for localizing..."""
        translation = get_translation(self.domain_name)
        stream = genshi.core.Stream(stream)
        localized_stream = genshigettext.localize_template(stream, translation.ugettext)
        return list(iter(localized_stream))

    # TODO: try and persuade genshi to accept a stream directly here instead of using self.stream - or accept prefilters
    # then we won't use such fragile copied code...
    def generate(self, prefilters, *args, **kwargs):
        """Apply the template to the given context data.
        
        Any keyword arguments are made available to the template as context
        data.
        
        Only one positional argument is accepted: if it is provided, it must be
        an instance of the `Context` class, and keyword arguments are ignored.
        This calling style is used for internal processing.
        
        @return: a markup event stream representing the result of applying
            the template to the context data.
        """
        return super(LocalizeMarkupTemplate, self).generate(prefilters + [self.localize_prefilter], *args, **kwargs)
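
The get_translation function is not described on this page. One possible shape for it, built on the standard gettext module, is sketched below. The LOCALE_DIR path, the languages parameter and the ugettext_for helper are assumptions for illustration; in a real application the language list would come from the current request rather than from a default argument.

```python
import gettext

LOCALE_DIR = 'locale'  # hypothetical directory holding <lang>/LC_MESSAGES/<domain>.mo
_translations = {}     # cache of translation objects keyed by (domain, languages)

def get_translation(domain_name, languages=('en',)):
    """Return a cached translation object for the domain and languages."""
    key = (domain_name, tuple(languages))
    if key not in _translations:
        # fallback=True returns a NullTranslations object (which passes
        # strings through unchanged) when no matching MO file is found.
        _translations[key] = gettext.translation(
            domain_name, localedir=LOCALE_DIR,
            languages=list(languages), fallback=True)
    return _translations[key]

def ugettext_for(translation):
    """Pick ugettext where it exists (Python 2), otherwise gettext."""
    return getattr(translation, 'ugettext', translation.gettext)

translate = ugettext_for(get_translation('myapp', ['de']))
```

With no compiled catalog available, translate simply returns its argument, so templates still render in their original language.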

Using the above code

Ideally the above code would be integrated into Genshi itself, but if you want to use it in the meantime:

  • Place it in a module that you can import (the template classes above refer to it as genshigettext)
  • Run the extraction code to get the .pot files containing your translatable text (and verify that they seem sensible)
  • Use a translation editor (something like Pootle for online translation, or poedit or kbabel for a local GUI) to translate the extracted text
  • Use the LocalizeMarkupTemplate class rather than the standard MarkupTemplate class to parse your templates, and set up the required translation hooks for generation
  • Hint: Don't waste your time trying to add _("") in your templates, all the text is automatically extracted.

See also: GenshiRecipes, Internationalization and Localization

Last modified on Jul 4, 2007, 5:18:21 PM