Edgewall Software

Genshi Recipes: Transforming HTML documents

While Genshi XML templates must be well-formed XML, that does not stop you from using Genshi to transform “old-school” HTML documents. Genshi can parse HTML input and apply match templates to it, which lets you make any kind of modification, such as adding site-specific chrome.

Let's say you have the following HTML document (maybe produced by some application or component out of your control), and you'd like to integrate it in your site:

<HTML>
 <HEAD>
  <TITLE>Angosso</TITLE>
  <LINK REL=stylesheet href='badstyle.css'>
 </HEAD>
 
 <BODY>
  <H1>Angosso</H1>
  <P>
    <B>Lorem <I>ipsum</I></B> dolor sit amet, consectetur<BR>
    adipisicing elit, sed do eiusmod tempor incididunt ut<BR>
    labore et dolore magna aliqua. Ut enim ad minim veniam,<BR>
    quis nostrud exercitation ullamco laboris nisi ut<BR>
    aliquip ex ea commodo consequat.
  </P>
  <P>
    Duis aute irure dolor in reprehenderit in voluptate velit<BR>
    esse cillum dolore eu fugiat nulla pariatur. Excepteur sint<BR>
    occaecat cupidatat non proident, sunt in culpa qui officia<BR>
    deserunt mollit anim <I>id est laborum</I>.
  </P>
 </BODY>

</HTML>

What you'd like to do is:

  • Make that valid XHTML, with a proper DOCTYPE to trigger standards rendering mode in browsers.
  • Use “semantic” tags such as <em> and <strong> instead of the more presentational <i> and <b> (whether or not that's really a good idea).
  • Add a new <div id="header"> at the top of the page that contains your site logo.

Using Match Templates

To do that, start with the following template:

<html xmlns:py="http://genshi.edgewall.org/" py:strip="">

  <!--! Add a header DIV on top of every page with a logo image -->
  <body py:match="body">
    <div id="header">
      <img src="logo3494882_md.png" alt="Bad Style"/>
    </div>
    ${select('*')}
  </body>

  <!--! Use semantic instead of presentational tags for emphasis -->
  <strong py:match="B|b">${select('*|text()')}</strong>
  <em py:match="I|i">${select('*|text()')}</em>

  <!--! Include the actual HTML stream, which will be processed by the rules
        defined above -->
  ${input}

</html>

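That template defines three match templates: one that adds the header to <body>, and two that replace <b> and <i> with <strong> and <em>. Because the root <html> element carries py:strip="", it is dropped from the output and only acts as a container for the rules. At the end, the template pulls in the actual HTML content through the “input” variable, so the parsed stream passes through those rules.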

Finally, the following script would drive the transformation:

import sys
from genshi.input import HTMLParser
from genshi.template import Context, MarkupTemplate

def transform(html_filename, tmpl_filename):
    # Parse the template that holds the match rules.
    with open(tmpl_filename) as tmpl_fileobj:
        tmpl = MarkupTemplate(tmpl_fileobj, tmpl_filename)

    # The HTML parser reads lazily, so render while the file is still open.
    with open(html_filename) as html_fileobj:
        html = HTMLParser(html_fileobj, html_filename)
        # doctype='xhtml' selects the XHTML 1.0 Strict DOCTYPE.
        print(tmpl.generate(Context(input=html)).render('xhtml', doctype='xhtml'))

if __name__ == '__main__':
    transform(sys.argv[1], sys.argv[2])

This would then produce the following output (ignoring some small whitespace differences):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
 <head>
  <title>Angosso</title>
  <link rel="stylesheet" href="badstyle.css" />
 </head>
 <body>
  <div id="header">
   <img src="logo3494882_md.png" alt="Bad Style" />
  </div>
  <h1>Angosso</h1>
  <p>
    <strong>Lorem <em>ipsum</em></strong> dolor sit amet, consectetur<br />
    adipisicing elit, sed do eiusmod tempor incididunt ut<br />
    labore et dolore magna aliqua. Ut enim ad minim veniam,<br />
    quis nostrud exercitation ullamco laboris nisi ut<br />
    aliquip ex ea commodo consequat.
  </p><p>
    Duis aute irure dolor in reprehenderit in voluptate velit<br />
    esse cillum dolore eu fugiat nulla pariatur. Excepteur sint<br />
    occaecat cupidatat non proident, sunt in culpa qui officia<br />
    deserunt mollit anim <em>id est laborum</em>.
  </p>
 </body>
</html>

Using the Transformer Filter

The same transformation can be written with the Transformer filter instead of a template:

import sys
from genshi.input import HTMLParser
from genshi.builder import tag
from genshi.filters.transform import Transformer

def transform(html_filename):
    # The parser reads lazily, so render while the file is still open.
    with open(html_filename) as html_fileobj:
        html_stream = HTMLParser(html_fileobj, html_filename).parse()
        # Prepend the header to <body>, then swap <b>/<i> for <strong>/<em>.
        transformed_stream = html_stream | Transformer('body') \
            .prepend(tag.div(
                tag.img(src="logo.png", alt="Bad Style"),
                id="header")) \
            .select('.//b').unwrap().wrap('strong').end() \
            .select('.//i').unwrap().wrap('em')

        # doctype='xhtml' selects the XHTML 1.0 Strict DOCTYPE.
        print(transformed_stream.render('xhtml', doctype='xhtml'))


if __name__ == '__main__':
    transform(sys.argv[1])

See also: GenshiRecipes, Genshi XML Template Language, genshi.builder, genshi.filters.transform

Last modified on Jan 12, 2011, 10:19:24 PM