Genshi Recipes: Transforming HTML documents
While Genshi XML templates need to be valid XML files, that does not mean you can't use Genshi to transform “old-school” HTML documents. Genshi can parse HTML input and run match templates over it to make any kind of modification, such as adding site-specific chrome.
Let's say you have the following HTML document (maybe produced by some application or component out of your control), and you'd like to integrate it in your site:
    <HTML>
      <HEAD>
        <TITLE>Angosso</TITLE>
        <LINK REL=stylesheet href='badstyle.css'>
      </HEAD>
      <BODY>
        <H1>Angosso</H1>
        <P>
          <B>Lorem <I>ipsum</I></B> dolor sit amet, consectetur<BR>
          adipisicing elit, sed do eiusmod tempor incididunt ut<BR>
          labore et dolore magna aliqua. Ut enim ad minim veniam,<BR>
          quis nostrud exercitation ullamco laboris nisi ut<BR>
          aliquip ex ea commodo consequat.
        </P>
        <P>
          Duis aute irure dolor in reprehenderit in voluptate velit<BR>
          esse cillum dolore eu fugiat nulla pariatur. Excepteur sint<BR>
          occaecat cupidatat non proident, sunt in culpa qui officia<BR>
          deserunt mollit anim <I>id est laborum</I>.
        </P>
      </BODY>
    </HTML>
What you'd like to do is:
- Make that valid XHTML, with a proper DOCTYPE to trigger standards rendering mode in browsers.
- Use “semantic” tags such as <em> and <strong> instead of the more presentational <i> and <b> (whether or not that's really a good idea).
- Add a new <div id="header"> at the top of the page that contains your site logo.
Using Match Templates
To do that, first start with the following template:
    <html xmlns:py="http://genshi.edgewall.org/" py:strip="">

      <!--! Add a header DIV on top of every page with a logo image -->
      <body py:match="body">
        <div id="header">
          <img src="logo3494882_md.png" alt="Bad Style"/>
        </div>
        ${select('*')}
      </body>

      <!--! Use semantic instead of presentational tags for emphasis -->
      <strong py:match="B|b">${select('*|text()')}</strong>
      <em py:match="I|i">${select('*|text()')}</em>

      <!--! Include the actual HTML stream, which will be processed by
            the rules defined above -->
      ${input}

    </html>
That template defines three match templates that do what we need: one for the body, and one each for the bold and italic tags. At the end, it pulls in the actual HTML content using the “input” variable, which the calling script has to supply.
Finally, the following script would drive the transformation:
    import sys

    from genshi.input import HTMLParser
    from genshi.template import Context, MarkupTemplate

    def transform(html_filename, tmpl_filename):
        # Compile the template that holds the match rules
        with open(tmpl_filename) as tmpl_fileobj:
            tmpl = MarkupTemplate(tmpl_fileobj, tmpl_filename)

        # The parser reads lazily, so render while the HTML file is still open
        with open(html_filename) as html_fileobj:
            html = HTMLParser(html_fileobj, html_filename)
            # doctype='xhtml' selects the XHTML 1.0 Strict DOCTYPE
            print(tmpl.generate(Context(input=html)).render('xhtml', doctype='xhtml'))

    if __name__ == '__main__':
        transform(sys.argv[1], sys.argv[2])
This would then produce the following output (ignoring some small whitespace differences):
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html>
      <head>
        <title>Angosso</title>
        <link rel="stylesheet" href="badstyle.css" />
      </head>
      <body>
        <div id="header">
          <img src="logo3494882_md.png" alt="Bad Style" />
        </div>
        <h1>Angosso</h1>
        <p>
          <strong>Lorem <em>ipsum</em></strong> dolor sit amet, consectetur<br />
          adipisicing elit, sed do eiusmod tempor incididunt ut<br />
          labore et dolore magna aliqua. Ut enim ad minim veniam,<br />
          quis nostrud exercitation ullamco laboris nisi ut<br />
          aliquip ex ea commodo consequat.
        </p>
        <p>
          Duis aute irure dolor in reprehenderit in voluptate velit<br />
          esse cillum dolore eu fugiat nulla pariatur. Excepteur sint<br />
          occaecat cupidatat non proident, sunt in culpa qui officia<br />
          deserunt mollit anim <em>id est laborum</em>.
        </p>
      </body>
    </html>
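Note that the upper-case tags and the unquoted attribute of the input come out lower-cased and quoted. As far as I know, Genshi's HTMLParser builds on the HTML parser from the Python standard library, which normalizes names while parsing. The sketch below uses only the standard library (Python 3) to show that normalization; `TagLogger` and `start_tags` are hypothetical names for this example, not part of Genshi.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record start tags and attributes exactly as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive already lower-cased.
        self.seen.append((tag, attrs))


def start_tags(html_text):
    """Return the (tag, attributes) pairs found in an HTML snippet."""
    logger = TagLogger()
    logger.feed(html_text)
    logger.close()
    return logger.seen


if __name__ == '__main__':
    for tag, attrs in start_tags("<LINK REL=stylesheet href='badstyle.css'><BR>"):
        print(tag, attrs)
```

If Genshi indeed sees only lower-case names, the upper-case alternatives in match paths such as "B|b" would be redundant, though harmless.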
Using the Transformer Filter
The same transformation can be written with the Transformer filter, without a separate template:
    import sys

    from genshi.builder import tag
    from genshi.filters.transform import Transformer
    from genshi.input import HTMLParser

    def transform(html_filename):
        with open(html_filename) as html_fileobj:
            html_parser = HTMLParser(html_fileobj, html_filename)
            html_stream = html_parser.parse()
            transformed_stream = html_stream | Transformer('body') \
                .prepend(tag.div(
                    tag.img(src="logo.png", alt="Bad Style"),
                    id="header")) \
                .select('.//b').unwrap().wrap('strong').end() \
                .select('.//i').unwrap().wrap('em')
            # Streams are lazy: render before the input file is closed.
            # doctype='xhtml' selects the XHTML 1.0 Strict DOCTYPE
            print(transformed_stream.render('xhtml', doctype='xhtml'))

    if __name__ == '__main__':
        transform(sys.argv[1])
See also: GenshiRecipes, Genshi XML Template Language, genshi.builder, genshi.filters.transform