Edgewall Software

Version 1 (modified by cmlenz, 18 years ago)

New recipe for transforming HTML

Markup Recipes: Transforming HTML documents

Markup templates must be valid XML files, but Markup can still transform “old-school” HTML documents. It can parse HTML input and apply match templates to it, which lets you make any kind of modification, such as adding site-specific chrome.

Let's say you have the following HTML document (maybe produced by some application or component outside your control), and you'd like to integrate it into your site:

<HTML>
 <HEAD>
  <TITLE>Aaarrgh</TITLE>
  <LINK REL=stylesheet href='badstyle.css'>
 </HEAD>
 
 <BODY>
  <H1>Aaargh</H1>
  <P>
    <B>Lorem <I>ipsum</I></B> dolor sit amet, consectetur<BR>
    adipisicing elit, sed do eiusmod tempor incididunt ut<BR>
    labore et dolore magna aliqua. Ut enim ad minim veniam,<BR>
    quis nostrud exercitation ullamco laboris nisi ut<BR>
    aliquip ex ea commodo consequat.
  </P>
  <P>
    Duis aute irure dolor in reprehenderit in voluptate velit<BR>
    esse cillum dolore eu fugiat nulla pariatur. Excepteur sint<BR>
    occaecat cupidatat non proident, sunt in culpa qui officia<BR>
    deserunt mollit anim <I>id est laborum</I>.
  </P>
 </BODY>

</HTML>

What you'd like to do is:

  • Make that valid XHTML, with a proper DOCTYPE to trigger standards rendering mode in browsers.
  • Use “semantic” tags such as <em> and <strong> instead of the more presentational <i> and <b> (whether or not that's really a good idea).
  • Add a new <div id="header"> at the top of the page that contains your site logo.

To do that, first start with the following template:

<!DOCTYPE html
    PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns:py="http://markup.edgewall.org/" py:strip="">

  <!--! Add a header DIV on top of every page with a logo image -->
  <body py:match="body">
    <div id="header">
      <img src="logo.png" alt="Bad Style"/>
    </div>
    ${select('*')}
  </body>

  <!--! Use semantic instead of presentational tags for emphasis -->
  <strong py:match="B|b">${select('*|text()')}</strong>
  <em py:match="I|i">${select('*|text()')}</em>

  <!--! Include the actual HTML stream, which will be processed by the rules
        defined above -->
  ${input}

</html>

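That template defines three match templates that do what we need: one re-emits the body with the header div added at the top, and the other two replace <b> and <i> elements with <strong> and <em>. At the end, it pulls in the actual HTML content using the “input” variable, which the driver script has to supply.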
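To make the effect of the emphasis rules concrete without Markup installed, here is a minimal sketch that approximates them with Python 3's standard `html.parser` module. It is not how Markup implements match templates: the `Rewriter` class, the `to_xhtml` helper, and the `RENAMES` and `VOID` tables are hypothetical names introduced for illustration, and the sketch drops comments and the DOCTYPE and does not add the header div.

```python
from html import escape
from html.parser import HTMLParser  # standard library, not markup.input

# Assumed mapping that mirrors the two emphasis rules in the template.
RENAMES = {"b": "strong", "i": "em"}
# A few HTML elements that never have content; XHTML writes them self-closed.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class Rewriter(HTMLParser):
    """Re-serialize HTML in XHTML style, renaming tags along the way."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        # html.parser has already lowercased tag and attribute names.
        tag = RENAMES.get(tag, tag)
        rendered = "".join(
            ' %s="%s"' % (name, escape(name if value is None else value))
            for name, value in attrs
        )
        self.out.append("<%s%s%s>" % (tag, rendered, " /" if tag in VOID else ""))

    def handle_endtag(self, tag):
        if tag not in VOID:  # void elements were closed in handle_starttag
            self.out.append("</%s>" % RENAMES.get(tag, tag))

    def handle_data(self, data):
        # Character references arrive decoded, so escape them again.
        self.out.append(escape(data, quote=False))


def to_xhtml(html):
    parser = Rewriter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(to_xhtml("<P><B>Lorem <I>ipsum</I></B> dolor<BR>sit</P>"))
# prints: <p><strong>Lorem <em>ipsum</em></strong> dolor<br />sit</p>
```

The template-based approach keeps these rules declarative and lets them sit next to the site chrome; the sketch only shows what kind of rewriting the rules amount to.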

Finally, the following script would drive the transformation:

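import sys
from markup.input import HTMLParser
from markup.template import Context, Template

def transform(html_filename, tmpl_filename):
    # Load the template; its file is not needed once the Template exists.
    tmpl_fileobj = open(tmpl_filename)
    try:
        tmpl = Template(tmpl_fileobj, tmpl_filename)
    finally:
        tmpl_fileobj.close()

    # The HTML parser reads from the file as the stream is consumed, so
    # keep the file open until the output has been rendered.
    html_fileobj = open(html_filename)
    try:
        html = HTMLParser(html_fileobj, html_filename)
        print(tmpl.generate(Context(input=html)).render('xhtml'))
    finally:
        html_fileobj.close()

if __name__ == '__main__':
    # Usage: script.py <html file> <template file>
    transform(sys.argv[1], sys.argv[2])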

This would then produce the following output (ignoring some small whitespace differences):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
 <head>
  <title>Aaarrgh</title>
  <link rel="stylesheet" href="badstyle.css" />
 </head>
 <body>
  <div id="header">
   <img src="logo.png" alt="Bad Style" />
  </div>
  <h1>Aaargh</h1>
  <p>
    <strong>Lorem <em>ipsum</em></strong> dolor sit amet, consectetur<br />
    adipisicing elit, sed do eiusmod tempor incididunt ut<br />
    labore et dolore magna aliqua. Ut enim ad minim veniam,<br />
    quis nostrud exercitation ullamco laboris nisi ut<br />
    aliquip ex ea commodo consequat.
  </p><p>
    Duis aute irure dolor in reprehenderit in voluptate velit<br />
    esse cillum dolore eu fugiat nulla pariatur. Excepteur sint<br />
    occaecat cupidatat non proident, sunt in culpa qui officia<br />
    deserunt mollit anim <em>id est laborum</em>.
  </p>
 </body>
</html>
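One way to confirm that a result like this meets the goals listed earlier is to run it through an XML parser and inspect the tree. The following is a minimal sketch using Python 3's standard `xml.etree.ElementTree`. The `check` function and the abbreviated `OUTPUT` sample are assumptions made for illustration, and the sketch only tests well-formedness and structure, not validity against the XHTML DTD.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the output shown above (first paragraph shortened).
OUTPUT = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
 <head>
  <title>Aaarrgh</title>
  <link rel="stylesheet" href="badstyle.css" />
 </head>
 <body>
  <div id="header">
   <img src="logo.png" alt="Bad Style" />
  </div>
  <h1>Aaargh</h1>
  <p>
    <strong>Lorem <em>ipsum</em></strong> dolor sit amet, consectetur<br />
    adipisicing elit.
  </p>
 </body>
</html>"""


def check(xhtml):
    """Return a list of problems; an empty list means every check passed."""
    root = ET.fromstring(xhtml)  # raises ParseError if not well-formed XML
    problems = []
    body = root.find("body")
    first = body[0] if body is not None and len(body) else None
    if first is None or first.tag != "div" or first.get("id") != "header":
        problems.append("header div is not the first child of <body>")
    for tag in ("b", "i"):
        if root.find(".//" + tag) is not None:
            problems.append("presentational <%s> still present" % tag)
    return problems


print(check(OUTPUT))
# prints: []
```

Feeding the original HTML document to `check` would fail at the parsing step already, since unclosed tags such as <BR> and the unquoted attribute value are not well-formed XML.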

See also: MarkupRecipes, MarkupTemplates