= [MarkupRecipes Markup Recipes]: Transforming HTML documents =

Although [MarkupTemplates Markup templates] must be valid XML files, you can still use Markup to transform “old-school” HTML documents. Markup can parse HTML input and apply ''match templates'' to it, which lets you make arbitrary modifications such as adding site-specific chrome.

Let's say you have the following HTML document (perhaps produced by an application or component outside your control) and you'd like to integrate it into your site:

{{{
#!xml
<HTML>
<HEAD>
<TITLE>Aaarrgh</TITLE>
<LINK REL=stylesheet href='badstyle.css'>
</HEAD>

<BODY>
<H1>Aaargh</H1>
<P>
<B>Lorem <I>ipsum</I></B> dolor sit amet, consectetur<BR>
adipisicing elit, sed do eiusmod tempor incididunt ut<BR>
labore et dolore magna aliqua. Ut enim ad minim veniam,<BR>
quis nostrud exercitation ullamco laboris nisi ut<BR>
aliquip ex ea commodo consequat.
</P>
<P>
Duis aute irure dolor in reprehenderit in voluptate velit<BR>
esse cillum dolore eu fugiat nulla pariatur. Excepteur sint<BR>
occaecat cupidatat non proident, sunt in culpa qui officia<BR>
deserunt mollit anim <I>id est laborum</I>.
</P>
</BODY>

</HTML>
}}}
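
To get a feel for what "parsing HTML input" means for markup like this, here is a minimal sketch that turns a fragment of the document into a list of events. It uses Python 3's standard `html.parser` module as a stand-in and is not Markup's own `HTMLParser`; the `EventCollector` class, the `parse_events` helper, and the event names are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Collect (kind, data) tuples; a stand-in for a markup event stream."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names and removes
        # the quotes (or lack of quotes) around attribute values.
        self.events.append(("START", (tag, dict(attrs))))

    def handle_endtag(self, tag):
        self.events.append(("END", tag))

    def handle_data(self, data):
        # Skip whitespace-only text to keep the listing short.
        if data.strip():
            self.events.append(("TEXT", data.strip()))


def parse_events(html):
    """Parse an HTML string and return the collected events."""
    collector = EventCollector()
    collector.feed(html)
    collector.close()
    return collector.events


SAMPLE = "<HEAD><TITLE>Aaarrgh</TITLE><LINK REL=stylesheet href='badstyle.css'></HEAD>"

for kind, data in parse_events(SAMPLE):
    print(kind, data)
```

The upper-case tag names and the unquoted `REL=stylesheet` attribute come out normalized, which is what makes it possible to write rules against the parsed input even though the source is not valid XML.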

What you'd like to do is:
 * Make it valid XHTML, with a proper DOCTYPE to trigger standards rendering mode in browsers.
 * Use “semantic” tags such as `<em>` and `<strong>` instead of the more presentational `<i>` and `<b>` (whether or not that's really a good idea).
 * Add a new `<div id="header">` at the top of the page that contains your site logo.

To do that, start with the following template:

{{{
#!xml
<!DOCTYPE html
    PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns:py="http://markup.edgewall.org/" py:strip="">

  <!--! Add a header DIV on top of every page with a logo image -->
  <body py:match="body">
    <div id="header">
      <img src="logo.png" alt="Bad Style"/>
    </div>
    ${select('*')}
  </body>

  <!--! Use semantic instead of presentational tags for emphasis -->
  <strong py:match="B|b">${select('*|text()')}</strong>
  <em py:match="I|i">${select('*|text()')}</em>

  <!--! Include the actual HTML stream, which will be processed by the rules
        defined above -->
  ${input}

</html>
}}}
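
The template defines three match templates: one wraps the content of `<body>` and prepends the header, and the other two replace `<b>` and `<i>` with `<strong>` and `<em>`. At the end, it pulls in the actual HTML content through the `input` variable, which the driving script supplies in the template context.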
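
To make the effect of those rules concrete, the following sketch applies the same three rewrites by hand using only Python 3's standard library. It is not how Markup implements match templates (there is no XPath matching here, and the rules are hard-coded); the `Rewriter` class, the `rewrite` helper, and the `RENAME`, `VOID`, and `HEADER` constants are hypothetical names chosen for this illustration.

```python
from html import escape
from html.parser import HTMLParser

RENAME = {"b": "strong", "i": "em"}  # mirrors py:match="B|b" and py:match="I|i"
VOID = {"br", "img", "link", "meta", "hr", "input"}  # HTML elements without an end tag
HEADER = '<div id="header"><img src="logo.png" alt="Bad Style" /></div>'


class Rewriter(HTMLParser):
    """Re-serialize HTML as XHTML-style markup while applying the rules."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        tag = RENAME.get(tag, tag)
        # Quote every attribute; a valueless attribute repeats its name.
        attr_text = "".join(
            ' %s="%s"' % (name, escape(name if value is None else value, quote=True))
            for name, value in attrs
        )
        if tag in VOID:
            self.out.append("<%s%s />" % (tag, attr_text))
        else:
            self.out.append("<%s%s>" % (tag, attr_text))
        if tag == "body":
            # Same idea as the py:match="body" rule: header first, then content.
            self.out.append(HEADER)

    def handle_endtag(self, tag):
        tag = RENAME.get(tag, tag)
        if tag not in VOID:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def rewrite(html):
    """Return the rewritten markup for an HTML string."""
    parser = Rewriter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(rewrite("<BODY><P><B>Lorem <I>ipsum</I></B> dolor<BR>sit amet</P></BODY>"))
```

One difference worth noting: this sketch copies all text through unchanged, whereas the `body` rule in the template uses `select('*')` and therefore keeps only the child elements of `<body>`.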

Finally, the following script drives the transformation:

{{{
#!python
import sys

from markup.input import HTMLParser
from markup.template import Context, Template

def transform(html_filename, tmpl_filename):
    # Compile the template that holds the match rules.
    tmpl_fileobj = open(tmpl_filename)
    try:
        tmpl = Template(tmpl_fileobj, tmpl_filename)
    finally:
        tmpl_fileobj.close()

    # Keep the HTML file open until rendering has finished, in case the
    # parser reads from it lazily while the stream is being consumed.
    html_fileobj = open(html_filename)
    try:
        html = HTMLParser(html_fileobj, html_filename)
        print tmpl.generate(Context(input=html)).render('xhtml')
    finally:
        html_fileobj.close()

if __name__ == '__main__':
    transform(sys.argv[1], sys.argv[2])
}}}

This would then produce the following output (ignoring some small whitespace differences):

{{{
#!xml
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<title>Aaarrgh</title>
<link rel="stylesheet" href="badstyle.css" />
</head>
<body>
<div id="header">
<img src="logo.png" alt="Bad Style" />
</div>
<h1>Aaargh</h1>
<p>
<strong>Lorem <em>ipsum</em></strong> dolor sit amet, consectetur<br />
adipisicing elit, sed do eiusmod tempor incididunt ut<br />
labore et dolore magna aliqua. Ut enim ad minim veniam,<br />
quis nostrud exercitation ullamco laboris nisi ut<br />
aliquip ex ea commodo consequat.
</p><p>
Duis aute irure dolor in reprehenderit in voluptate velit<br />
esse cillum dolore eu fugiat nulla pariatur. Excepteur sint<br />
occaecat cupidatat non proident, sunt in culpa qui officia<br />
deserunt mollit anim <em>id est laborum</em>.
</p>
</body>
</html>
}}}
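
If you want a quick sanity check on output like this, one option is to parse it with an XML parser. The sketch below uses Python 3's standard `xml.etree.ElementTree` on a shortened copy of the output; it only checks well-formedness and the three goals listed earlier, not validity against the XHTML DTD. The `XHTML` sample string and both helper functions are hypothetical names for this example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the transformed document, for illustration only.
XHTML = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head><title>Aaarrgh</title><link rel="stylesheet" href="badstyle.css" /></head>
<body>
<div id="header"><img src="logo.png" alt="Bad Style" /></div>
<h1>Aaargh</h1>
<p><strong>Lorem <em>ipsum</em></strong> dolor sit amet<br /></p>
</body>
</html>"""


def leftover_presentational(xhtml):
    """Parse as XML (raises ParseError if not well-formed) and
    return the names of any <b> or <i> elements still present."""
    root = ET.fromstring(xhtml)
    return [el.tag for el in root.iter() if el.tag in ("b", "i")]


def has_header(xhtml):
    """Return True if <body> contains a <div id="header"> child."""
    root = ET.fromstring(xhtml)
    return root.find("body/div[@id='header']") is not None


print(leftover_presentational(XHTML), has_header(XHTML))
```

Feeding the original HTML input to the same helpers raises `ParseError`, since unclosed `<BR>` tags and unquoted attributes are not well-formed XML.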

----
See also: MarkupRecipes, MarkupTemplates