Changes between Version 8 and Version 9 of MarkupStream
Timestamp: Sep 8, 2006, 10:08:45 AM
MarkupStream
{{{
#!rst
==============
Markup Streams
==============

A stream is the common representation of markup as a *stream of events*.


.. contents:: Contents
   :depth: 2
.. sectnum::


Basics
======

A stream can be obtained in a number of ways. It can be:

* the result of parsing XML or HTML text, or
* programmatically generated, or
* the result of selecting a subset of another stream filtered by an XPath
  expression.

For example, the functions ``XML()`` and ``HTML()`` can be used to convert
literal XML or HTML text to a markup stream::

  >>> from markup import XML
  >>> stream = XML('<p class="intro">Some text and '
  ...              '<a href="http://example.org/">a link</a>.'
  ...              '<br/></p>')
  >>> stream
  <markup.core.Stream object at 0x6bef0>

The stream is the result of parsing the text into events. Each event is a tuple
of the form ``(kind, data, pos)``, where:

* ``kind`` defines what kind of event it is (such as the start of an element,
  text, a comment, etc).
* ``data`` is the actual data associated with the event. How this looks depends
  on the event kind.
* ``pos`` is a ``(filename, lineno, column)`` tuple that describes where the
  event “comes from”.

::

  >>> for kind, data, pos in stream:
  ...     print kind, `data`, pos
  ...
  START (u'p', [(u'class', u'intro')]) ('<string>', 1, 0)
  TEXT u'Some text and ' ('<string>', 1, 31)
  START (u'a', [(u'href', u'http://example.org/')]) ('<string>', 1, 31)
  TEXT u'a link' ('<string>', 1, 67)
  END u'a' ('<string>', 1, 67)
  TEXT u'.' ('<string>', 1, 72)
  START (u'br', []) ('<string>', 1, 72)
  END u'br' ('<string>', 1, 77)
  END u'p' ('<string>', 1, 77)


Filtering
=========

One important feature of markup streams is that you can apply *filters* to the
stream, either filters that come with Markup, or your own custom filters.

A filter is simply a callable that accepts the stream as a parameter and returns
the filtered stream::

  def noop(stream):
      """A filter that doesn't actually do anything with the stream."""
      for kind, data, pos in stream:
          yield kind, data, pos

Filters can be applied in a number of ways. The simplest is to just call the
filter directly::

  stream = noop(stream)

The ``Stream`` class also provides a ``filter()`` method, which takes an
arbitrary number of filter callables and applies them all::

  stream = stream.filter(noop)

Finally, filters can also be applied using the *bitwise or* operator (``|``),
which allows a syntax similar to pipes on Unix shells::

  stream = stream | noop

One example of a filter included with Markup is the ``HTMLSanitizer`` in
``markup.filters``. It processes a stream of HTML markup, and strips out any
potentially dangerous constructs, such as JavaScript event handlers.
``HTMLSanitizer`` is not a function, but rather a class that implements
``__call__``, which means instances of the class are callable.

Both the ``filter()`` method and the pipe operator allow easy chaining of
filters::

  from markup.filters import HTMLSanitizer
  stream = stream.filter(noop, HTMLSanitizer())

That is equivalent to::

  stream = stream | noop | HTMLSanitizer()


Serialization
=============

The ``Stream`` class provides two methods for serializing this list of events:
``serialize()`` and ``render()``. The former is a generator that yields chunks
of ``Markup`` objects (which are basically unicode strings). The latter returns
a single string, by default UTF-8 encoded.

Here's the output from ``serialize()``::

  >>> for output in stream.serialize():
  ...     print `output`
  ...
  <Markup u'<p class="intro">'>
  <Markup u'Some text and '>
  <Markup u'<a href="http://example.org/">'>
  <Markup u'a link'>
  <Markup u'</a>'>
  <Markup u'.'>
  <Markup u'<br/>'>
  <Markup u'</p>'>

And here's the output from ``render()``::

  >>> print stream.render()
  <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br/></p>

Both methods can be passed a ``method`` parameter that determines how exactly
the events are serialized to text. This parameter can be either “xml” (the
default), “xhtml”, “html”, “text”, or a custom serializer class::

  >>> print stream.render('html')
  <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br></p>

Note how the ``<br>`` element isn't closed, which is the right thing to do for
HTML.

In addition, the ``render()`` method takes an ``encoding`` parameter, which
defaults to “UTF-8”. If set to ``None``, the result will be a unicode string.

The different serializer classes in ``markup.output`` can also be used
directly::

  >>> from markup.filters import HTMLSanitizer
  >>> from markup.output import TextSerializer
  >>> print TextSerializer()(HTMLSanitizer()(stream))
  Some text and a link.

The pipe operator allows a nicer syntax::

  >>> print stream | HTMLSanitizer() | TextSerializer()
  Some text and a link.

Using XPath
===========

XPath can be used to extract a specific subset of the stream via the
``select()`` method::

  >>> substream = stream.select('a')
  >>> substream
  <markup.core.Stream object at 0x7118b0>
  >>> print substream
  <a href="http://example.org/">a link</a>

Often, streams cannot be reused: in the above example, the sub-stream is based
on a generator. Once it has been serialized, it will have been fully consumed,
and cannot be rendered again. To work around this, you can wrap such a stream
in a ``list``::

  >>> from markup import Stream
  >>> substream = Stream(list(stream.select('a')))
  >>> substream
  <markup.core.Stream object at 0x7118b0>
  >>> print substream
  <a href="http://example.org/">a link</a>
  >>> print substream.select('@href')
  http://example.org/
  >>> print substream.select('text()')
  a link
}}}
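The filter mechanics described under Filtering, and the remark that generator-based streams can only be consumed once, can be tried out without the library. The following is a minimal sketch, not the implementation in ``markup.core``: ``MiniStream`` and ``upper_text`` are hypothetical names invented for this illustration, the event positions are made up, and the code uses Python 3 syntax rather than the Python 2 syntax of the examples on this page. It only assumes what the page states: events are ``(kind, data, pos)`` tuples, a filter is a callable that takes a stream and returns a stream, and ``filter()`` and ``|`` chain such callables.

```python
from functools import reduce

# Event kinds, named after the ones that appear in the output on this page.
START, END, TEXT = "START", "END", "TEXT"


class MiniStream:
    """Hypothetical stand-in for a markup stream (not the library's Stream)."""

    def __init__(self, events):
        # Any iterable of (kind, data, pos) tuples: a list or a generator.
        self.events = events

    def __iter__(self):
        return iter(self.events)

    def filter(self, *filters):
        # Apply the filters left to right, feeding each one the previous output.
        return MiniStream(reduce(lambda events, f: f(events), filters, self))

    def __or__(self, function):
        # "stream | f" is shorthand for "stream.filter(f)".
        return self.filter(function)


def noop(stream):
    """A filter that passes every event through unchanged."""
    for kind, data, pos in stream:
        yield kind, data, pos


def upper_text(stream):
    """An illustrative custom filter that upper-cases TEXT events."""
    for kind, data, pos in stream:
        if kind == TEXT:
            data = data.upper()
        yield kind, data, pos


# Hand-written events; the positions are illustrative only.
events = [
    (START, ("p", [("class", "intro")]), ("<string>", 1, 0)),
    (TEXT, "Some text", ("<string>", 1, 17)),
    (END, "p", ("<string>", 1, 26)),
]

# Both spellings build the same chain of generators.
piped = list(MiniStream(events) | noop | upper_text)
chained = list(MiniStream(events).filter(noop, upper_text))

# A filtered stream wraps a generator, so it can be consumed only once.
once = MiniStream(events) | noop
first = list(once)
second = list(once)  # the generator is already exhausted here
```

In this sketch ``second`` comes out empty because the filtered stream holds a generator that the first pass has already exhausted. Wrapping the events in a ``list`` before building the stream, as shown under Using XPath, avoids that.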