{{{
#!rst
==============
Markup Streams
==============

A stream is the common representation of markup as a *stream of events*.


.. contents:: Contents
   :depth: 2
.. sectnum::


Basics
======

A stream can be obtained in a number of ways. It can be:

* the result of parsing XML or HTML text, or
* programmatically generated, or
* the result of selecting a subset of another stream with an XPath
  expression.

For example, the functions ``XML()`` and ``HTML()`` can be used to convert
literal XML or HTML text to a markup stream::

  >>> from markup import XML
  >>> stream = XML('<p class="intro">Some text and '
  ...              '<a href="http://example.org/">a link</a>.'
  ...              '<br/></p>')
  >>> stream
  <markup.core.Stream object at 0x6bef0>

The stream is the result of parsing the text into events. Each event is a tuple
of the form ``(kind, data, pos)``, where:

* ``kind`` defines what kind of event it is (such as the start of an element,
  text, a comment, etc.).
* ``data`` is the actual data associated with the event. Its structure depends
  on the event kind.
* ``pos`` is a ``(filename, lineno, column)`` tuple that describes where the
  event “comes from”.

Iterating over the stream yields the individual events::

  >>> for kind, data, pos in stream:
  ...     print kind, repr(data), pos
  ...
  START (u'p', [(u'class', u'intro')]) ('<string>', 1, 0)
  TEXT u'Some text and ' ('<string>', 1, 31)
  START (u'a', [(u'href', u'http://example.org/')]) ('<string>', 1, 31)
  TEXT u'a link' ('<string>', 1, 67)
  END u'a' ('<string>', 1, 67)
  TEXT u'.' ('<string>', 1, 72)
  START (u'br', []) ('<string>', 1, 72)
  END u'br' ('<string>', 1, 77)
  END u'p' ('<string>', 1, 77)


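Because events are ordinary tuples, a stream can also be produced by hand. The following is a minimal sketch in plain Python 3 that does not import Markup: the string constants stand in for the library's event kinds, the helper ``link()`` is hypothetical, and the ``(None, -1, -1)`` position is only a placeholder for "no source location".

```python
# Plain-string stand-ins for the event kinds (assumption, not the library's constants).
START, END, TEXT = 'START', 'END', 'TEXT'

# Placeholder position for events that do not come from parsed source text.
NO_POS = (None, -1, -1)


def link(href, label, pos=NO_POS):
    """Yield the three events that make up <a href="...">label</a>."""
    yield START, ('a', [('href', href)]), pos
    yield TEXT, label, pos
    yield END, 'a', pos


events = list(link('http://example.org/', 'a link'))
for kind, data, pos in events:
    print(kind, repr(data), pos)
```

Any iterable of such tuples has the same shape as the parsed stream shown above, which is what lets the same filters work on both.
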
Filtering
=========

One important feature of markup streams is that you can apply *filters* to the
stream, either filters that come with Markup or your own custom filters.

A filter is simply a callable that accepts the stream as its parameter and
returns the filtered stream::

  def noop(stream):
      """A filter that doesn't actually do anything with the stream."""
      for kind, data, pos in stream:
          yield kind, data, pos

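A filter that does something follows the same pattern: inspect each event, change or drop it, and yield the rest untouched. This sketch is plain Python 3 and works on hand-written event tuples rather than a real ``Stream``; the filter name ``upper_text`` and the positions in the sample data are made up for illustration.

```python
def upper_text(stream):
    """A filter that upper-cases text events and passes all others through."""
    for kind, data, pos in stream:
        if kind == 'TEXT':
            data = data.upper()
        yield kind, data, pos


# Hand-written sample events; the positions are illustrative only.
events = [
    ('START', ('p', []), ('<string>', 1, 0)),
    ('TEXT', 'Some text', ('<string>', 1, 3)),
    ('END', 'p', ('<string>', 1, 12)),
]

filtered = list(upper_text(events))
for event in filtered:
    print(event)
```

Since the filter is a generator, it processes one event at a time and never needs the whole document in memory.
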
Filters can be applied in a number of ways. The simplest is to just call the
filter directly::

  stream = noop(stream)

The ``Stream`` class also provides a ``filter()`` method, which takes an
arbitrary number of filter callables and applies them all::

  stream = stream.filter(noop)

Finally, filters can also be applied using the *bitwise or* operator (``|``),
which allows a syntax similar to pipes on Unix shells::

  stream = stream | noop

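To see how a ``filter()`` method and the ``|`` operator can amount to the same thing, here is a small stand-in class in plain Python 3. ``MiniStream`` and ``drop_text`` are hypothetical names, and this is a sketch of the general technique (overloading ``__or__``), not the actual ``Stream`` implementation.

```python
class MiniStream:
    """Hypothetical stand-in for a stream: wraps any iterable of events."""

    def __init__(self, events):
        self.events = events

    def __iter__(self):
        return iter(self.events)

    def filter(self, *filters):
        """Apply the filters in the order given and wrap the result."""
        events = self.events
        for func in filters:
            events = func(events)
        return MiniStream(events)

    def __or__(self, func):
        """Make ``stream | func`` shorthand for ``stream.filter(func)``."""
        return self.filter(func)


def noop(stream):
    """Pass every event through unchanged."""
    for event in stream:
        yield event


def drop_text(stream):
    """Remove all text events."""
    for kind, data, pos in stream:
        if kind != 'TEXT':
            yield kind, data, pos


events = [('START', ('p', []), None), ('TEXT', 'hi', None), ('END', 'p', None)]
piped = list(MiniStream(events) | noop | drop_text)
chained = list(MiniStream(events).filter(noop, drop_text))
print(piped == chained)
```

Each ``|`` returns a new wrapper around a lazily evaluated generator, so no filter runs until the final stream is iterated.
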
One example of a filter included with Markup is the ``HTMLSanitizer`` in
``markup.filters``. It processes a stream of HTML markup and strips out any
potentially dangerous constructs, such as JavaScript event handlers.
``HTMLSanitizer`` is not a function, but rather a class that implements
``__call__``, which means instances of the class are callable.

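The callable-class pattern is useful whenever a filter needs configuration. The sketch below, in plain Python 3, shows only that pattern: ``AttributeStripper`` is a hypothetical class, far simpler than ``HTMLSanitizer``, and it should not be read as a description of how the real sanitizer works.

```python
class AttributeStripper:
    """Hypothetical filter that removes the named attributes from start tags."""

    def __init__(self, names):
        # Configuration is stored on the instance when the filter is created.
        self.names = frozenset(names)

    def __call__(self, stream):
        # Calling the instance runs the filter, exactly like a function would.
        for kind, data, pos in stream:
            if kind == 'START':
                tag, attrs = data
                attrs = [(k, v) for k, v in attrs if k not in self.names]
                data = (tag, attrs)
            yield kind, data, pos


events = [
    ('START', ('a', [('href', '/'), ('onclick', 'evil()')]), None),
    ('TEXT', 'home', None),
    ('END', 'a', None),
]

strip_handlers = AttributeStripper(['onclick', 'onload'])
cleaned = list(strip_handlers(events))
print(cleaned[0])
```

An instance such as ``strip_handlers`` can be used anywhere a plain filter function is accepted.
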
Both the ``filter()`` method and the pipe operator allow easy chaining of
filters::

  from markup.filters import HTMLSanitizer
  stream = stream.filter(noop, HTMLSanitizer())

That is equivalent to::

  stream = stream | noop | HTMLSanitizer()


Serialization
=============

The ``Stream`` class provides two methods for serializing the stream of events:
``serialize()`` and ``render()``. The former is a generator that yields the
output in chunks, each a ``Markup`` object (which is basically a unicode
string). The latter returns a single string, by default UTF-8 encoded.

Here's the output from ``serialize()``::

  >>> for output in stream.serialize():
  ...     print repr(output)
  ...
  <Markup u'<p class="intro">'>
  <Markup u'Some text and '>
  <Markup u'<a href="http://example.org/">'>
  <Markup u'a link'>
  <Markup u'</a>'>
  <Markup u'.'>
  <Markup u'<br/>'>
  <Markup u'</p>'>

And here's the output from ``render()``::

  >>> print stream.render()
  <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br/></p>

Both methods can be passed a ``method`` parameter that determines how exactly
the events are serialized to text. This parameter can be either “xml” (the
default), “xhtml”, “html”, “text”, or a custom serializer class::

  >>> print stream.render('html')
  <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br></p>

Note how the ``<br>`` element isn't closed, which is the right thing to do for
HTML.

In addition, the ``render()`` method takes an ``encoding`` parameter, which
defaults to “UTF-8”. If set to ``None``, the result will be a unicode string.

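The relationship between the two methods can be illustrated with a small stand-alone sketch in plain Python 3. The functions ``serialize()`` and ``render()`` below are simplified stand-ins written for this example, supporting only the “xml” and “html” methods; the ``VOID_HTML`` set is an assumption, and none of this is the library's own serializer code.

```python
from xml.sax.saxutils import escape, quoteattr

# Elements written without an end tag in HTML output (assumed, incomplete list).
VOID_HTML = {'br', 'hr', 'img', 'input', 'link', 'meta'}


def serialize(events, method='xml'):
    """Yield the output in chunks of text (sketch: 'xml' and 'html' only)."""
    events = list(events)
    skip_end = False
    for index, (kind, data, pos) in enumerate(events):
        if skip_end:  # this END was already covered by the preceding START
            skip_end = False
            continue
        if kind == 'START':
            tag, attrs = data
            text = '<' + tag + ''.join(
                ' %s=%s' % (name, quoteattr(value)) for name, value in attrs)
            # An element is empty when its END follows immediately.
            empty = index + 1 < len(events) and events[index + 1][0] == 'END'
            if empty and method == 'xml':
                skip_end = True
                yield text + '/>'
            elif empty and method == 'html' and tag in VOID_HTML:
                skip_end = True
                yield text + '>'
            else:
                yield text + '>'
        elif kind == 'END':
            yield '</%s>' % data
        elif kind == 'TEXT':
            yield escape(data)


def render(events, method='xml', encoding='utf-8'):
    """Join the chunks; return bytes unless ``encoding`` is None."""
    output = ''.join(serialize(events, method))
    return output if encoding is None else output.encode(encoding)


events = [
    ('START', ('p', [('class', 'intro')]), None),
    ('TEXT', 'Some text and ', None),
    ('START', ('a', [('href', 'http://example.org/')]), None),
    ('TEXT', 'a link', None),
    ('END', 'a', None),
    ('TEXT', '.', None),
    ('START', ('br', []), None),
    ('END', 'br', None),
    ('END', 'p', None),
]

print(render(events, encoding=None))
print(render(events, 'html', encoding=None))
```

In this sketch ``render()`` is just ``serialize()`` plus joining and optional encoding, so anything that can consume chunks incrementally (for example, a network response) can use ``serialize()`` directly.
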
The different serializer classes in ``markup.output`` can also be used
directly::

  >>> from markup.filters import HTMLSanitizer
  >>> from markup.output import TextSerializer
  >>> print TextSerializer()(HTMLSanitizer()(stream))
  Some text and a link.

The pipe operator allows a nicer syntax::

  >>> print stream | HTMLSanitizer() | TextSerializer()
  Some text and a link.


Using XPath
===========

XPath can be used to extract a specific subset of the stream via the
``select()`` method::

  >>> substream = stream.select('a')
  >>> substream
  <markup.core.Stream object at 0x7118b0>
  >>> print substream
  <a href="http://example.org/">a link</a>

Often, streams cannot be reused: in the above example, the sub-stream is based
on a generator. Once it has been serialized, it will have been fully consumed,
and cannot be rendered again. To work around this, you can buffer the events in
a ``list`` and wrap that in a new ``Stream``::

  >>> from markup import Stream
  >>> substream = Stream(list(stream.select('a')))
  >>> substream
  <markup.core.Stream object at 0x7118b0>
  >>> print substream
  <a href="http://example.org/">a link</a>
  >>> print substream.select('@href')
  http://example.org/
  >>> print substream.select('text()')
  a link

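The one-shot behaviour comes from Python generators in general, and can be reproduced without Markup. In this plain Python 3 sketch, ``select_element()`` is a hypothetical helper that matches elements by tag name only; it is a much simpler substitute for XPath selection, used here just to obtain a generator-based sub-stream.

```python
def select_element(events, name):
    """Yield the events of every element called ``name``, including its content."""
    depth = 0
    for kind, data, pos in events:
        # Enter a match, or go one level deeper inside an existing match.
        if kind == 'START' and (depth or data[0] == name):
            depth += 1
        if depth:
            yield kind, data, pos
        if kind == 'END' and depth:
            depth -= 1


events = [
    ('START', ('p', [('class', 'intro')]), None),
    ('TEXT', 'Some text and ', None),
    ('START', ('a', [('href', 'http://example.org/')]), None),
    ('TEXT', 'a link', None),
    ('END', 'a', None),
    ('TEXT', '.', None),
    ('END', 'p', None),
]

substream = select_element(events, 'a')   # a generator: single use
first_pass = list(substream)
second_pass = list(substream)             # nothing left to yield

reusable = list(select_element(events, 'a'))  # buffered: can be iterated again
print(len(first_pass), len(second_pass), len(reusable))
```

Buffering trades memory for reusability, so it is best reserved for sub-streams that really are processed more than once.
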
}}}

[[Include(trunk/doc/streams.txt)]]