.. -*- mode: rst; encoding: utf-8 -*-

==============
Markup Streams
==============

A stream is the common representation of markup as a *stream of events*.


.. contents:: Contents
   :depth: 2
.. sectnum::


Basics
======

A stream can be obtained in a number of ways. It can be:

* the result of parsing XML or HTML text, or
* the result of selecting a subset of another stream using XPath, or
* programmatically generated.

For example, the functions ``XML()`` and ``HTML()`` can be used to convert
literal XML or HTML text to a markup stream:

.. code-block:: pycon

  >>> from genshi import XML
  >>> stream = XML('<p class="intro">Some text and '
  ...              '<a href="http://example.org/">a link</a>.'
  ...              '<br/></p>')
  >>> stream
  <genshi.core.Stream object at ...>

The stream is the result of parsing the text into events. Each event is a tuple
of the form ``(kind, data, pos)``, where:

* ``kind`` defines what kind of event it is (such as the start of an element,
  text, a comment, etc.).
* ``data`` is the actual data associated with the event. How this looks depends
  on the event kind (see `event kinds`_).
* ``pos`` is a ``(filename, lineno, column)`` tuple that describes where the
  event “comes from”.

Iterating over the stream yields the individual events:

.. code-block:: pycon

  >>> for kind, data, pos in stream:
  ...     print('%s %r %r' % (kind, data, pos))
  ...
  START (QName('p'), Attrs([(QName('class'), u'intro')])) (None, 1, 0)
  TEXT u'Some text and ' (None, 1, 17)
  START (QName('a'), Attrs([(QName('href'), u'http://example.org/')])) (None, 1, 31)
  TEXT u'a link' (None, 1, 61)
  END QName('a') (None, 1, 67)
  TEXT u'.' (None, 1, 71)
  START (QName('br'), Attrs()) (None, 1, 72)
  END QName('br') (None, 1, 77)
  END QName('p') (None, 1, 77)

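Because every event is a plain three-item tuple, a stream can be processed
with an ordinary ``for`` loop. The following is only a minimal sketch that
runs without Genshi: it assumes a hand-written list of tuples in which plain
strings stand in for the event kinds and for the ``QName`` and ``Attrs``
objects shown above, and ``summarize`` is a hypothetical helper name. It
collects the names of all opened tags and concatenates the text:

.. code-block:: python

  # Hand-built stand-in for the events of the example stream above.
  # Assumption: plain strings replace Genshi's kind constants and QName objects.
  events = [
      ('START', ('p', [('class', 'intro')]), (None, 1, 0)),
      ('TEXT', 'Some text and ', (None, 1, 17)),
      ('START', ('a', [('href', 'http://example.org/')]), (None, 1, 31)),
      ('TEXT', 'a link', (None, 1, 61)),
      ('END', 'a', (None, 1, 67)),
      ('TEXT', '.', (None, 1, 71)),
      ('START', ('br', []), (None, 1, 72)),
      ('END', 'br', (None, 1, 77)),
      ('END', 'p', (None, 1, 77)),
  ]

  def summarize(events):
      """Return the list of opened tag names and the concatenated text."""
      tags, text = [], []
      for kind, data, pos in events:
          if kind == 'START':
              tagname, attrs = data  # START data is a (tagname, attrs) pair
              tags.append(tagname)
          elif kind == 'TEXT':
              text.append(data)      # TEXT data is the character data itself
      return tags, ''.join(text)

  tags, text = summarize(events)
  print(tags)
  print(text)

The same loop structure should carry over to a real stream, where the
unpacked ``data`` has the shapes described under `event kinds`_.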

Filtering
=========

One important feature of markup streams is that you can apply *filters* to the
stream, either filters that come with Genshi, or your own custom filters.

A filter is simply a callable that accepts the stream as its parameter and
returns the filtered stream:

.. code-block:: python

  def noop(stream):
      """A filter that doesn't actually do anything with the stream."""
      for kind, data, pos in stream:
          yield kind, data, pos

Filters can be applied in a number of ways. The simplest is to just call the
filter directly:

.. code-block:: python

  stream = noop(stream)

The ``Stream`` class also provides a ``filter()`` method, which takes an
arbitrary number of filter callables and applies them all:

.. code-block:: python

  stream = stream.filter(noop)

Finally, filters can also be applied using the *bitwise or* operator (``|``),
which allows a syntax similar to pipes on Unix shells:

.. code-block:: python

  stream = stream | noop

One example of a filter included with Genshi is the ``HTMLSanitizer`` in
``genshi.filters``. It processes a stream of HTML markup, and strips out any
potentially dangerous constructs, such as JavaScript event handlers.
``HTMLSanitizer`` is not a function, but rather a class that implements
``__call__``, which means instances of the class are callable:

.. code-block:: python

  from genshi.filters import HTMLSanitizer
  stream = stream | HTMLSanitizer()

Both the ``filter()`` method and the pipe operator allow easy chaining of
filters:

.. code-block:: python

  from genshi.filters import HTMLSanitizer
  stream = stream.filter(noop, HTMLSanitizer())

That is equivalent to:

.. code-block:: python

  stream = stream | noop | HTMLSanitizer()

For more information about the built-in filters, see `Stream Filters`_.

.. _`Stream Filters`: filters.html

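To illustrate both styles of filter described in this section, here is a
minimal sketch of a generator function and of a configurable class that
implements ``__call__``. The names ``upper_text`` and ``DropElements`` are
hypothetical and not part of Genshi, and the events are a hand-written list in
which plain strings stand in for the kind constants and ``QName`` objects, so
the example runs without Genshi installed:

.. code-block:: python

  def upper_text(stream):
      """Filter that upper-cases the data of every TEXT event."""
      for kind, data, pos in stream:
          if kind == 'TEXT':
              yield kind, data.upper(), pos
          else:
              yield kind, data, pos

  class DropElements:
      """Configurable filter that removes the named elements and their content.

      Hypothetical helper for illustration only.
      """

      def __init__(self, names):
          self.names = set(names)

      def __call__(self, stream):
          depth = 0  # greater than zero while inside a dropped element
          for kind, data, pos in stream:
              if kind == 'START' and (depth or data[0] in self.names):
                  depth += 1
              elif kind == 'END' and depth:
                  depth -= 1
              elif not depth:
                  yield kind, data, pos

  # Stand-in events (assumption: strings instead of Genshi objects).
  events = [
      ('START', ('p', []), (None, 1, 0)),
      ('TEXT', 'safe ', (None, 1, 3)),
      ('START', ('script', []), (None, 1, 8)),
      ('TEXT', 'alert(1)', (None, 1, 16)),
      ('END', 'script', (None, 1, 24)),
      ('TEXT', 'text', (None, 1, 33)),
      ('END', 'p', (None, 1, 37)),
  ]

  # Call the filters directly, innermost first.
  filtered = list(upper_text(DropElements(['script'])(events)))
  text = ''.join(data for kind, data, pos in filtered if kind == 'TEXT')
  print(text)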

Serialization
=============

Serialization means producing some kind of textual output from a stream of
events, which you'll need when you want to transmit or store the results of
generating or otherwise processing markup.

The ``Stream`` class provides two methods for serialization: ``serialize()``
and ``render()``. The former is a generator that yields chunks of ``Markup``
objects (which are basically unicode strings that are considered safe for
output on the web). The latter returns a single string, by default UTF-8
encoded.

Here's the output from ``serialize()``:

.. code-block:: pycon

  >>> for output in stream.serialize():
  ...     print(repr(output))
  ...
  <Markup u'<p class="intro">'>
  <Markup u'Some text and '>
  <Markup u'<a href="http://example.org/">'>
  <Markup u'a link'>
  <Markup u'</a>'>
  <Markup u'.'>
  <Markup u'<br/>'>
  <Markup u'</p>'>

And here's the output from ``render()``:

.. code-block:: pycon

  >>> print(stream.render())
  <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br/></p>

Both methods can be passed a ``method`` parameter that determines how exactly
the events are serialized to text. This parameter can be either a string or a
custom serializer class:

.. code-block:: pycon

  >>> print(stream.render('html'))
  <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br></p>

Note how the ``<br>`` element isn't closed, which is the right thing to do for
HTML. See `serialization methods`_ for more details.

In addition, the ``render()`` method takes an ``encoding`` parameter, which
defaults to “UTF-8”. If set to ``None``, the result will be a unicode string.

The different serializer classes in ``genshi.output`` can also be used
directly:

.. code-block:: pycon

  >>> from genshi.filters import HTMLSanitizer
  >>> from genshi.output import TextSerializer
  >>> print(''.join(TextSerializer()(HTMLSanitizer()(stream))))
  Some text and a link.

The pipe operator allows a nicer syntax:

.. code-block:: pycon

  >>> print(stream | HTMLSanitizer() | TextSerializer())
  Some text and a link.

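The relationship between the two methods can be pictured with a toy
serializer. This is a sketch under stated assumptions, not Genshi's
implementation: the functions below are stand-ins that operate on a
hand-written event list with plain strings for kinds and names, they handle
only ``START``, ``END`` and ``TEXT`` events, and they do not collapse empty
elements. The sketch shows a generator yielding one chunk per event, and a
second function joining the chunks and encoding them unless ``encoding`` is
``None``:

.. code-block:: python

  from xml.sax.saxutils import escape, quoteattr

  def serialize(events):
      """Yield one chunk of text per event (toy XML output)."""
      for kind, data, pos in events:
          if kind == 'START':
              tagname, attrs = data
              attr_text = ''.join(' %s=%s' % (name, quoteattr(value))
                                  for name, value in attrs)
              yield '<%s%s>' % (tagname, attr_text)
          elif kind == 'END':
              yield '</%s>' % data
          elif kind == 'TEXT':
              yield escape(data)  # escape &, < and > in character data

  def render(events, encoding='utf-8'):
      """Join all chunks; encode the result unless ``encoding`` is None."""
      output = ''.join(serialize(events))
      if encoding is None:
          return output
      return output.encode(encoding)

  pos = (None, 1, 0)  # positions are not used by this sketch
  events = [
      ('START', ('p', [('class', 'intro')]), pos),
      ('TEXT', 'Fish & chips', pos),
      ('END', 'p', pos),
  ]

  chunks = list(serialize(events))      # one string per event
  text = render(events, encoding=None)  # a single text string
  data = render(events)                 # the same content as UTF-8 bytes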

.. _`serialization methods`:

Serialization Methods
---------------------

Genshi supports different serialization methods for creating a text
representation of a markup stream.

``xml``
  The ``XMLSerializer`` is the default serialization method and results in
  proper XML output including namespace support, the XML declaration, CDATA
  sections, and so on. It is generally not suitable for serving HTML or
  XHTML web pages (unless you want to use true XHTML 1.1), for which the
  ``xhtml`` and ``html`` serializers described below should be preferred.

``xhtml``
  The ``XHTMLSerializer`` is a specialization of the generic ``XMLSerializer``
  that understands the peculiarities of producing XML-compliant output that can
  also be parsed without problems by the HTML parsers found in modern web
  browsers. Thus, the output of this serializer should be usable whether sent
  as "text/html" or "application/xhtml+xml" (although there are a lot of
  subtle issues to pay attention to when switching between the two, in
  particular with respect to differences in the DOM and CSS).

  For example, instead of rendering a script tag as ``<script/>`` (which
  confuses the HTML parser in many browsers), it will produce
  ``<script></script>``. Also, it will normalize any boolean attribute values
  that are minimized in HTML, so that for example ``<hr noshade="1"/>``
  becomes ``<hr noshade="noshade" />``.

  This serializer supports the use of namespaces for compound documents, for
  example to use inline SVG inside an XHTML document.

``html``
  The ``HTMLSerializer`` produces proper HTML markup. The main differences
  compared to ``xhtml`` serialization are that boolean attributes are
  minimized, empty tags are not self-closing (so it's ``<br>`` instead of
  ``<br />``), and that the contents of ``<script>`` and ``<style>`` elements
  are not escaped.

``text``
  The ``TextSerializer`` produces plain text from markup streams. This is
  useful primarily for `text templates`_, but can also be used to produce
  plain text output from markup templates or other sources.

.. _`text templates`: text-templates.html

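The differences in how the ``xml``, ``xhtml`` and ``html`` methods treat
elements without content can be summarized in a few lines. The function below
is a hypothetical illustration that mirrors only the examples given above; it
is not how Genshi's serializers are implemented, and the ``VOID_ELEMENTS`` set
is an assumed, deliberately incomplete list:

.. code-block:: python

  # Assumption: a small sample of HTML elements that never have content.
  VOID_ELEMENTS = frozenset(['br', 'hr', 'img', 'input', 'link', 'meta'])

  def empty_element(tagname, method):
      """Return the text a serializer might emit for an element with no content."""
      if method == 'xml':
          # Generic XML: any empty element may be collapsed.
          return '<%s/>' % tagname
      if method == 'xhtml':
          # XHTML: collapse only void elements, keep e.g. <script></script>.
          if tagname in VOID_ELEMENTS:
              return '<%s />' % tagname
          return '<%s></%s>' % (tagname, tagname)
      if method == 'html':
          # HTML: void elements have no closing tag at all.
          if tagname in VOID_ELEMENTS:
              return '<%s>' % tagname
          return '<%s></%s>' % (tagname, tagname)
      raise ValueError('unknown method: %r' % method)

  for method in ('xml', 'xhtml', 'html'):
      print(method, empty_element('br', method), empty_element('script', method))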

Serialization Options
---------------------

Both ``serialize()`` and ``render()`` support additional keyword arguments that
are passed through to the initializer of the serializer class. The following
options are supported by the built-in serializers:

``strip_whitespace``
  Whether the serializer should remove trailing spaces and empty lines.
  Defaults to ``True``.

  (This option is not available for serialization to plain text.)

``doctype``
  A ``(name, pubid, sysid)`` tuple defining the name, public identifier, and
  system identifier of a ``DOCTYPE`` declaration to prepend to the generated
  output. If provided, this declaration will override any ``DOCTYPE``
  declaration in the stream.

  The parameter can also be specified as a string to refer to commonly used
  doctypes:

  +-----------------------------+-------------------------------------------+
  | Shorthand                   | DOCTYPE                                   |
  +=============================+===========================================+
  | ``html`` or                 | HTML 4.01 Strict                          |
  | ``html-strict``             |                                           |
  +-----------------------------+-------------------------------------------+
  | ``html-transitional``       | HTML 4.01 Transitional                    |
  +-----------------------------+-------------------------------------------+
  | ``html-frameset``           | HTML 4.01 Frameset                        |
  +-----------------------------+-------------------------------------------+
  | ``html5``                   | DOCTYPE proposed for the work-in-progress |
  |                             | HTML5 standard                            |
  +-----------------------------+-------------------------------------------+
  | ``xhtml`` or                | XHTML 1.0 Strict                          |
  | ``xhtml-strict``            |                                           |
  +-----------------------------+-------------------------------------------+
  | ``xhtml-transitional``      | XHTML 1.0 Transitional                    |
  +-----------------------------+-------------------------------------------+
  | ``xhtml-frameset``          | XHTML 1.0 Frameset                        |
  +-----------------------------+-------------------------------------------+
  | ``xhtml11``                 | XHTML 1.1                                 |
  +-----------------------------+-------------------------------------------+
  | ``svg`` or ``svg-full``     | SVG 1.1                                   |
  +-----------------------------+-------------------------------------------+
  | ``svg-basic``               | SVG 1.1 Basic                             |
  +-----------------------------+-------------------------------------------+
  | ``svg-tiny``                | SVG 1.1 Tiny                              |
  +-----------------------------+-------------------------------------------+

  (This option is not available for serialization to plain text.)

``namespace_prefixes``
  The namespace prefixes to use for namespaces that are not bound to a prefix
  in the stream itself.

  (This option is not available for serialization to HTML or plain text.)

``drop_xml_decl``
  Whether to remove the XML declaration (the ``<?xml ?>`` part at the
  beginning of a document) when serializing. This defaults to ``True`` as an
  XML declaration throws some older browsers into "Quirks" rendering mode.

  (This option is only available for serialization to XHTML.)

``strip_markup``
  Whether the text serializer should detect and remove any tags or
  entity-encoded characters in the text.

  (This option is only available for serialization to plain text.)


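To make the ``(name, pubid, sysid)`` form of the ``doctype`` option more
concrete, the following sketch turns such a tuple into the declaration text it
describes. ``doctype_declaration`` is a hypothetical helper written for this
illustration and should not be read as Genshi's own code; either identifier
may be ``None``:

.. code-block:: python

  def doctype_declaration(doctype):
      """Format a (name, pubid, sysid) tuple as a DOCTYPE declaration."""
      name, pubid, sysid = doctype
      parts = ['<!DOCTYPE %s' % name]
      if pubid:
          parts.append(' PUBLIC "%s"' % pubid)
      elif sysid:
          # A system identifier without a public one needs the SYSTEM keyword.
          parts.append(' SYSTEM')
      if sysid:
          parts.append(' "%s"' % sysid)
      parts.append('>')
      return ''.join(parts)

  xhtml_transitional = (
      'html',
      '-//W3C//DTD XHTML 1.0 Transitional//EN',
      'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd',
  )
  print(doctype_declaration(xhtml_transitional))
  print(doctype_declaration(('html', None, None)))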

Using XPath
===========

XPath can be used to extract a specific subset of the stream via the
``select()`` method:

.. code-block:: pycon

  >>> substream = stream.select('a')
  >>> substream
  <genshi.core.Stream object at ...>
  >>> print(substream)
  <a href="http://example.org/">a link</a>

Often, streams cannot be reused: in the above example, the sub-stream is based
on a generator. Once it has been serialized, it will have been fully consumed,
and cannot be rendered again. To work around this, you can wrap such a stream
in a ``list``:

.. code-block:: pycon

  >>> from genshi import Stream
  >>> substream = Stream(list(stream.select('a')))
  >>> substream
  <genshi.core.Stream object at ...>
  >>> print(substream)
  <a href="http://example.org/">a link</a>
  >>> print(substream.select('@href'))
  http://example.org/
  >>> print(substream.select('text()'))
  a link

See `Using XPath in Genshi`_ for more information about the XPath support in
Genshi.

.. _`Using XPath in Genshi`: xpath.html

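The single-use behavior described above is a property of Python generators in
general, and can be reproduced without Genshi. In this sketch,
``select_links`` is a hypothetical generator standing in for a selected
sub-stream (it does not handle nested ``a`` elements), and the events are a
hand-written list with plain strings in place of Genshi's objects:

.. code-block:: python

  def select_links(events):
      """Generator stand-in for a sub-stream: yields the events of <a> elements."""
      inside = False
      for kind, data, pos in events:
          if kind == 'START' and data[0] == 'a':
              inside = True
          if inside:
              yield kind, data, pos
          if kind == 'END' and data == 'a':
              inside = False

  events = [
      ('START', ('p', []), (None, 1, 0)),
      ('START', ('a', [('href', 'http://example.org/')]), (None, 1, 3)),
      ('TEXT', 'a link', (None, 1, 33)),
      ('END', 'a', (None, 1, 39)),
      ('END', 'p', (None, 1, 43)),
  ]

  substream = select_links(events)
  first = list(substream)   # consumes the generator
  second = list(substream)  # nothing left: the generator is exhausted

  # Buffering the events in a list makes them reusable.
  buffered = list(select_links(events))
  print(len(first), len(second), len(buffered))

Wrapping the selection in ``list`` trades memory for reusability, since all
selected events are held at once.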

.. _`event kinds`:

Event Kinds
===========

Every event in a stream is of one of several *kinds*, which also determines
what the ``data`` item of the event tuple looks like. The different kinds of
events are documented below.

.. note:: The ``data`` item is generally immutable. If the data is to be
   modified when processing a stream, it must be replaced by a new tuple.
   Effectively, this means the entire event tuple is immutable.

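In practice, this means a filter that wants to change an event yields a newly
built tuple in its place. The sketch below assumes stand-in events in which
plain strings and plain tuples replace ``QName`` and ``Attrs``;
``add_nofollow`` is a hypothetical filter name, and Genshi's own ``Attrs``
class may offer more convenient ways to derive a modified copy:

.. code-block:: python

  def add_nofollow(stream):
      """Filter that adds rel="nofollow" to every <a> start tag."""
      for kind, data, pos in stream:
          if kind == 'START' and data[0] == 'a':
              tagname, attrs = data
              # Build a new attribute sequence and a new data tuple rather
              # than changing the existing objects in place.
              new_attrs = tuple(attrs) + (('rel', 'nofollow'),)
              yield kind, (tagname, new_attrs), pos
          else:
              yield kind, data, pos

  original = ('START', ('a', (('href', 'http://example.org/'),)), (None, 1, 0))
  result = list(add_nofollow([original]))
  print(result[0][1])
  print(original[1])  # the original event is left untouched
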
START
-----
The opening tag of an element.

For this kind of event, the ``data`` item is a tuple of the form
``(tagname, attrs)``, where ``tagname`` is a ``QName`` instance describing the
qualified name of the tag, and ``attrs`` is an ``Attrs`` instance containing
the attribute names and values associated with the tag (excluding namespace
declarations):

.. code-block:: python

  START, (QName('p'), Attrs([(QName('class'), u'intro')])), pos

END
---
The closing tag of an element.

The ``data`` item of end events consists of just a ``QName`` instance
describing the qualified name of the tag:

.. code-block:: python

  END, QName('p'), pos

TEXT
----
Character data outside of elements and comments.

For text events, the ``data`` item should be a unicode object:

.. code-block:: python

  TEXT, u'Hello, world!', pos

START_NS
--------
The start of a namespace mapping, binding a namespace prefix to a URI.

The ``data`` item of this kind of event is a tuple of the form
``(prefix, uri)``, where ``prefix`` is the namespace prefix and ``uri`` is the
full URI to which the prefix is bound. Both should be unicode objects. If the
namespace is not bound to any prefix, the ``prefix`` item is an empty string:

.. code-block:: python

  START_NS, (u'svg', u'http://www.w3.org/2000/svg'), pos

END_NS
------
The end of a namespace mapping.

The ``data`` item of such events consists of only the namespace prefix (a
unicode object):

.. code-block:: python

  END_NS, u'svg', pos

DOCTYPE
-------
A document type declaration.

For this type of event, the ``data`` item is a tuple of the form
``(name, pubid, sysid)``, where ``name`` is the name of the root element,
``pubid`` is the public identifier of the DTD (or ``None``), and ``sysid`` is
the system identifier of the DTD (or ``None``):

.. code-block:: python

  DOCTYPE, (u'html', u'-//W3C//DTD XHTML 1.0 Transitional//EN',
            u'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'), pos

COMMENT
-------
A comment.

For such events, the ``data`` item is a unicode object containing all character
data between the comment delimiters:

.. code-block:: python

  COMMENT, u'Commented out', pos

PI
--
A processing instruction.

The ``data`` item is a tuple of the form ``(target, data)`` for processing
instructions, where ``target`` is the target of the PI (used to identify the
application by which the instruction should be processed), and ``data`` is text
following the target (excluding the terminating question mark):

.. code-block:: python

  PI, (u'php', u'echo "Yo" '), pos

START_CDATA
-----------
Marks the beginning of a ``CDATA`` section.

The ``data`` item for such events is always ``None``:

.. code-block:: python

  START_CDATA, None, pos

END_CDATA
---------
Marks the end of a ``CDATA`` section.

The ``data`` item for such events is always ``None``:

.. code-block:: python

  END_CDATA, None, pos