Version 3 (modified by cmlenz, 18 years ago)
Markup Streams
A stream is the common representation of markup as a sequence of events.
A stream can be obtained in a number of ways. It can be:
- the result of parsing XML or HTML text, or
- programmatically generated, or
- the result of selecting a subset of another stream using an XPath expression.
For example, the functions XML() and HTML() can be used to convert literal XML or HTML text to a markup stream:
>>> from markup import XML
>>> stream = XML('<p class="intro">Some text and '
...              '<a href="http://example.org/">a link</a>.'
...              '<br/></p>')
>>> stream
<markup.core.Stream object at 0x6bef0>
The stream is the result of parsing the text into events. Each event is a tuple of the form (kind, data, pos), where:
- kind defines what kind of event it is (such as the start of an element, text, a comment, etc.).
- data is the actual data associated with the event. How this looks depends on the event kind.
- pos is a (filename, lineno, column) tuple that describes where the event “comes from”.
>>> for kind, data, pos in stream:
...     print kind, `data`, pos
...
START (u'p', [(u'class', u'intro')]) ('<string>', 1, 0)
TEXT u'Some text and ' ('<string>', 1, 31)
START (u'a', [(u'href', u'http://example.org/')]) ('<string>', 1, 31)
TEXT u'a link' ('<string>', 1, 67)
END u'a' ('<string>', 1, 67)
TEXT u'.' ('<string>', 1, 72)
START (u'br', []) ('<string>', 1, 72)
END u'br' ('<string>', 1, 77)
END u'p' ('<string>', 1, 77)
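To make the event model concrete, here is a minimal sketch of how XML text could be turned into (kind, data, pos) tuples. It is not how the markup package is implemented: it is written for Python 3, uses only the standard library's expat bindings, and the function name parse_events is an assumption made up for this illustration. The positions it reports are whatever expat exposes when each handler runs, so they may differ from those shown above.

```python
import xml.parsers.expat

# Event kinds, named after the ones shown in the output above.
START, END, TEXT = 'START', 'END', 'TEXT'


def parse_events(text, filename='<string>'):
    """Parse XML text into a list of (kind, data, pos) tuples.

    Hypothetical helper for illustration only; not part of the markup package.
    """
    events = []
    parser = xml.parsers.expat.ParserCreate()
    # Ask expat to deliver adjacent character data as a single chunk.
    parser.buffer_text = True

    def pos():
        # Position reported by expat at the time the handler runs.
        return (filename, parser.CurrentLineNumber, parser.CurrentColumnNumber)

    def on_start(tag, attrs):
        # attrs arrives as a dict; store it as a list of (name, value) pairs.
        events.append((START, (tag, list(attrs.items())), pos()))

    def on_end(tag):
        events.append((END, tag, pos()))

    def on_text(data):
        events.append((TEXT, data, pos()))

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_text
    parser.Parse(text, True)
    return events


events = parse_events('<p class="intro">Some text and '
                      '<a href="http://example.org/">a link</a>.'
                      '<br/></p>')
for kind, data, pos in events:
    print(kind, repr(data), pos)
```

Note that the empty <br/> element still produces a START event followed by an END event, which matches the event list shown above.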
Serialization
The Stream class provides two methods for serializing a stream of events: serialize() and render(). The former is a generator that yields the output in chunks as Markup objects (which are basically unicode strings). The latter returns a single string, by default UTF-8 encoded.
Here's the output from serialize():
>>> for output in stream.serialize():
...     print `output`
...
<Markup u'<p class="intro">'>
<Markup u'Some text and '>
<Markup u'<a href="http://example.org/">'>
<Markup u'a link'>
<Markup u'</a>'>
<Markup u'.'>
<Markup u'<br/>'>
<Markup u'</p>'>
And here's the output from render():
>>> print stream.render()
<p class="intro">Some text and <a href="http://example.org/">a link</a>.<br/></p>
Both methods can be passed a method parameter that determines how exactly the events are serialized to text. This parameter can be either “xml” (the default) or “html”, or a subclass of the markup.output.Serializer class:
>>> print stream.render('html')
<p class="intro">Some text and <a href="http://example.org/">a link</a>.<br></p>
(Note how the <br> element isn't closed, which is the right thing to do for HTML.)
In addition, the render() method takes an encoding parameter, which defaults to “UTF-8”. If set to None, the result will be a unicode string.
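The relationship between serialize(), render(), the method parameter and the encoding parameter can be sketched in a few lines of plain Python 3. This is an illustration under stated assumptions, not the markup package's serializer: the functions below are stand-alone, the events are built by hand with pos left as None, and VOID_ELEMENTS is a small hypothetical subset of the HTML elements that take no closing tag.

```python
from html import escape

START, END, TEXT = 'START', 'END', 'TEXT'

# Assumption: a small subset of HTML elements that never take a closing tag.
VOID_ELEMENTS = {'br', 'hr', 'img', 'input', 'link', 'meta'}


def serialize(events, method='xml'):
    """Yield the serialized markup chunk by chunk (illustrative sketch)."""
    events = list(events)
    index = 0
    while index < len(events):
        kind, data, pos = events[index]
        if kind == START:
            tag, attrs = data
            attr_text = ''.join(' %s="%s"' % (name, escape(value, quote=True))
                                for name, value in attrs)
            # Is this START immediately followed by its own END?
            empty = (index + 1 < len(events)
                     and events[index + 1][0] == END
                     and events[index + 1][1] == tag)
            if method == 'xml' and empty:
                yield '<%s%s/>' % (tag, attr_text)
                index += 1  # skip the matching END event
            elif method == 'html' and empty and tag in VOID_ELEMENTS:
                yield '<%s%s>' % (tag, attr_text)
                index += 1  # HTML void elements get no closing tag
            else:
                yield '<%s%s>' % (tag, attr_text)
        elif kind == END:
            yield '</%s>' % data
        elif kind == TEXT:
            yield escape(data, quote=False)
        index += 1


def render(events, method='xml', encoding='utf-8'):
    """Join the chunks; encode unless encoding is None."""
    text = ''.join(serialize(events, method))
    return text if encoding is None else text.encode(encoding)


# A programmatically generated stream; positions are omitted (None).
events = [
    (START, ('p', [('class', 'intro')]), None),
    (TEXT, 'Some text and ', None),
    (START, ('a', [('href', 'http://example.org/')]), None),
    (TEXT, 'a link', None),
    (END, 'a', None),
    (TEXT, '.', None),
    (START, ('br', []), None),
    (END, 'br', None),
    (END, 'p', None),
]

print(render(events, encoding=None))
print(render(events, 'html', encoding=None))
```

Keeping serialize() as a generator lets a caller write chunks out as they are produced, while render() is the convenience wrapper for callers that want the whole result at once.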
Using XPath
XPath can be used to extract a specific subset of the stream via the select() method:
>>> substream = stream.select('a')
>>> substream
<markup.core.Stream object at 0x7118b0>
>>> print substream
<a href="http://example.org/">a link</a>
Often, streams cannot be reused: in the above example, the sub-stream is based on a generator. Once it has been serialized, it will have been fully consumed, and cannot be rendered again. To work around this, you can wrap such a “read-once” stream in a list:
>>> from markup import Stream
>>> substream = Stream(list(stream.select('a')))
>>> substream
<markup.core.Stream object at 0x7118b0>
>>> print substream
<a href="http://example.org/">a link</a>
>>> print substream.select('@href')
http://example.org/
>>> print substream.select('text()')
a link
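The “read-once” behaviour is a property of Python generators in general, which the following Python 3 sketch demonstrates. It does not use the markup package or a real XPath engine: select_element is a hypothetical helper that matches elements by tag name only, and the events are written out by hand with pos left as None.

```python
START, END, TEXT = 'START', 'END', 'TEXT'


def select_element(events, name):
    """Lazily yield the events of every element called `name`.

    Hypothetical stand-in for an XPath selection; matches by tag name only.
    """
    depth = 0  # nesting depth inside a matched element, 0 when outside
    for kind, data, pos in events:
        if kind == START and (depth or data[0] == name):
            depth += 1
        if depth:
            yield kind, data, pos
        if kind == END and depth:
            depth -= 1


events = [
    (START, ('p', [('class', 'intro')]), None),
    (TEXT, 'Some text and ', None),
    (START, ('a', [('href', 'http://example.org/')]), None),
    (TEXT, 'a link', None),
    (END, 'a', None),
    (TEXT, '.', None),
    (END, 'p', None),
]

# A generator-backed selection can be consumed only once.
substream = select_element(events, 'a')
first_pass = list(substream)
second_pass = list(substream)  # the generator is already exhausted

# Wrapping the selection in a list makes it reusable.
reusable = list(select_element(events, 'a'))

print(len(first_pass), len(second_pass), len(reusable))
```

The trade-off is memory: the list holds every selected event at once, whereas the generator produces them one at a time.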
See also: MarkupGuide