Markup Streams
A markup stream is the common representation of markup in Genshi: a sequence of events.
1 Basics
A stream can be obtained in a number of ways. It can be:
- the result of parsing XML or HTML text, or
- the result of selecting a subset of another stream using XPath, or
- programmatically generated.
For example, the functions XML() and HTML() can be used to convert literal XML or HTML text to a markup stream:
>>> from genshi import XML
>>> stream = XML('<p class="intro">Some text and '
...              '<a href="http://example.org/">a link</a>.'
...              '<br/></p>')
>>> stream
<genshi.core.Stream object at ...>
The stream is the result of parsing the text into events. Each event is a tuple of the form (kind, data, pos), where:
- kind defines what kind of event it is (such as the start of an element, text, a comment, etc.).
- data is the actual data associated with the event. How this looks depends on the event kind (see Event Kinds).
- pos is a (filename, lineno, column) tuple that describes where the event “comes from”.
>>> for kind, data, pos in stream:
...     print kind, `data`, pos
...
START (QName(u'p'), Attrs([(QName(u'class'), u'intro')])) (None, 1, 0)
TEXT u'Some text and ' (None, 1, 17)
START (QName(u'a'), Attrs([(QName(u'href'), u'http://example.org/')])) (None, 1, 31)
TEXT u'a link' (None, 1, 61)
END QName(u'a') (None, 1, 67)
TEXT u'.' (None, 1, 71)
START (QName(u'br'), Attrs()) (None, 1, 72)
END QName(u'br') (None, 1, 77)
END QName(u'p') (None, 1, 77)
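Because every event is a plain three-item tuple, ordinary Python tools work on streams. The following is a minimal sketch that tallies events by kind. To stay runnable without Genshi installed, it uses a hand-written list of tuples in which plain strings stand in for the kind constants, QName and Attrs objects that a real stream would carry; count_kinds is a hypothetical helper, not part of Genshi.

```python
from collections import Counter

# Hand-written stand-in for the parsed stream shown above.
# Assumption: plain strings replace Genshi's kind constants, QName and Attrs.
events = [
    ('START', ('p', [('class', 'intro')]), (None, 1, 0)),
    ('TEXT', 'Some text and ', (None, 1, 17)),
    ('START', ('a', [('href', 'http://example.org/')]), (None, 1, 31)),
    ('TEXT', 'a link', (None, 1, 61)),
    ('END', 'a', (None, 1, 67)),
    ('TEXT', '.', (None, 1, 71)),
    ('START', ('br', []), (None, 1, 72)),
    ('END', 'br', (None, 1, 77)),
    ('END', 'p', (None, 1, 77)),
]


def count_kinds(events):
    """Return a mapping of event kind to number of occurrences."""
    return Counter(kind for kind, data, pos in events)


kind_counts = count_kinds(events)
```

In a well-formed stream the START and END counts match, which makes this a cheap sanity check while developing filters.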
2 Filtering
One important feature of markup streams is that you can apply filters to the stream, either filters that come with Genshi, or your own custom filters.
A filter is simply a callable that accepts the stream as a parameter and returns the filtered stream:
def noop(stream):
    """A filter that doesn't actually do anything with the stream."""
    for kind, data, pos in stream:
        yield kind, data, pos
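A filter that does real work follows the same shape as noop. As a rough sketch, the hypothetical strip_comments filter below drops comment events and passes everything else through. It compares kinds against plain strings and runs on hand-written tuples, which is an assumption made so the snippet works without Genshi; with a real stream you would compare against the kind constants that Genshi provides.

```python
def strip_comments(stream):
    """A filter that removes comment events from the stream."""
    for kind, data, pos in stream:
        # Assumption: kinds are represented by plain strings in this sketch.
        if kind != 'COMMENT':
            yield kind, data, pos


# Stand-in events; pos is None because no source location is tracked here.
events = [
    ('START', ('p', []), None),
    ('COMMENT', 'Commented out', None),
    ('TEXT', 'Hello, world!', None),
    ('END', 'p', None),
]

filtered = list(strip_comments(events))
```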
Filters can be applied in a number of ways. The simplest is to just call the filter directly:
stream = noop(stream)
The Stream class also provides a filter() method, which takes an arbitrary number of filter callables and applies them all:
stream = stream.filter(noop)
Finally, filters can also be applied using the bitwise or operator (|), which allows a syntax similar to pipes on Unix shells:
stream = stream | noop
One example of a filter included with Genshi is the HTMLSanitizer in genshi.filters. It processes a stream of HTML markup and strips out any potentially dangerous constructs, such as JavaScript event handlers. HTMLSanitizer is not a function, but rather a class that implements __call__, which means instances of the class are callable:
stream = stream | HTMLSanitizer()
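Writing a filter as a class with __call__ is useful when the filter needs configuration. The sketch below shows the pattern with a hypothetical TextReplacer class (not part of Genshi) that is configured once and then applied like any other filter. Plain tuples and string kinds stand in for a real stream.

```python
class TextReplacer(object):
    """Hypothetical configurable filter: replaces a substring in text events."""

    def __init__(self, old, new):
        self.old = old
        self.new = new

    def __call__(self, stream):
        for kind, data, pos in stream:
            # Assumption: text events use the string 'TEXT' as their kind here.
            if kind == 'TEXT':
                data = data.replace(self.old, self.new)
            yield kind, data, pos


events = [
    ('START', ('p', []), None),
    ('TEXT', 'Hello, world!', None),
    ('END', 'p', None),
]

# The instance is callable, so it is applied exactly like a function filter.
replace_world = TextReplacer('world', 'Genshi')
replaced = list(replace_world(events))
```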
Both the filter() method and the pipe operator allow easy chaining of filters:
from genshi.filters import HTMLSanitizer
stream = stream.filter(noop, HTMLSanitizer())
That is equivalent to:
stream = stream | noop | HTMLSanitizer()
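To see why the three spellings are interchangeable, it can help to look at a toy model. The MiniStream class below is a simplified sketch and not Genshi's Stream implementation: its | operator calls the filter with the stream and wraps the result, and its filter() method applies | once per filter. The two filters and the event tuples are hypothetical stand-ins.

```python
from functools import reduce


class MiniStream(object):
    """Toy wrapper around an iterable of events (not Genshi's Stream)."""

    def __init__(self, events):
        self.events = events

    def __iter__(self):
        return iter(self.events)

    def __or__(self, function):
        # Piping into a filter means calling it and wrapping the result.
        return MiniStream(function(self))

    def filter(self, *filters):
        # Apply each filter in turn, feeding the result into the next one.
        return reduce(lambda stream, function: stream | function, filters, self)


def drop_comments(stream):
    for kind, data, pos in stream:
        if kind != 'COMMENT':
            yield kind, data, pos


def upper_text(stream):
    for kind, data, pos in stream:
        if kind == 'TEXT':
            data = data.upper()
        yield kind, data, pos


events = [('TEXT', 'hi', None), ('COMMENT', 'note', None)]

via_method = list(MiniStream(events).filter(drop_comments, upper_text))
via_pipe = list(MiniStream(events) | drop_comments | upper_text)
via_calls = list(upper_text(drop_comments(events)))
```

All three results contain the same events, because each form ends up calling the filters left to right.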
For more information about the built-in filters, see Stream Filters.
3 Serialization
Serialization means producing some kind of textual output from a stream of events, which you'll need when you want to transmit or store the results of generating or otherwise processing markup.
The Stream class provides two methods for serialization: serialize() and render(). The former is a generator that yields the output in chunks, each of which is a Markup object (basically a unicode string that is considered safe for output on the web). The latter returns a single string, by default UTF-8 encoded.
Here's the output from serialize():
>>> for output in stream.serialize():
...     print `output`
...
<Markup u'<p class="intro">'>
<Markup u'Some text and '>
<Markup u'<a href="http://example.org/">'>
<Markup u'a link'>
<Markup u'</a>'>
<Markup u'.'>
<Markup u'<br/>'>
<Markup u'</p>'>
And here's the output from render():
>>> print stream.render()
<p class="intro">Some text and <a href="http://example.org/">a link</a>.<br/></p>
Both methods can be passed a method parameter that determines how exactly the events are serialized to text. This parameter can be either “xml” (the default), “xhtml”, “html”, “text”, or a custom serializer class:
>>> print stream.render('html')
<p class="intro">Some text and <a href="http://example.org/">a link</a>.<br></p>
Note how the <br> element isn't closed, which is the right thing to do for HTML.
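The difference between the two methods can be illustrated with a toy serializer. The sketch below is not how Genshi's serializers are implemented: toy_serialize and VOID_ELEMENTS are hypothetical names, only START, END and TEXT events are handled, no escaping is done, and plain strings stand in for Genshi's kind constants and QName objects.

```python
# Assumption: a small, incomplete set of HTML elements that never take an end tag.
VOID_ELEMENTS = frozenset(['br', 'hr', 'img', 'input', 'link', 'meta'])


def toy_serialize(events, method='xml'):
    """Yield text chunks for START, END and TEXT events (no escaping)."""
    events = list(events)
    skip_next_end = False
    for index, (kind, data, pos) in enumerate(events):
        if kind == 'START':
            tag, attrs = data
            attr_text = ''.join(' %s="%s"' % pair for pair in attrs)
            followed_by_end = (index + 1 < len(events)
                               and events[index + 1][0] == 'END')
            if method == 'html' and tag in VOID_ELEMENTS:
                # HTML: void elements are written without a closing tag.
                skip_next_end = followed_by_end
                yield '<%s%s>' % (tag, attr_text)
            elif method == 'xml' and followed_by_end:
                # XML: an empty element collapses to the self-closing form.
                skip_next_end = True
                yield '<%s%s/>' % (tag, attr_text)
            else:
                yield '<%s%s>' % (tag, attr_text)
        elif kind == 'END':
            if skip_next_end:
                skip_next_end = False
            else:
                yield '</%s>' % data
        elif kind == 'TEXT':
            yield data


events = [
    ('START', ('p', [('class', 'intro')]), None),
    ('TEXT', 'Some text.', None),
    ('START', ('br', []), None),
    ('END', 'br', None),
    ('END', 'p', None),
]

as_xml = ''.join(toy_serialize(events, 'xml'))
as_html = ''.join(toy_serialize(events, 'html'))
```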
In addition, the render() method takes an encoding parameter, which defaults to “UTF-8”. If set to None, the result will be a unicode string.
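Conceptually, render() joins the chunks that serialize() yields and then encodes the result. The following sketch models that relationship with a hypothetical join_and_encode function operating on a hand-written list of chunks; it is an illustration of the idea, not Genshi's code.

```python
def join_and_encode(chunks, encoding='utf-8'):
    """Join serialized chunks; encode the result unless encoding is None."""
    text = u''.join(chunks)
    if encoding is None:
        return text
    return text.encode(encoding)


# Stand-in for the chunks a serializer would yield.
chunks = [u'<p>', u'caf\xe9', u'</p>']

as_bytes = join_and_encode(chunks)                 # UTF-8 encoded byte string
as_text = join_and_encode(chunks, encoding=None)   # unicode string
```

Passing None is the natural choice when the result is handed to a framework that performs its own encoding.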
The different serializer classes in genshi.output can also be used directly:
>>> from genshi.filters import HTMLSanitizer
>>> from genshi.output import TextSerializer
>>> print ''.join(TextSerializer()(HTMLSanitizer()(stream)))
Some text and a link.
The pipe operator allows a nicer syntax:
>>> print stream | HTMLSanitizer() | TextSerializer()
Some text and a link.
3.1 Serialization Options
Both serialize() and render() support additional keyword arguments that are passed through to the initializer of the serializer class. The following options are supported by the built-in serializers:
- strip_whitespace
Whether the serializer should remove trailing spaces and empty lines. Defaults to True.
(This option is not available for serialization to plain text.)
- doctype
A (name, pubid, sysid) tuple defining the name, public identifier, and system identifier of a DOCTYPE declaration to prepend to the generated output. If provided, this declaration will override any DOCTYPE declaration in the stream.
(This option is not available for serialization to plain text.)
- namespace_prefixes
The namespace prefixes to use for namespaces that are not bound to a prefix in the stream itself.
(This option is not available for serialization to HTML or plain text.)
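The pass-through of keyword arguments can be pictured as follows. This is a simplified sketch under stated assumptions: ToySerializer and render_with are hypothetical names, the toy serializer understands only a doctype option with both identifiers present, and it emits text events without any markup.

```python
class ToySerializer(object):
    """Hypothetical serializer that only knows the doctype option."""

    def __init__(self, doctype=None):
        self.doctype = doctype

    def __call__(self, events):
        if self.doctype is not None:
            # Assumption: both pubid and sysid are given (neither is None).
            name, pubid, sysid = self.doctype
            yield '<!DOCTYPE %s PUBLIC "%s" "%s">\n' % (name, pubid, sysid)
        for kind, data, pos in events:
            if kind == 'TEXT':
                yield data


def render_with(events, serializer_class=ToySerializer, **options):
    """Forward keyword arguments unchanged to the serializer's initializer."""
    serializer = serializer_class(**options)
    return ''.join(serializer(events))


events = [('TEXT', 'Hello', None)]
xhtml_strict = ('html', '-//W3C//DTD XHTML 1.0 Strict//EN',
                'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd')

plain = render_with(events)
with_doctype = render_with(events, doctype=xhtml_strict)
```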
4 Using XPath
XPath can be used to extract a specific subset of the stream via the select() method:
>>> substream = stream.select('a')
>>> substream
<genshi.core.Stream object at ...>
>>> print substream
<a href="http://example.org/">a link</a>
Often, streams cannot be reused: in the above example, the sub-stream is based on a generator. Once it has been serialized, it will have been fully consumed, and cannot be rendered again. To work around this, you can wrap such a stream in a list:
>>> from genshi import Stream
>>> substream = Stream(list(stream.select('a')))
>>> substream
<genshi.core.Stream object at ...>
>>> print substream
<a href="http://example.org/">a link</a>
>>> print substream.select('@href')
http://example.org/
>>> print substream.select('text()')
a link
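The underlying behaviour is that of any Python generator, and can be reproduced without Genshi. In the sketch below, select_text is a hypothetical generator function standing in for a generator-based sub-stream: the first pass over it yields events, the second yields nothing, while a list built from it can be iterated repeatedly.

```python
def select_text(events):
    """Generator-based stand-in for a sub-stream: yields only text events."""
    for kind, data, pos in events:
        if kind == 'TEXT':
            yield kind, data, pos


events = [
    ('START', ('a', []), None),
    ('TEXT', 'a link', None),
    ('END', 'a', None),
]

one_shot = select_text(events)
first_pass = list(one_shot)
second_pass = list(one_shot)   # the generator is already exhausted

reusable = list(select_text(events))
first_again = list(reusable)
second_again = list(reusable)  # a list can be iterated any number of times
```

The cost of the list is memory: all events are held at once, which is usually acceptable for small fragments.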
See Using XPath in Genshi for more information about the XPath support in Genshi.
5 Event Kinds
Every event in a stream is of one of several kinds, which also determines what the data item of the event tuple looks like. The different kinds of events are documented below.
Note
The data item is generally immutable. If the data is to be modified when processing a stream, it must be replaced by a new tuple. Effectively, this means the entire event tuple is immutable.
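In practice this means a filter builds a new data value and yields a new event instead of changing anything in place. The sketch below shows the idea with a hypothetical rename_tag function that renames an element; plain strings stand in for kind constants and QName objects, and attributes are kept in a plain list.

```python
def rename_tag(stream, old, new):
    """Yield events with elements named `old` renamed to `new`."""
    for kind, data, pos in stream:
        if kind == 'START':
            tag, attrs = data
            if tag == old:
                # Build a new data tuple rather than modifying the old one.
                data = (new, attrs)
        elif kind == 'END' and data == old:
            data = new
        yield kind, data, pos


events = [
    ('START', ('b', []), None),
    ('TEXT', 'bold', None),
    ('END', 'b', None),
]

renamed = list(rename_tag(events, 'b', 'strong'))
```

The original events are left untouched, so the same source can safely be fed to other filters.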
5.1 START
The opening tag of an element.
For this kind of event, the data item is a tuple of the form (tagname, attrs), where tagname is a QName instance describing the qualified name of the tag, and attrs is an Attrs instance containing the attribute names and values associated with the tag (excluding namespace declarations):
START, (QName(u'p'), Attrs([(u'class', u'intro')])), pos
5.2 END
The closing tag of an element.
The data item of end events consists of just a QName instance describing the qualified name of the tag:
END, QName(u'p'), pos
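Since START and END events always come in matching pairs, a filter can track element nesting with a simple counter. The following sketch computes the deepest nesting level of a stream; max_depth is a hypothetical helper, and plain strings stand in for Genshi's kind constants and QName objects.

```python
def max_depth(events):
    """Return the deepest element nesting level found in the events."""
    depth = 0
    deepest = 0
    for kind, data, pos in events:
        if kind == 'START':
            depth += 1
            deepest = max(deepest, depth)
        elif kind == 'END':
            depth -= 1
    return deepest


events = [
    ('START', ('p', []), None),
    ('TEXT', 'Some text and ', None),
    ('START', ('a', []), None),
    ('TEXT', 'a link', None),
    ('END', 'a', None),
    ('START', ('br', []), None),
    ('END', 'br', None),
    ('END', 'p', None),
]

deepest_level = max_depth(events)
```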
5.3 TEXT
Character data, that is, any text that is not part of a tag or a comment.
For text events, the data item should be a unicode object:
TEXT, u'Hello, world!', pos
5.4 START_NS
The start of a namespace mapping, binding a namespace prefix to a URI.
The data item of this kind of event is a tuple of the form (prefix, uri), where prefix is the namespace prefix and uri is the full URI to which the prefix is bound. Both should be unicode objects. If the namespace is not bound to any prefix, the prefix item is an empty string:
START_NS, (u'svg', u'http://www.w3.org/2000/svg'), pos
5.5 END_NS
The end of a namespace mapping.
The data item of such events consists of only the namespace prefix (a unicode object):
END_NS, u'svg', pos
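A filter that needs to know which prefixes are bound at a given point can maintain a stack per prefix, pushing on START_NS and popping on END_NS. The sketch below records the active bindings at each start tag. It is written under assumptions: bindings_at_start_tags is a hypothetical helper, plain strings stand in for Genshi objects, and the sample data assumes that START_NS precedes the START event of the declaring element and END_NS follows its END event.

```python
def bindings_at_start_tags(events):
    """Return (tagname, {prefix: uri}) for every START event."""
    stacks = {}   # prefix -> list of URIs, innermost binding last
    result = []
    for kind, data, pos in events:
        if kind == 'START_NS':
            prefix, uri = data
            stacks.setdefault(prefix, []).append(uri)
        elif kind == 'END_NS':
            stacks[data].pop()
        elif kind == 'START':
            tag, attrs = data
            active = dict((prefix, uris[-1])
                          for prefix, uris in stacks.items() if uris)
            result.append((tag, active))
    return result


SVG = 'http://www.w3.org/2000/svg'
events = [
    ('START_NS', ('svg', SVG), None),
    ('START', ('rect', []), None),
    ('END', 'rect', None),
    ('END_NS', 'svg', None),
    ('START', ('p', []), None),
    ('END', 'p', None),
]

bindings = bindings_at_start_tags(events)
```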
5.6 DOCTYPE
A document type declaration.
For this type of event, the data item is a tuple of the form (name, pubid, sysid), where name is the name of the root element, pubid is the public identifier of the DTD (or None), and sysid is the system identifier of the DTD (or None):
DOCTYPE, (u'html', u'-//W3C//DTD XHTML 1.0 Transitional//EN', \
          u'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'), pos
5.7 COMMENT
A comment.
For such events, the data item is a unicode object containing all character data between the comment delimiters:
COMMENT, u'Commented out', pos
5.8 PI
A processing instruction.
The data item is a tuple of the form (target, data) for processing instructions, where target is the target of the PI (used to identify the application by which the instruction should be processed), and data is text following the target (excluding the terminating question mark):
PI, (u'php', u'echo "Yo" '), pos
5.9 START_CDATA
Marks the beginning of a CDATA section.
The data item for such events is always None:
START_CDATA, None, pos
5.10 END_CDATA
Marks the end of a CDATA section.
The data item for such events is always None:
END_CDATA, None, pos
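Because the two CDATA events carry no data of their own, they act purely as markers around other events. As a sketch, the hypothetical cdata_text helper below collects text that appears between the markers. It assumes that the content of a CDATA section arrives as ordinary TEXT events between START_CDATA and END_CDATA, and it uses plain strings in place of Genshi's kind constants.

```python
def cdata_text(events):
    """Collect the text that appears inside CDATA sections."""
    inside = False
    pieces = []
    for kind, data, pos in events:
        if kind == 'START_CDATA':
            inside = True
        elif kind == 'END_CDATA':
            inside = False
        elif kind == 'TEXT' and inside:
            # Assumption: CDATA content is delivered as TEXT events.
            pieces.append(data)
    return ''.join(pieces)


events = [
    ('TEXT', 'before ', None),
    ('START_CDATA', None, None),
    ('TEXT', 'a < b', None),
    ('END_CDATA', None, None),
    ('TEXT', ' after', None),
]

collected = cdata_text(events)
```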
See also: genshi.core, Documentation