Edgewall Software

genshi.filters.html

Implementation of a number of stream filters.

HTMLFormFiller

A stream filter that can populate HTML forms from a dictionary of values.

>>> from genshi.filters.html import HTMLFormFiller
>>> from genshi.input import HTML
>>> html = HTML('''<form>
...   <p><input type="text" name="foo" /></p>
... </form>''', encoding='utf-8')
>>> filler = HTMLFormFiller(data={'foo': 'bar'})
>>> print(html | filler)
<form>
  <p><input type="text" name="foo" value="bar"/></p>
</form>

HTMLSanitizer

A filter that removes potentially dangerous HTML tags and attributes from the stream.

>>> from genshi import HTML
>>> from genshi.filters.html import HTMLSanitizer
>>> html = HTML('<div><script>alert(document.cookie)</script></div>', encoding='utf-8')
>>> print(html | HTMLSanitizer())
<div/>

The default set of safe tags and attributes can be modified when the filter is instantiated. For example, to allow inline style attributes, the following instantiation would work:

>>> html = HTML('<div style="background: #000"></div>', encoding='utf-8')
>>> sanitizer = HTMLSanitizer(safe_attrs=HTMLSanitizer.SAFE_ATTRS | set(['style']))
>>> print(html | sanitizer)
<div style="background: #000"/>

Note that even in this case, the filter does attempt to remove dangerous constructs from style attributes:

>>> html = HTML('<div style="background: url(javascript:void); color: #000"></div>', encoding='utf-8')
>>> print(html | sanitizer)
<div style="color: #000"/>

This handles HTML entities, Unicode escapes in CSS and JavaScript text, as well as a lot of other things. However, the style attribute is still excluded by default because it is very hard for such sanitizing to be completely safe, especially considering how much error recovery current web browsers perform.

The filter also does some basic filtering of CSS properties that may be used for typical phishing attacks. For more sophisticated filtering, this class provides a couple of hooks that can be overridden in subclasses.

warn: Note that this special processing of CSS is currently only applied to style attributes, not style elements.

is_safe_css(self, propname, value)

Determine whether the given CSS property declaration is to be considered safe for inclusion in the output.

param propname: the CSS property name
param value: the value of the property
return: whether the property value should be considered safe
rtype: bool
since: version 0.6
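
The kind of policy this hook can express is sketched below as a standalone function. It is only an illustration under stated assumptions: the SAFE_CSS allow-list and the negative-margin rule are chosen for this example and are not taken from Genshi's defaults. In Genshi itself the same logic would live in a method of an HTMLSanitizer subclass.

```python
# Hypothetical allow-list for this sketch; Genshi's real default set may differ.
SAFE_CSS = frozenset([
    "background", "background-color", "color",
    "margin", "margin-left", "width",
])


def is_safe_css(propname, value):
    """Return True if the declaration passes this sketch's policy."""
    name = propname.strip().lower()
    # Anything outside the allow-list is rejected outright.
    if name not in SAFE_CSS:
        return False
    # Assumed phishing heuristic: a negative margin can move content
    # on top of other parts of the page.
    if name.startswith("margin") and "-" in value:
        return False
    return True
```

With this policy, a declaration such as `position: absolute` is dropped because the property is not allow-listed, while `margin-left: -500px` is dropped because of its negative value.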

is_safe_elem(self, tag, attrs)

Determine whether the given element should be considered safe for inclusion in the output.

param tag: the tag name of the element
type tag: QName
param attrs: the element attributes
type attrs: Attrs
return: whether the element should be considered safe
rtype: bool
since: version 0.6
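
A minimal sketch of an element-level check follows. To stay self-contained it uses a plain string for the tag and a plain dict for the attributes instead of Genshi's QName and Attrs types, and both the SAFE_TAGS subset and the password-input rule are assumptions made for illustration.

```python
# Illustrative subset of allowed tags; not Genshi's actual default set.
SAFE_TAGS = frozenset(["a", "div", "form", "input", "p", "span"])


def is_safe_elem(tag, attrs):
    """Decide whether an element may appear in the output.

    tag   -- element name as a plain string (stand-in for QName)
    attrs -- dict mapping attribute names to values (stand-in for Attrs)
    """
    if tag not in SAFE_TAGS:
        return False
    # Assumed policy: never let user-supplied markup render a password
    # field, since it could be used to collect credentials.
    if tag == "input" and attrs.get("type", "").lower() == "password":
        return False
    return True
```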

is_safe_uri(self, uri)

Determine whether the given URI is to be considered safe for inclusion in the output.

The default implementation checks whether the scheme of the URI is in the set of allowed URI schemes (safe_schemes).

>>> sanitizer = HTMLSanitizer()
>>> sanitizer.is_safe_uri('http://example.org/')
True
>>> sanitizer.is_safe_uri('javascript:alert(document.cookie)')
False

param uri: the URI to check
return: True if the URI can be considered safe, False otherwise
rtype: bool
since: version 0.4.3
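
The scheme check described above can be approximated without Genshi as follows. This is a sketch, not the library's code: the SAFE_SCHEMES set, the treatment of scheme-less URIs as safe, and the stripping of non-alphanumeric characters from the scheme are all assumptions made for this example.

```python
# Assumed set of acceptable schemes for this sketch.
SAFE_SCHEMES = frozenset(["file", "ftp", "http", "https", "mailto"])


def is_safe_uri(uri):
    """Return True if the URI's scheme is allow-listed or absent."""
    # The fragment identifier plays no part in the scheme check.
    uri = uri.split("#", 1)[0]
    if ":" not in uri:
        # No scheme at all: treat it as a relative reference.
        return True
    scheme = uri.split(":", 1)[0]
    # Keep only alphanumerics so that variants such as "java\tscript:"
    # are compared as "javascript".
    scheme = "".join(ch for ch in scheme if ch.isalnum()).lower()
    return scheme in SAFE_SCHEMES
```

The check is deliberately conservative: a relative path that happens to contain a colon is rejected rather than guessed at.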

sanitize_css(self, text)

Remove potentially dangerous property declarations from CSS code.

In particular, properties using the CSS url() function with a scheme that is not considered safe are removed:

>>> sanitizer = HTMLSanitizer()
>>> sanitizer.sanitize_css(u'''
...   background: url(javascript:alert("foo"));
...   color: #000;
... ''')
[u'color: #000']

Also, the proprietary Internet Explorer function expression() is always stripped:

>>> sanitizer.sanitize_css(u'''
...   background: #fff;
...   color: #000;
...   width: e/**/xpression(alert("foo"));
... ''')
[u'background: #fff', u'color: #000']

param text: the CSS text; this is expected to be unicode and to not contain any character or numeric references
return: a list of declarations that are considered safe
rtype: list
since: version 0.4.3
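
The two rules illustrated in the doctests above (dropping url() values with an unsafe scheme, and dropping expression() even when it is split by a comment) can be sketched with the standard library alone. The regular expressions, the SAFE_SCHEMES set, and the helper name _uri_ok are assumptions of this example. It does not decode CSS Unicode escapes and should not be read as Genshi's implementation.

```python
import re

_COMMENT = re.compile(r"/\*.*?\*/", re.S)            # CSS comments
_EXPRESSION = re.compile(r"expression\s*\(", re.I)    # IE expression()
_URL = re.compile(r"url\s*\(\s*['\"]?([^)'\"]*)", re.I)

SAFE_SCHEMES = frozenset(["http", "https"])  # assumed for this sketch


def _uri_ok(uri):
    """Hypothetical helper: accept scheme-less or allow-listed URIs."""
    if ":" not in uri:
        return True
    return uri.split(":", 1)[0].strip().lower() in SAFE_SCHEMES


def sanitize_css(text):
    """Return the declarations from `text` that pass both checks."""
    # Remove comments first so "e/**/xpression(" becomes "expression(".
    text = _COMMENT.sub("", text)
    safe = []
    for decl in text.split(";"):
        decl = decl.strip()
        if ":" not in decl:
            continue  # empty or malformed declaration
        _, value = decl.split(":", 1)
        if _EXPRESSION.search(value):
            continue
        if any(not _uri_ok(m.group(1)) for m in _URL.finditer(value)):
            continue
        safe.append(decl)
    return safe
```

Stripping comments before matching is what defeats the `e/**/xpression` trick; running the pattern match on the raw text would miss it.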


See ApiDocs, Documentation, Documentation/filters.html

Last modified on Dec 10, 2015, 6:15:05 AM