Edgewall Software

Ticket #108 (new enhancement)

Opened 16 months ago

Last modified 13 months ago

[PATCH] Add HTML5 (WHATWG) support for input and output

Reported by: tbroyer Owned by: cmlenz
Priority: major Milestone:
Component: General Version: 0.3.6
Keywords: Cc:

Description

The attached patch adds an HTML5Parser and HTML5() function to genshi.input, based on html5lib (the import is done only when you instantiate the HTML5Parser), and an HTML5Serializer to genshi.output, using the algorithm from the HTML5 spec, except that PIs are serialized in SGML form rather than throwing an exception.

Attachments

html5_support.diff (8.9 kB) - added by tbroyer 16 months ago.
Adds HTML5 parsing and serialization support
html5_support.2.diff (12.9 kB) - added by tbroyer 16 months ago.
Revised patch (better handling of elements/attributes outside the HTML namespace in output) + adds an HTML5Template class

Change History

Changed 16 months ago by tbroyer

Adds HTML5 parsing and serialization support

  Changed 16 months ago by cmlenz

A couple questions:

  • What exactly do we gain by using html5lib instead of HTMLParser? I.e. are there cases where HTMLParser can't deal correctly with valid/well-formed HTML5 markup?
  • Can't the HTML5 serializer be merged into the existing HTML serializer? I.e. add the DOCTYPE and a couple of attributes. Again, are there any concrete cases where a slightly enhanced HTMLSerializer would produce output that isn't HTML5 compliant?

Thanks!

follow-up: ↓ 3   Changed 16 months ago by tbroyer

What exactly do we gain by using html5lib instead of HTMLParser? I.e. are there cases where HTMLParser can't deal correctly with valid/well-formed HTML5 markup?

html5lib follows the HTML5 spec for parsing, and HTML5 has been designed to match the behavior of current web browsers, so parsing HTML5 with html5lib will give you the same DOM as you'd get when opening the document in your web browser.

Some things html5lib does that HTMLParser doesn't:

  • html5lib reconstructs elements with omitted tags (e.g. <html>, <head> and <body> have both their start and end tags optional; html5lib will nevertheless generate trees with an <html> root and two children, <head> and <body>),
  • it moves some elements to where they should have gone (e.g. <meta> and <link> are moved into <head> even if found in <body>),
  • it fixes some other things in malformed HTML too (elements outside <td> in <table>s, elements inside <select>, etc.),
  • html5lib always recovers from bad markup and never throws an exception (unless you tell it to).
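To make the contrast concrete, here is a minimal sketch using the event-based HTMLParser from Python's standard library, on the assumption that it is representative of the HTMLParser being discussed. The TagRecorder class and the sample markup are made up for illustration. It only shows that such a parser reports the tags that are literally present in the source; nothing is synthesized or moved.

```python
from html.parser import HTMLParser


class TagRecorder(HTMLParser):
    """Hypothetical helper: records tag events exactly as they are reported."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


# A document that omits <html>, <head> and <body>, and has a <meta> after <p>.
source = '<title>Hi</title><p>Hello<meta charset="utf-8">'

recorder = TagRecorder()
recorder.feed(source)
recorder.close()

# Only the tags written in the source show up; <meta> stays where it was found.
seen = {tag for _, tag in recorder.events}
print(recorder.events)
print("html" in seen, "head" in seen, "body" in seen)
```

A parser following the HTML5 algorithm would instead be expected to produce a tree rooted at <html>, with <head> and <body> children, as described in the list above.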

The dependency on a new library (moreover an in-dev one) and the fact that this library doesn't allow parsing of HTML fragments are IMO good-enough reasons for having a separate HTML5Parser.

Can't the HTML5 serializer be merged into the existing HTML serializer? I.e. add the DOCTYPE and a couple of attributes. Again, are there any concrete cases where a slightly enhanced HTMLSerializer would produce output that isn't HTML5 compliant?

The proposed HTML5Serializer uses the HTML5 algorithm for serializing HTML (well, it appears that the submitted implementation is not compliant w.r.t. elements not in the HTML namespace; I'll revise my patch later this evening GMT/UTC). The goal is less to produce "valid HTML5" than to comply with the HTML5 spec. Again, the HTML5 spec is based on study and/or reverse-engineering of the behavior of the major current browsers (Internet Explorer, Firefox, Opera, Safari, etc.), so following the spec algorithm will give you the same markup as calling "innerHTML" from within your web browser.

in reply to: ↑ 2 ; follow-up: ↓ 4   Changed 16 months ago by cmlenz

Replying to tbroyer:

The proposed HTML5Serializer uses the HTML5 algorithm for serializing HTML (well, it appears that the submitted implementation is not compliant w.r.t. elements not in the HTML namespace; I'll revise my patch later this evening GMT/UTC). The goal is less to produce "valid HTML5" than to comply with the HTML5 spec. Again, the HTML5 spec is based on study and/or reverse-engineering of the behavior of the major current browsers (Internet Explorer, Firefox, Opera, Safari, etc.), so following the spec algorithm will give you the same markup as calling "innerHTML" from within your web browser.

Understood, but where exactly would that differ from the output generated by HTMLSerializer? Ignoring for a second the elements/attributes added in HTML5…

Changed 16 months ago by tbroyer

Revised patch (better handling of elements/attributes outside the HTML namespace in output) + adds an HTML5Template class

in reply to: ↑ 3   Changed 16 months ago by anonymous

Replying to cmlenz:

Understood, but where exactly would that differ from the output generated by HTMLSerializer? Ignoring for a second the elements/attributes added in HTML5…

Well, basically:

  • "empty elements" have their subtree ignored (even if it's not valid HTML, you could very well build a tree with a <link> element that has children; when serialized, those children will be ignored)
  • the list of "noescape elements" is not the same
  • boolean attributes are output using non-minimized form (selected="" or selected="selected")
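The three points above can be sketched with a toy serializer over xml.etree.ElementTree elements. This is not the code from the attached patch and not the full spec algorithm: the function names and the element sets are assumptions made for illustration, and the sets are deliberately shortened.

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative sets; the lists in the HTML5 spec are longer.
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}
RAW_TEXT_ELEMENTS = {"script", "style"}


def escape_text(text):
    """Escape character data for use as element content."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def escape_attr(value):
    """Escape an attribute value for use inside double quotes."""
    return value.replace("&", "&amp;").replace('"', "&quot;")


def serialize(elem):
    """Serialize one element and its subtree (simplified sketch)."""
    # Attributes are always written with a value, never minimized.
    attrs = "".join(
        ' %s="%s"' % (name, escape_attr(value))
        for name, value in elem.attrib.items()
    )
    start = "<%s%s>" % (elem.tag, attrs)
    if elem.tag in VOID_ELEMENTS:
        # "Empty" element: no end tag, and any children are ignored.
        return start
    raw = elem.tag in RAW_TEXT_ELEMENTS
    parts = [start]

    def emit(text):
        if text:
            # Text directly inside a "noescape" element is written as-is.
            parts.append(text if raw else escape_text(text))

    emit(elem.text)
    for child in elem:
        parts.append(serialize(child))
        emit(child.tail)
    parts.append("</%s>" % elem.tag)
    return "".join(parts)


link = ET.Element("link", {"rel": "stylesheet"})
ET.SubElement(link, "span").text = "ignored"

option = ET.Element("option", {"selected": ""})
option.text = "a < b"

script = ET.Element("script")
script.text = "if (a < b) {}"

print(serialize(link))    # <link rel="stylesheet">
print(serialize(option))  # <option selected="">a &lt; b</option>
print(serialize(script))  # <script>if (a < b) {}</script>
```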

Also, the current HTMLSerializer:

  • strips every element not in the (X)HTML namespace, whereas HTML5Serializer (the revised one attached above) copes with them
  • uses a WhitespaceFilter; there is no such thing as "preserve space elements" in the HTML5 serialization algorithm, so I believe the use of this filter should be left to the user, in the same way as HTMLSanitizer (passing it between the generate() and render() calls); it might have been necessary in the HTMLSerializer though because of the use of the NamespaceStripper, which would remove xml:space attributes…
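As an illustration of leaving whitespace handling to the user, here is a small stand-alone sketch of a filter applied between generating a stream and rendering it. It only mimics the shape of a Genshi stream, an iterable of (kind, data, pos) tuples; the constants, collapse_whitespace and render below are stand-ins written for this example, not Genshi's own classes, and the start-event data is simplified to a bare tag name.

```python
import re

# Stand-ins for stream event kinds; each event is a (kind, data, pos) tuple.
START, END, TEXT = "START", "END", "TEXT"


def collapse_whitespace(stream):
    """Hypothetical filter: collapse runs of whitespace in text events."""
    for kind, data, pos in stream:
        if kind == TEXT:
            data = re.sub(r"\s+", " ", data)
        yield kind, data, pos


def render(stream):
    """Toy renderer; it performs no whitespace handling of its own."""
    out = []
    for kind, data, _pos in stream:
        if kind == START:
            out.append("<%s>" % data)
        elif kind == END:
            out.append("</%s>" % data)
        else:
            out.append(data)
    return "".join(out)


events = [(START, "p", None), (TEXT, "a   \n  b", None), (END, "p", None)]

# The caller opts in to the filter; the renderer stays unaware of it.
print(render(collapse_whitespace(events)))  # <p>a b</p>
print(repr(render(events)))                 # whitespace left untouched
```

In Genshi itself, filters are typically chained onto a stream with the | operator before rendering, which is the pattern referred to above for HTMLSanitizer.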

Both serializers could very well be merged, though (maybe removing inheritance from XHTMLSerializer: HTMLSerializer actually only inherits class variables such as _EMPTY_ELEMS or _BOOLEAN_ATTRS, since it overrides everything else; and HTML5Serializer uses different values; in other words, it might be a good idea to replace the current HTMLSerializer implementation with the proposed HTML5Serializer one ;-) )


Please note that the revised patch now includes a genshi.template.html5.HTML5Template where any element or attribute whose name starts with py: or py_ is turned into the equivalent directive from the XML Template Language. Elements whose name starts with xi: or xi_ are turned into the equivalent XInclude directives. Only elements and attributes from the (X)HTML namespace or with no namespace are caught (in case you pass a Stream as the template source). I know of some people interested in such a thing. It allows you to work with designers whose tools do not generate XML.
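The name mapping described above could look roughly like the following. This is a sketch, not the code from the patch: resolve_name and PREFIXES are hypothetical names, and for brevity it does not distinguish elements from attributes, so the elements-only restriction on xi: is not enforced here. The two namespace URIs are the ones used by Genshi's XML templates and by XInclude.

```python
GENSHI_NS = "http://genshi.edgewall.org/"
XINCLUDE_NS = "http://www.w3.org/2001/XInclude"

# Pseudo-prefixes recognised in plain HTML names, mapped to real namespaces.
PREFIXES = {"py": GENSHI_NS, "xi": XINCLUDE_NS}


def resolve_name(name):
    """Return (namespace, local_name); namespace is None for ordinary names."""
    for sep in (":", "_"):
        prefix, found, local = name.partition(sep)
        if found and local and prefix in PREFIXES:
            return PREFIXES[prefix], local
    return None, name


print(resolve_name("py:if"))       # Genshi namespace, local name "if"
print(resolve_name("py_for"))      # Genshi namespace, local name "for"
print(resolve_name("xi:include"))  # XInclude namespace, local name "include"
print(resolve_name("class"))       # left alone
```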


FYI, html5lib includes a "Liberal XML" parser to cope with non-well-formed XML found in the wild; it might be interesting to have a LiberalXMLParser in genshi.output (e.g. if you want to parse RSS or Atom feeds from other sites). This "Liberal XML" parser was written by Sam Ruby for Planet Venus.

  Changed 16 months ago by cmlenz

  • milestone changed from 0.4 to 0.5

  Changed 16 months ago by t.broyer

I've decided to make this "add-on project" live on its own at http://code.google.com/p/genshihtml5

This does not mean, however, that HTML5 support can't be integrated into Genshi in a later release.

  Changed 13 months ago by cmlenz

  • milestone 0.5 deleted

As this is available as a separate project for now, and I'm not yet sure I want to incorporate all that functionality into Genshi proper, I'm clearing the milestone.
