Edgewall Software

Opened 15 years ago

Last modified 8 years ago

#108 new enhancement

[PATCH] Add HTML5 (WHATWG) support for input and output

Reported by: tbroyer
Owned by: cmlenz
Priority: major
Milestone:
Component: General
Version: 0.3.6
Keywords:
Cc: cboos, andref.dias@…, leho@…

Description

The attached patch adds an HTML5Parser and HTML5() function to genshi.input, based on html5lib (the import is done only when you instantiate the HTML5Parser), and an HTML5Serializer to genshi.output, using the algorithm from the HTML5 spec, except that PIs are serialized in SGML form rather than throwing an exception.

Attachments (2)

html5_support.diff (8.9 KB) - added by tbroyer 15 years ago.
Adds HTML5 parsing and serialization support
html5_support.2.diff (12.9 KB) - added by tbroyer 15 years ago.
Revised patch (better handling of elements/attributes outside the HTML namespace in output) + adds an HTML5Template class


Change History (16)

Changed 15 years ago by tbroyer

Adds HTML5 parsing and serialization support

comment:1 Changed 15 years ago by cmlenz

A couple questions:

  • What exactly do we gain by using html5lib instead of HTMLParser? I.e. are there cases where HTMLParser can't deal correctly with valid/well-formed HTML5 markup?
  • Can't the HTML5 serializer be merged into the existing HTML serializer? I.e. add the DOCTYPE and a couple of attributes. Again, are there any concrete cases where a slightly enhanced HTMLSerializer would produce output that isn't HTML5 compliant?

Thanks!

comment:2 follow-ups: Changed 15 years ago by tbroyer

What exactly do we gain by using html5lib instead of HTMLParser? I.e. are there cases where HTMLParser can't deal correctly with valid/well-formed HTML5 markup?

html5lib follows the HTML5 spec for parsing, and HTML5 has been designed to follow the behavior of current web browsers, so parsing HTML5 with html5lib will give you the same DOM as you'd have when opening the document in your web browser.

Some things html5lib does that HTMLParser doesn't:

  • html5lib reconstructs elements with omitted tags (e.g. <html>, <head> and <body> have both their start and end tags optional; html5lib will however generate trees with an <html> root and two children <head> and <body>),
  • it moves some elements to where they should have gone (e.g. <meta> and <link> are moved into <head> even if found in <body>)
  • it fixes some other things too in malformed HTML (elements outside <td> in <table>s, elements inside <select>, etc.)
  • html5lib always recovers from bad markup and never throws an exception (except if you tell it to)
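
To make the first point concrete, here is a minimal sketch of what a tokenizer-level parser reports for markup with omitted tags. It uses Python 3's standard-library html.parser.HTMLParser as a stand-in (an assumption; it is not Genshi's own parser class, and html5lib is not called here). EventCollector and tokenize are hypothetical helper names.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Record start/end/data events exactly as the tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def tokenize(markup):
    """Feed markup to the collector and return the recorded events."""
    collector = EventCollector()
    collector.feed(markup)
    collector.close()
    return collector.events


# <html>, <head>, <body> and the </p> end tag are all omitted in the input.
events = tokenize("<title>Demo</title><p>Hello")
start_tags = [name for kind, name in events if kind == "start"]
print(start_tags)
```

Only the tags physically present in the input show up: there are no html, head or body events and no end event for p. A parser following the HTML5 tree-construction rules would supply those elements itself.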

The dependency on a new library (moreover an in-dev one) and the fact that this library doesn't allow parsing of HTML fragments are IMO good-enough reasons for having a separate HTML5Parser.

Can't the HTML5 serializer be merged into the existing HTML serializer? I.e. add the DOCTYPE and a couple of attributes. Again, are there any concrete cases where a slightly enhanced HTMLSerializer would produce output that isn't HTML5 compliant?

The proposed HTML5Serializer uses the HTML5 algorithm for serializing HTML (well, it appears that the submitted implementation is not compliant w.r.t. elements not in the HTML namespace; I'll revise my patch later this evening GMT/UTC). The goal is less to produce "valid HTML5" than to comply with the HTML5 spec. Again, the HTML5 spec is based on study and/or reverse-engineering of the behavior of major current browsers (Internet Explorer, Firefox, Opera, Safari, etc.), so following the spec algorithm will give you the same markup as calling "innerHTML" from within your web browser.

comment:3 in reply to: ↑ 2 ; follow-up: Changed 15 years ago by cmlenz

Replying to tbroyer:

The proposed HTML5Serializer uses the HTML5 algorithm for serializing HTML (well, it appears that the submitted implementation is not compliant w.r.t. elements not in the HTML namespace; I'll revise my patch later this evening GMT/UTC). The goal is less to produce "valid HTML5" than to comply with the HTML5 spec. Again, the HTML5 spec is based on study and/or reverse-engineering of the behavior of major current browsers (Internet Explorer, Firefox, Opera, Safari, etc.), so following the spec algorithm will give you the same markup as calling "innerHTML" from within your web browser.

Understood, but where exactly would that differ from the output generated by HTMLSerializer? Ignoring for a second the elements/attributes added in HTML5…

Changed 15 years ago by tbroyer

Revised patch (better handling of elements/attributes outside the HTML namespace in output) + adds an HTML5Template class

comment:4 in reply to: ↑ 3 Changed 15 years ago by anonymous

Replying to cmlenz:

Understood, but where exactly would that differ from the output generated by HTMLSerializer? Ignoring for a second the elements/attributes added in HTML5…

Well, basically:

  • "empty elements" have their subtree ignored (even if it's not valid HTML, you could very well build a tree with a <link> element that has children; when serialized, those children will be ignored)
  • the list of "noescape elements" is not the same
  • boolean attributes are output using non-minimized form (selected="" or selected="selected")
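
The three points above can be sketched over a plain ElementTree tree. This is only an illustration under stated assumptions, not the code from the attached patch and not the full HTML5 serialization algorithm: the VOID_ELEMENTS and RAW_TEXT_ELEMENTS sets are assumed lists, serialize is a hypothetical helper, and namespaces, comments and PIs are ignored.

```python
import xml.etree.ElementTree as ET
from html import escape

# Assumed element lists for illustration; the patch may use different ones.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}
RAW_TEXT_ELEMENTS = {"script", "style"}  # the "noescape" elements


def serialize(elem):
    """Serialize an ElementTree element (without its tail) as HTML text."""
    out = ["<" + elem.tag]
    for name, value in elem.attrib.items():
        # Attributes are always written as name="value", never minimized.
        out.append(' %s="%s"' % (name, escape(value, quote=True)))
    out.append(">")
    if elem.tag in VOID_ELEMENTS:
        # No end tag, and any text or children in the tree are dropped.
        return "".join(out)
    raw = elem.tag in RAW_TEXT_ELEMENTS
    if elem.text:
        out.append(elem.text if raw else escape(elem.text, quote=False))
    for child in elem:
        out.append(serialize(child))
        if child.tail:
            out.append(child.tail if raw else escape(child.tail, quote=False))
    out.append("</%s>" % elem.tag)
    return "".join(out)


# A <link> with a child element: not valid HTML, but a tree can contain it.
link = ET.Element("link", {"rel": "stylesheet"})
ET.SubElement(link, "b").text = "ignored"
print(serialize(link))

option = ET.Element("option", {"selected": ""})
option.text = "a < b"
print(serialize(option))
```

The sketch recurses for brevity, so it inherits the depth limit of the Python call stack.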

Also, the current HTMLSerializer:

  • strips every element not in the (X)HTML namespace whereas HTML5Serializer (the revised one attached above) copes with them
  • uses a WhitespaceFilter; there is no such thing as "preserve space elements" in the HTML5 serialization algorithm, so I believe the use of this filter should be left to the user in the same way as HTMLSanitizer (passing it between the generate() and render() calls); it might have been necessary in the HTMLSerializer though because of the use of the NamespaceStripper, which would remove xml:space attributes…

Both serializers could very well be merged, though (maybe removing inheritance from XHTMLSerializer: HTMLSerializer actually only inherits class variables such as _EMPTY_ELEMS or _BOOLEAN_ATTRS, since it overrides everything else; and HTML5Serializer uses different values; in other words, it might be a good idea to replace the current HTMLSerializer implementation with the proposed HTML5Serializer one ;-) )


Please note that the revised patch now includes a genshi.template.html5.HTML5Template where any element or attribute whose name starts with py: or py_ is turned into the equivalent directive from the XML Template Language. Elements whose name starts with xi: or xi_ are turned into the equivalent XInclude directives. Only elements and attributes from the (X)HTML namespace or with no namespace are caught (in case you pass a Stream as the template source). I know of some people interested in such a thing. It allows you to work with designers whose tools do not generate XML.
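
As a rough sketch of the name mapping described above (not the patch's actual implementation), the prefix handling could look like this. The qualify helper and its is_attribute parameter are hypothetical; the two namespace URIs are the ones Genshi uses for template directives and the W3C uses for XInclude.

```python
GENSHI_NS = "http://genshi.edgewall.org/"
XINCLUDE_NS = "http://www.w3.org/2001/XInclude"


def qualify(name, is_attribute=False):
    """Map a prefixed HTML name to (namespace, local name).

    Names starting with "py:" or "py_" map to the Genshi directive
    namespace for elements and attributes alike; "xi:" and "xi_" map to
    XInclude for elements only. Anything else is returned unchanged with
    a namespace of None.
    """
    prefixes = [("py", GENSHI_NS)]
    if not is_attribute:
        prefixes.append(("xi", XINCLUDE_NS))
    for prefix, namespace in prefixes:
        for separator in (":", "_"):
            lead = prefix + separator
            # Require a non-empty local part after the prefix.
            if name.startswith(lead) and len(name) > len(lead):
                return namespace, name[len(lead):]
    return None, name


print(qualify("py_if", is_attribute=True))
print(qualify("xi:include"))
print(qualify("python"))
```

With a mapping like this, an HTML attribute such as py_if could be handled the same way as py:if in an XML template, which is what lets tools that cannot emit namespaced XML still produce usable templates.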


FYI, html5lib includes a "Liberal XML" parser to cope with non-well-formed XML found in the wild; it might be interesting to have a LiberalXMLParser in genshi.input (e.g. if you want to parse RSS or Atom feeds from other sites). This "Liberal XML" parser has been written by Sam Ruby for Planet Venus.

comment:5 Changed 15 years ago by cmlenz

  • Milestone changed from 0.4 to 0.5

comment:6 Changed 15 years ago by t.broyer

I've decided to make this "add-on project" live on its own at http://code.google.com/p/genshihtml5

This does not mean, however, that HTML5 support can't be integrated into Genshi in a later release.

comment:7 Changed 15 years ago by cmlenz

  • Milestone 0.5 deleted

As this is available as a separate project for now, and I'm not yet sure I want to incorporate all that functionality into Genshi proper, I'm clearing the milestone.

comment:8 in reply to: ↑ 2 Changed 12 years ago by anatoly techtonik <techtonik@…>

  • Milestone set to 0.6

Replying to tbroyer:

  • html5lib always recovers from bad markup and never throws an exception (except if you tell it to)

Recovery is an absolute MUST for a library that parses external HTML content. Unfortunately, Genshi fails on bad markup and it effectively brings down my Trac. See #375

comment:9 Changed 12 years ago by cboos

  • Milestone 0.6 deleted

Please not for 0.6. Furthermore, if cmlenz clears a milestone (comment:7), at best you can try to convince him to re-assign one, not arbitrarily assign one yourself.

comment:10 Changed 12 years ago by anatoly techtonik <techtonik@…>

Ok. I just want to make sure it won't be forgotten in a pile of "enhancements", because for me it is a defect.

comment:11 Changed 11 years ago by cboos

  • Cc cboos added

What's the status of this?

Genshi claims to support HTML5 since r540; however, the changes look rather minimal compared to the patches proposed here.

Would be interesting to have for future versions of Trac (#T3416, #T9127).

comment:12 Changed 11 years ago by hodgestar

Since the patch just adds new functionality I see no reason why in principle the functionality couldn't go in. There are a couple of things that should probably be corrected beforehand though:

  • The new HTML5Parser makes recursive calls which is (a) slow and (b) limits one to HTML5 documents with no more than sys.getrecursionlimit() levels of child elements.
  • As far as I can tell the HTML5Serializer in the patch is never used (since that would require a change to get_serializer and a serializer attribute on the new HTML5Template class).
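
One way to address the recursion concern in the first point is to walk the parsed tree with an explicit stack. The following is a minimal sketch over a standard-library ElementTree tree rather than the tree type used in the patch; the START/TEXT/END constants and iter_events are hypothetical names that only mimic the shape of Genshi stream events.

```python
import sys
import xml.etree.ElementTree as ET

START, TEXT, END = "START", "TEXT", "END"


def iter_events(root):
    """Yield (kind, data) events for a tree without recursive calls."""
    yield START, root.tag
    if root.text:
        yield TEXT, root.text
    # Each stack entry pairs an open element with an iterator over its
    # remaining children, so depth is bounded by memory, not the call stack.
    stack = [(root, iter(root))]
    while stack:
        elem, children = stack[-1]
        child = next(children, None)
        if child is None:
            stack.pop()
            yield END, elem.tag
            # The tail is text that follows the element inside its parent;
            # the root element has no parent, so its tail is skipped.
            if stack and elem.tail:
                yield TEXT, elem.tail
        else:
            yield START, child.tag
            if child.text:
                yield TEXT, child.text
            stack.append((child, iter(child)))


# Build a document nested deeper than the interpreter's recursion limit.
depth = sys.getrecursionlimit() + 100
deep_root = ET.Element("div")
node = deep_root
for _ in range(depth):
    node = ET.SubElement(node, "div")

print(sum(1 for _ in iter_events(deep_root)))
```

Because the generator keeps its own stack, documents nested deeper than sys.getrecursionlimit() are handled, and events are produced lazily once the tree exists.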

On the plus side html5lib appears to still be actively maintained.

There is an important caveat to html5lib though -- it does not support SAX-style parsing. Creating a SAX-event parser that does the subtree-rearranging required by the HTML5 parsing spec may be a fruitless exercise -- certainly I imagine there are worst case scenarios where the entire document must be examined before the first few SAX events can be emitted. It's not clear to me whether this is an issue in practice but it does rather go against Genshi's stream-based approach to templating.

comment:13 Changed 11 years ago by andref

  • Cc andref.dias@… added

comment:14 Changed 8 years ago by lkraav <leho@…>

  • Cc leho@… added