Opened 18 years ago

Last modified 17 years ago

#70 new enhancement

Genshi Markup to lxml fast converter

Reported by: ianb@…
Owned by: cmlenz
Priority: major
Milestone:
Component: General
Version: 0.3.3
Keywords: helpwanted
Cc: ianb@…

Description

I'm doing a lot of work with lxml now, much of which takes the form of a pipeline that transforms output through multiple stages. There are opportunities to do this very efficiently if the markup isn't constantly serialized and reparsed. lxml itself is well suited to this role -- in part because of the tools it has, but also largely because it has a pretty good HTML parser.

Anyway, the sad part is that nothing currently produces lxml output except other lxml tools. Genshi doesn't either, for reasons I understand (even if I'm a little skeptical that they really apply to realistic situations). But this wouldn't be too big a problem if Genshi had a fast way to transform its markup to lxml without a serialization step. (Pyrex, even? Even a Python transformation would be fast, I'm sure.)

Anyway, that's what I'm suggesting here.

Change History (3)

comment:1 Changed 18 years ago by cmlenz

  • Keywords helpwanted added
  • Milestone 0.4 deleted

Would be nice... patch, anyone? :-)

comment:2 Changed 17 years ago by matt@…

Do you mean a function/class that would take a Genshi stream and return an lxml ElementTree? If so, it shouldn't be too hard to write, but it should probably be solved in the general case as a Genshi-to-SAX event converter. Then you could feed those events to lxml's lxml.sax.ElementTreeContentHandler to go from Genshi to lxml...

http://codespeak.net/lxml/sax.html

But, given the slowness of Python looping, it might actually be slower than serializing and reparsing.
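The converter suggested above could look roughly like the following. This is only a minimal sketch, not an existing Genshi API: the function name stream_to_sax and the helper _split are hypothetical, and it assumes the stream yields (kind, data, pos) tuples whose kinds compare equal to the strings "START", "END" and "TEXT", with names that either carry namespace/localname attributes (as Genshi's QName does) or are plain strings. Namespace prefix mappings, comments, processing instructions and doctypes are ignored for brevity.

```python
from xml.sax.xmlreader import AttributesNSImpl


def _split(name):
    """Return (namespace, localname) for a QName-like object or a plain string."""
    namespace = getattr(name, "namespace", None) or None
    localname = getattr(name, "localname", name)
    return namespace, localname


def stream_to_sax(stream, handler):
    """Replay a Genshi-style event stream as namespace-aware SAX events.

    Hypothetical helper: `stream` is any iterable of (kind, data, pos)
    tuples and `handler` is any SAX ContentHandler.
    """
    handler.startDocument()
    for kind, data, _pos in stream:
        if kind == "START":
            tag, attrs = data
            # SAX wants attributes keyed by (namespace, localname).
            values = {_split(key): value for key, value in attrs}
            qnames = {key: key[1] for key in values}
            handler.startElementNS(
                _split(tag), None, AttributesNSImpl(values, qnames)
            )
        elif kind == "END":
            handler.endElementNS(_split(data), None)
        elif kind == "TEXT":
            handler.characters(str(data))
        # Every other event kind is skipped in this sketch.
    handler.endDocument()


# Possible use with lxml (assumes lxml and Genshi are installed):
#   from lxml.sax import ElementTreeContentHandler
#   handler = ElementTreeContentHandler()
#   stream_to_sax(template.generate(), handler)
#   tree = handler.etree
```

Because the loop runs once per event in Python, this is exactly the code path whose speed is in question; it would need to be timed against serializing and reparsing before assuming it is a win.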

comment:3 Changed 17 years ago by ianb@…

After running some benchmarks, I think serializing and reparsing could very well be the fastest way of creating an lxml tree. The lxml parsing will probably be a very small part of the time involved, and the Genshi serialization will account for most of it. A more specific technique will only be advantageous if it takes less time than the serialization does.