Opened 18 years ago
Last modified 17 years ago
#70 new enhancement
Genshi Markup to lxml fast converter
Reported by: | ianb@… | Owned by: | cmlenz |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | General | Version: | 0.3.3 |
Keywords: | helpwanted | Cc: | ianb@… |
Description
I'm doing a lot of stuff with lxml now, much of which takes the form of a pipeline, transforming output through multiple stages. There are opportunities to do this very efficiently if the markup isn't constantly serialized and reparsed. lxml itself is uniquely qualified for this role -- in part because of the tools it has, but also largely because it has a pretty good HTML parser.
Anyway, the sad part is that nothing produces lxml output currently except other lxml tools. Genshi doesn't either, for reasons I understand (even if I'm a little skeptical that they really apply to realistic situations). But this wouldn't be too big a problem if Genshi had a fast way to transform its markup to lxml without a serialization step. (Pyrex, even? Even a pure Python transformation would be fast, I'm sure.)
Anyway, that's what I'm suggesting here.
Change History (3)
comment:1 Changed 18 years ago by cmlenz
- Keywords helpwanted added
- Milestone 0.4 deleted
comment:2 Changed 17 years ago by matt@…
Do you mean a function/class that would take a Genshi stream and return an lxml ElementTree? If so, it shouldn't be too hard to write, but it should probably be solved in the general case as a Genshi to SAX event converter. Then you could use lxml's lxml.sax.ElementTreeContentHandler interface to do Genshi to lxml...
http://codespeak.net/lxml/sax.html
But, given the slowness of Python looping, it might actually be slower than serializing and reparsing.
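A minimal sketch of the kind of converter being discussed, for illustration only. It feeds events straight into a `TreeBuilder` target (a swapped-in technique, not the SAX handler route mentioned above) so no markup string is ever produced. The function name `events_to_tree` is hypothetical. The sketch assumes events are `(kind, data, pos)` tuples whose kinds compare equal to the strings `"START"`, `"END"` and `"TEXT"`, and that tag names are already in `{namespace}local` form; whether a real Genshi stream can be passed in unchanged is an assumption that would need checking. It falls back to the standard library's ElementTree when lxml is not installed.

```python
try:
    from lxml import etree  # preferred target: builds a real lxml tree
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree


def events_to_tree(events):
    """Build an element tree from (kind, data, pos) events without serializing.

    Hypothetical helper: only START, END and TEXT events are handled.
    """
    builder = etree.TreeBuilder()
    for kind, data, _pos in events:
        if kind == "START":
            tag, attrs = data  # attrs: iterable of (name, value) pairs
            builder.start(str(tag), {str(k): v for k, v in attrs})
        elif kind == "END":
            builder.end(str(data))
        elif kind == "TEXT":
            builder.data(str(data))
        # Other kinds (comments, PIs, doctype, namespace events) are skipped.
    return builder.close()


# Hand-written stand-in for a markup event stream.
sample = [
    ("START", ("p", [("class", "note")]), None),
    ("TEXT", "Hello, ", None),
    ("START", ("b", []), None),
    ("TEXT", "lxml", None),
    ("END", "b", None),
    ("END", "p", None),
]
root = events_to_tree(sample)
```

The loop runs once per event in Python, which is exactly the cost raised above, so a converter like this would need to be measured against serialize-and-reparse before being preferred.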
comment:3 Changed 17 years ago by ianb@…
After doing some benchmarks, serialization and re-parsing could very well be the fastest way of creating an lxml tree. The lxml parsing will probably be a very small part of the time involved, and the Genshi serialization will be most of the time. Only if you can save time over serialization will a more specific technique be advantageous.
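To see where the time goes in a serialize-and-reparse pipeline, each stage can be timed separately. The following is a rough sketch, not the benchmark referred to above: `time_stages` is a hypothetical helper, and the standard library's ElementTree stands in for both the Genshi serializer and the lxml parser, so the numbers it produces say nothing about either library.

```python
import time
import xml.etree.ElementTree as ET


def time_stages(stages, value, repeat=100):
    """Pass value through each (name, func) stage in order.

    Returns a dict of total seconds spent in each stage over all repeats.
    """
    totals = {name: 0.0 for name, _ in stages}
    for _ in range(repeat):
        current = value
        for name, func in stages:
            started = time.perf_counter()
            current = func(current)
            totals[name] += time.perf_counter() - started
    return totals


# Stand-in document and stages; swap in the real serializer and parser to compare.
doc = ET.fromstring("<ul>" + "<li>item</li>" * 200 + "</ul>")
timings = time_stages(
    [("serialize", ET.tostring), ("reparse", ET.fromstring)],
    doc,
)
```

Replacing the two stand-in callables with the actual serialization and parsing calls would show how the total splits between them.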
Would be nice... patch, anyone? :-)