Opened 13 years ago
Last modified 13 years ago
#429 closed defect
Stream.render bug when strip_whitespace=False — at Version 5
Reported by: | cboos | Owned by: | hodgestar |
---|---|---|---|
Priority: | major | Milestone: | 0.6.1 |
Component: | Serialization | Version: | 0.6 |
Keywords: | cache | Cc: |
Description (last modified by cboos)
For the context, I noticed that TracHacks:WikiGoodiesPlugin's ShowEntities macro was broken, as it displayed things like:
… | … |
instead of the expected:
… | … |
I traced that to the strip_whitespace parameter of Stream.render:
>>> from genshi.builder import tag >>> from genshi.core import Markup >>> table = tag.table(tag.tr(tag.th('…')), tag.td(Markup('…'))) >>> list(table.generate()) [('START', (QName('table'), Attrs()), (None, -1, -1)), ('START', (QName('tr'), Attrs()), (None, -1, -1)), ('START', (QName('th'), Attrs()), (None, -1, -1)), ('TEXT', '…', (None, -1, -1)), ('END', QName('th'), (None, -1, -1)), ('END', QName('tr'), (None, -1, -1)), ('START', (QName('td'), Attrs()), (None, -1, -1)), ('TEXT', <Markup u'…'>, (None, -1, -1)), ('END', QName('td'), (None, -1, -1)), ('END', QName('table'), (None, -1, -1))] >>> table.generate().render('xml') '<table><tr><th>&hellip;</th></tr><td>…</td></table>'
All fine so far, but with strip_whitespace=False we get the unexpected:
>>> table.generate().render('xml', encoding=None, strip_whitespace=False) u'<table><tr><th>&hellip;</th></tr><td>&hellip;</td></table>'
I didn't investigate further yet, what strip_whitespace=False does is to prevent the WhitespaceFilter to be added to the XMLSerializer filters. So the bug happens only when that filter is not present... strange.
Oh wait, this could be a problem with the string cache. I remember when reviewing [1038] that I was afraid that the cache wouldn't make a difference between escaped and unescaped strings, but I was not able to come up with a test case (maybe because I tested with the default strip_whitespace=True). To be continued...
Change History (6)
Changed 13 years ago by cboos
comment:1 follow-up: ↓ 3 Changed 13 years ago by cboos
A simple fix would be:
-
genshi/output.py
212 212 cache_get = cache.get 213 213 if self.cache: 214 214 def _emit(kind, input, output): 215 cache[kind, input ] = output215 cache[kind, input, isinstance(input, Markup)] = output 216 216 return output 217 217 else: 218 218 def _emit(kind, input, output): … … 221 221 for filter_ in self.filters: 222 222 stream = filter_(stream) 223 223 for kind, data, pos in stream: 224 cached = cache_get((kind, data ))224 cached = cache_get((kind, data, isinstance(data, Markup))) 225 225 if cached is not None: 226 226 yield cached
But there's much code duplication with the cache so I'll probably try to refactor things first.
comment:2 Changed 13 years ago by cboos
- Keywords cache added
Oh, and the reason the presence of the WhitespaceFilter makes the test pass is because in my test the two TEXT tokens are accumulated in a single <Markup u'&hellip;…'> TEXT.
It's interesting to see what happens in the following test:
def test_cache_markup(self): loc = (None, -1, -1) stream = Stream([(Stream.START, (QName('foo'), Attrs()), loc), (Stream.TEXT, u'…', loc), (Stream.END, QName('foo'), loc), (Stream.START, (QName('bar'), Attrs()), loc), (Stream.TEXT, Markup('…'), loc), (Stream.END, QName('bar'), loc)]) output = stream.render(XMLSerializer, encoding=None, strip_whitespace=False) self.assertEqual('<foo>&hellip;</foo><bar>…</bar>', output)
Here, when strip_whitespace=False, the &hellip in <foo> will be seen as u'…' which hashes the same as the Markup('…') in <bar>. This explains why in <bar> we'll retrieve the previously cached output &hellip;!
OTOH, when strip_whitespace=True, the u'… in <foo> will be seen as <Markup u'&hellip;'>, and therefore we won't have a collision in the cache.
An alternative fix would be to use the escaped input as a key for the hash instead of the isinstance test, but it remains to be seen if this would be faster or not (escape also does a type(text) is cls).
comment:3 in reply to: ↑ 1 ; follow-up: ↓ 4 Changed 13 years ago by hodgestar
- Owner changed from cmlenz to hodgestar
- Status changed from new to assigned
Replying to cboos:
A simple fix would be:
I'm happy to apply the simple fix (and tests) if you want but I'll wait for the potential refactor or a go-ahead from you.
But there's much code duplication with the cache so I'll probably try to refactor things first.
Wow. I see what you mean. On the other hand I don't see a simple way to refactor it without adding a lot more function calls.
Another option might be to change Markup instances so that they don't hash/compare equal to the associated string (since they're not). I can create a patch for that if you concur.
comment:4 in reply to: ↑ 3 Changed 13 years ago by cboos
Replying to hodgestar:
Another option might be to change Markup instances so that they don't hash/compare equal to the associated string (since they're not). I can create a patch for that if you concur.
While this will probably work in this specific situation, I would be very cautious with such a change:
- there might places which rely on equality between Markup strings and unicode strings (much the same way as 'string' == u'string')
- those very basic operations would have to be done in Python, instead of directly in C (those methods are for now inherited from unicode); this might have a non negligible performance impact overall
If b. happens to not be a blocker, then this approach would at least need to wait for a major version change (0.7) because of a.
The simpler fix OTOH could make it for 0.6.1.
comment:5 Changed 13 years ago by cboos
- Description modified (diff)
those tests pass on r1037 but test_cache_markup_strip_whitespace fails with r1038