#429 closed defect (fixed)
Stream.render bug when strip_whitespace=False
Reported by: | cboos | Owned by: | hodgestar
---|---|---|---
Priority: | major | Milestone: | 0.6.1
Component: | Serialization | Version: | 0.6
Keywords: | cache | Cc: |
Description (last modified by cboos)
For context, I noticed that TracHacks:WikiGoodiesPlugin's ShowEntities macro was broken, as it displayed things like:
&hellip; | &hellip; |
instead of the expected:
&hellip; | … |
I traced that to the strip_whitespace parameter of Stream.render:
    >>> from genshi.builder import tag
    >>> from genshi.core import Markup
    >>> table = tag.table(tag.tr(tag.th('&hellip;')), tag.td(Markup('&hellip;')))
    >>> list(table.generate())
    [('START', (QName('table'), Attrs()), (None, -1, -1)),
     ('START', (QName('tr'), Attrs()), (None, -1, -1)),
     ('START', (QName('th'), Attrs()), (None, -1, -1)),
     ('TEXT', '&hellip;', (None, -1, -1)),
     ('END', QName('th'), (None, -1, -1)),
     ('END', QName('tr'), (None, -1, -1)),
     ('START', (QName('td'), Attrs()), (None, -1, -1)),
     ('TEXT', <Markup u'&hellip;'>, (None, -1, -1)),
     ('END', QName('td'), (None, -1, -1)),
     ('END', QName('table'), (None, -1, -1))]
    >>> table.generate().render('xml')
    '<table><tr><th>&amp;hellip;</th></tr><td>&hellip;</td></table>'
All fine so far, but with strip_whitespace=False we get the unexpected:
    >>> table.generate().render('xml', encoding=None, strip_whitespace=False)
    u'<table><tr><th>&amp;hellip;</th></tr><td>&amp;hellip;</td></table>'
I haven't investigated further yet; all strip_whitespace=False does is prevent the WhitespaceFilter from being added to the XMLSerializer filters. So the bug happens only when that filter is not present... strange.
Oh wait, this could be a problem with the string cache. I remember that when reviewing [1038] I was afraid the cache wouldn't distinguish between escaped and unescaped strings, but I was not able to come up with a test case (maybe because I tested with the default strip_whitespace=True). To be continued...
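The cache suspicion is easy to reproduce in isolation: an instance of a str subclass hashes and compares equal to the plain string with the same characters, so a cache keyed on the raw token value cannot tell escaped from unescaped text. A minimal sketch using a stand-in Markup class and a toy escape/cache, not Genshi's actual code:

```python
# Stand-in for genshi.core.Markup: a str subclass marking "already escaped" text.
class Markup(str):
    pass

def escape(text):
    # Plain text must be entity-escaped; Markup passes through untouched.
    if isinstance(text, Markup):
        return text
    return text.replace('&', '&amp;')

cache = {}

def emit(data):
    cached = cache.get(data)       # BUG: u'&hellip;' and Markup('&hellip;') collide
    if cached is not None:
        return cached
    output = escape(data)
    cache[data] = output
    return output

first = emit('&hellip;')           # plain text: correctly escaped
second = emit(Markup('&hellip;'))  # Markup: hits the stale cache entry
print(first)   # &amp;hellip;
print(second)  # &amp;hellip;  -- wrong, should be &hellip;
```

The collision happens because `hash(Markup('&hellip;')) == hash('&hellip;')` and the two compare equal, exactly as with any str subclass.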
Attachments (5)
Change History (22)
Changed 13 years ago by cboos
comment:1 follow-up: ↓ 3 Changed 13 years ago by cboos
A simple fix would be:
genshi/output.py:

        cache_get = cache.get
        if self.cache:
            def _emit(kind, input, output):
    -           cache[kind, input] = output
    +           cache[kind, input, isinstance(input, Markup)] = output
                return output
        else:
            def _emit(kind, input, output):
        ...
        for filter_ in self.filters:
            stream = filter_(stream)
        for kind, data, pos in stream:
    -       cached = cache_get((kind, data))
    +       cached = cache_get((kind, data, isinstance(data, Markup)))
            if cached is not None:
                yield cached
But there's much code duplication with the cache so I'll probably try to refactor things first.
comment:2 Changed 13 years ago by cboos
- Keywords cache added
Oh, and the reason the presence of the WhitespaceFilter makes the test pass is that in my test the two TEXT tokens are accumulated into a single <Markup u'&amp;hellip;&hellip;'> TEXT token.
It's interesting to see what happens in the following test:
    def test_cache_markup(self):
        loc = (None, -1, -1)
        stream = Stream([(Stream.START, (QName('foo'), Attrs()), loc),
                         (Stream.TEXT, u'&hellip;', loc),
                         (Stream.END, QName('foo'), loc),
                         (Stream.START, (QName('bar'), Attrs()), loc),
                         (Stream.TEXT, Markup('&hellip;'), loc),
                         (Stream.END, QName('bar'), loc)])
        output = stream.render(XMLSerializer, encoding=None,
                               strip_whitespace=False)
        self.assertEqual('<foo>&amp;hellip;</foo><bar>&hellip;</bar>', output)
Here, when strip_whitespace=False, the &amp;hellip; output in <foo> is cached under the key u'&hellip;', which hashes the same as the Markup('&hellip;') in <bar>. This explains why in <bar> we retrieve the previously cached output &amp;hellip;!
OTOH, when strip_whitespace=True, the u'&hellip;' in <foo> will be seen as <Markup u'&amp;hellip;'>, and therefore we won't have a collision in the cache.
An alternative fix would be to use the escaped input as the cache key instead of the isinstance test, but it remains to be seen whether this would be faster (escape also does a type(text) is cls check).
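For illustration, the two candidate keys can be compared side by side on the colliding inputs. This sketch again uses a stand-in Markup class and a toy escape function rather than Genshi's own code; either key shape keeps the two cache entries apart:

```python
# Stand-in Markup class (not Genshi's implementation).
class Markup(str):
    pass

def escape(text):
    # Toy escape: plain text gets entity-escaped, Markup passes through.
    return text if isinstance(text, Markup) else Markup(text.replace('&', '&amp;'))

plain, markup = '&hellip;', Markup('&hellip;')

# Option 1: extend the key tuple with an isinstance flag.
key1_plain = ('TEXT', plain, isinstance(plain, Markup))    # flag False
key1_markup = ('TEXT', markup, isinstance(markup, Markup)) # flag True
assert key1_plain != key1_markup  # the boolean breaks the collision

# Option 2: key on the escaped input; escaping maps the two inputs
# to different strings, so the keys differ as well.
key2_plain = ('TEXT', escape(plain))    # escaped to '&amp;hellip;'
key2_markup = ('TEXT', escape(markup))  # unchanged '&hellip;'
assert key2_plain != key2_markup
```

Option 2 trades the isinstance call in the key for an extra escape call per lookup, which is the performance trade-off mentioned above.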
comment:3 in reply to: ↑ 1 ; follow-up: ↓ 4 Changed 13 years ago by hodgestar
- Owner changed from cmlenz to hodgestar
- Status changed from new to assigned
Replying to cboos:

> A simple fix would be:

I'm happy to apply the simple fix (and tests) if you want, but I'll wait for the potential refactor or a go-ahead from you.

> But there's much code duplication with the cache so I'll probably try to refactor things first.

Wow, I see what you mean. On the other hand, I don't see a simple way to refactor it without adding a lot more function calls.
Another option might be to change Markup instances so that they don't hash/compare equal to the associated string (since they're not). I can create a patch for that if you concur.
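For reference, here is a hedged sketch of what such a change could look like on a plain str subclass (not a proposed Genshi patch); note that both methods now run as Python code instead of the inherited C slots, which is the performance concern raised in the next comment:

```python
class Markup(str):
    """str subclass that never compares or hashes equal to a plain str."""

    def __eq__(self, other):
        # Only equal to other Markup instances with the same characters.
        return isinstance(other, Markup) and str.__eq__(self, other)

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        # Mix the type into the hash so plain strings land elsewhere.
        return hash((Markup, str(self)))

assert Markup('&hellip;') != '&hellip;'
assert Markup('&hellip;') == Markup('&hellip;')

# A cache keyed on the plain string no longer answers Markup lookups.
cache = {'&hellip;': 'cached-escaped-output'}
assert cache.get(Markup('&hellip;')) is None
```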
comment:4 in reply to: ↑ 3 Changed 13 years ago by cboos
Replying to hodgestar:

> Another option might be to change Markup instances so that they don't hash/compare equal to the associated string (since they're not). I can create a patch for that if you concur.
While this will probably work in this specific situation, I would be very cautious with such a change:
a. there might be places which rely on equality between Markup strings and unicode strings (much the same way as 'string' == u'string')
b. those very basic operations would have to be done in Python instead of directly in C (those methods are currently inherited from unicode); this might have a non-negligible performance impact overall
If b. turns out not to be a blocker, this approach would still need to wait for a major version change (0.7) because of a.
The simpler fix OTOH could make it for 0.6.1.
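The overhead feared in point b. can be estimated with a quick micro-benchmark comparing a str subclass that inherits __eq__ from C with one that defines __eq__ in Python. A rough sketch (timings are machine-dependent, so no expected numbers are given):

```python
import timeit

class InheritedEq(str):
    pass  # __eq__/__hash__ stay in C, inherited from str

class PythonEq(str):
    def __eq__(self, other):
        return isinstance(other, PythonEq) and str.__eq__(self, other)
    def __hash__(self):
        return hash((PythonEq, str(self)))

a, b = InheritedEq('&hellip;'), InheritedEq('&hellip;')
c, d = PythonEq('&hellip;'), PythonEq('&hellip;')

t_c = timeit.timeit(lambda: a == b, number=100_000)
t_py = timeit.timeit(lambda: c == d, number=100_000)
print('C-level __eq__:      %.4fs' % t_c)
print('Python-level __eq__: %.4fs' % t_py)
# The Python-level version is typically several times slower per comparison,
# which adds up in a serializer that compares every cached token.
```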
comment:5 Changed 13 years ago by cboos
- Description modified (diff)
comment:6 Changed 13 years ago by cboos
Wait a minute... in the case of TEXT tokens, we always have output == input for Markup instances, don't we?
In this case, the fix becomes:
output.py:

        for filter_ in self.filters:
            stream = filter_(stream)
        for kind, data, pos in stream:
    +       if kind is TEXT and isinstance(data, Markup):
    +           yield data
    +           continue
            cached = cache_get((kind, data))
            if cached is not None:
                yield cached
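In a toy model of the serializer's cache (a stand-in Markup class and escape function, not Genshi's code), this shortcut resolves the collision without changing the cache key: Markup TEXT tokens never touch the cache at all.

```python
class Markup(str):
    pass  # stand-in for genshi.core.Markup

TEXT = 'TEXT'
cache = {}

def escape(text):
    # Toy escape: plain text gets entity-escaped, Markup passes through.
    return text if isinstance(text, Markup) else text.replace('&', '&amp;')

def emit(kind, data):
    # Shortcut: for Markup TEXT tokens output == input, so skip the cache.
    if kind is TEXT and isinstance(data, Markup):
        return data
    cached = cache.get((kind, data))
    if cached is not None:
        return cached
    output = escape(data)
    cache[(kind, data)] = output
    return output

assert emit(TEXT, '&hellip;') == '&amp;hellip;'          # plain text escaped
assert emit(TEXT, Markup('&hellip;')) == '&hellip;'      # no stale cache hit
```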
comment:7 Changed 13 years ago by cboos
oops, excuse the Trac 0.12ism ;-)
output.py:

        for filter_ in self.filters:
            stream = filter_(stream)
        for kind, data, pos in stream:
    +       if kind is TEXT and isinstance(data, Markup):
    +           yield data
    +           continue
            cached = cache_get((kind, data))
            if cached is not None:
                yield cached
comment:8 Changed 13 years ago by cboos
Please review the two patches above.
In my limited performance testing (minimum time for running the tests), the timing didn't change much; it even seemed a little bit faster with the patches. I would be interested in your numbers.
comment:9 follow-ups: ↓ 10 ↓ 12 ↓ 13 Changed 13 years ago by hodgestar
The two patches look good to me, and the tests pass without any noticeable performance hit (I also see perhaps a slight performance gain, although it's below the noise threshold). I tested on 2.5, 2.6, 2.7 and 3.2. I'm happy for the patches to be committed and back-ported to 0.6.x.
Do you have commit access or shall I commit them?
comment:10 in reply to: ↑ 9 Changed 13 years ago by hodgestar
P.S. What is test_ignorable_space for? More tests are obviously good but is it related to this issue? Also, should we add a comment referencing this ticket to the test in case someone wants to refer back to the issue?
comment:11 Changed 13 years ago by cboos
When I was initially looking at the WhitespaceFilter, I noticed there were no tests for ignorable space (i.e. space that would be removed by the filter); the related tests only focused on preserving whitespace when appropriate. So it's indeed unrelated. I didn't intend to leave it in, but simply didn't bother to repost when I saw it was still there in the patch ;-)
So feel free to commit the second patch without that test, and maybe add it separately in a third commit (no, I don't have commit access, and I'm not yet familiar enough with some key aspects of the code base to apply for it).
comment:12 in reply to: ↑ 9 ; follow-up: ↓ 14 Changed 13 years ago by cboos
Replying to hodgestar:

> I tested on ... 3.2.

Good to know I didn't break the py3k compatibility! I forgot to test this; I will in the future.

> I'm happy for the patches to be committed and back-ported to 0.6.x.

Well, I could indeed have based them on the 0.6 branch. Forward porting should make things easier svn-merge-wise. I would suggest you switch to this model rather than the backporting approach still prevalent so far, as it really makes maintenance easier.
comment:13 in reply to: ↑ 9 Changed 13 years ago by cboos
Replying to hodgestar:

> tests pass without any noticeable performance hit (I also see perhaps a slight performance gain although it's below the noise threshold)

I think it should actually be faster now, but that will probably be more visible in Trac, especially when there are many Markup fragments, since we now output them directly instead of hashing, storing, and retrieving them from the cache. That should largely compensate for the cost of the refactoring (if there was any).
I will follow-up with some numbers...
comment:14 in reply to: ↑ 12 Changed 13 years ago by hodgestar
Replying to cboos:

> Well, I could indeed have based them on the 0.6 branch. Forward porting should make things easier svn-merge-wise. I would suggest you switch to this model rather than the backporting approach still prevalent so far, as it really makes maintenance easier.

I think having merges going in both directions would be worse, so I plan to stick to the de facto backporting standard for the 0.6.x branch. I'll consider forward porting for 0.7.x.
comment:15 Changed 13 years ago by cboos
Performance tuning in Python is... surprising at times. The refactoring patch introduced a performance hit that was not compensated by the small savings of the follow-up fix patch.
If I rewrite the refactoring in a slightly different way (t429-refactor-r1038.2.patch), then it doesn't introduce a slowdown.
For the present issue and the fix, it makes a big difference whether the test is in the _get method (as in t429-fix.patch) or inlined in the loop. I therefore provide a second patch (t429-fix.2.patch) with the faster, inlined version.
Note that the fix does provide a small speed-up, but it's only apparent when lots of data is serialized; it is not visible when timing the tests alone.
Changed 13 years ago by cboos
second version, more efficient (no speed penalty after the refactoring)
comment:16 Changed 13 years ago by hodgestar
- Resolution set to fixed
- Status changed from assigned to closed
comment:17 Changed 13 years ago by cboos
Great, thanks for applying!
those tests pass on r1037 but test_cache_markup_strip_whitespace fails with r1038