Context Navigation

← Previous Ticket
Next Ticket →

#429 closed defect

Stream.render bug when strip_whitespace=False — at Version 5

Reported by:	cboos	Owned by:	hodgestar
Priority:	major	Milestone:	0.6.1
Component:	Serialization	Version:	0.6
Keywords:	cache	Cc:

Description (last modified by cboos)

For the context, I noticed that TracHacks:WikiGoodiesPlugin's ShowEntities macro was broken, as it displayed things like:

…

instead of the expected:

…

I traced that to the strip_whitespace parameter of Stream.render:

>>> from genshi.builder import tag
>>> from genshi.core import Markup
>>> table = tag.table(tag.tr(tag.th('&hellip;')), tag.td(Markup('&hellip;')))
>>> list(table.generate())
[('START', (QName('table'), Attrs()), (None, -1, -1)), ('START', (QName('tr'), Attrs()), (None, -1, -1)), ('START', (QName('th'), Attrs()), (None, -1, -1)), ('TEXT', '&hellip;', (None, -1, -1)), ('END', QName('th'), (None, -1, -1)), ('END', QName('tr'), (None, -1, -1)), ('START', (QName('td'), Attrs()), (None, -1, -1)), ('TEXT', <Markup u'&hellip;'>, (None, -1, -1)), ('END', QName('td'), (None, -1, -1)), ('END', QName('table'), (None, -1, -1))]
>>> table.generate().render('xml')
'<table><tr><th>&amp;hellip;</th></tr><td>&hellip;</td></table>'

All fine so far, but with strip_whitespace=False we get the unexpected:

>>> table.generate().render('xml', encoding=None, strip_whitespace=False)
u'<table><tr><th>&amp;hellip;</th></tr><td>&amp;hellip;</td></table>'

I didn't investigate further yet, what strip_whitespace=False does is to prevent the WhitespaceFilter to be added to the XMLSerializer filters. So the bug happens only when that filter is not present... strange.

Oh wait, this could be a problem with the string cache. I remember when reviewing [1038] that I was afraid that the cache wouldn't make a difference between escaped and unescaped strings, but I was not able to come up with a test case (maybe because I tested with the default strip_whitespace=True). To be continued...

Change History (6)

Changed 13 years ago by cboos

Attachment t429-regression-test-r1162.patch added

those tests pass on r1037 but test_cache_markup_strip_whitespace fails with r1038

comment:1 follow-up: ↓ 3 Changed 13 years ago by cboos

A simple fix would be:

genshi/output.py

         cache_get = cache.get
         if self.cache:
             def _emit(kind, input, output):
                 cache[kind, input] = output
+                cache[kind, input, isinstance(input, Markup)] = output
                 return output
         else:
             def _emit(kind, input, output):
 …
         for filter_ in self.filters:
             stream = filter_(stream)
         for kind, data, pos in stream:
             cached = cache_get((kind, data))
+            cached = cache_get((kind, data, isinstance(data, Markup)))
             if cached is not None:
                 yield cached

But there's much code duplication with the cache so I'll probably try to refactor things first.

comment:2 Changed 13 years ago by cboos

Keywords cache added

Oh, and the reason the presence of the WhitespaceFilter makes the test pass is because in my test the two TEXT tokens are accumulated in a single <Markup u'&hellip;…'> TEXT.

It's interesting to see what happens in the following test:

    def test_cache_markup(self):
        loc = (None, -1, -1)
        stream = Stream([(Stream.START, (QName('foo'), Attrs()), loc),
                         (Stream.TEXT, u'&hellip;', loc),
                         (Stream.END, QName('foo'), loc),
                         (Stream.START, (QName('bar'), Attrs()), loc),
                         (Stream.TEXT, Markup('&hellip;'), loc),
                         (Stream.END, QName('bar'), loc)])
        output = stream.render(XMLSerializer, encoding=None, 
                               strip_whitespace=False)
        self.assertEqual('<foo>&amp;hellip;</foo><bar>&hellip;</bar>', output)

Here, when strip_whitespace=False, the &hellip in <foo> will be seen as u'…' which hashes the same as the Markup('…') in <bar>. This explains why in <bar> we'll retrieve the previously cached output &hellip;!

OTOH, when strip_whitespace=True, the u'… in <foo> will be seen as <Markup u'&hellip;'>, and therefore we won't have a collision in the cache.

An alternative fix would be to use the escaped input as a key for the hash instead of the isinstance test, but it remains to be seen if this would be faster or not (escape also does a type(text) is cls).

comment:3 in reply to: ↑ 1 ; follow-up: ↓ 4 Changed 13 years ago by hodgestar

Owner changed from cmlenz to hodgestar
Status changed from new to assigned

Replying to cboos:

A simple fix would be:

I'm happy to apply the simple fix (and tests) if you want but I'll wait for the potential refactor or a go-ahead from you.

But there's much code duplication with the cache so I'll probably try to refactor things first.

Wow. I see what you mean. On the other hand I don't see a simple way to refactor it without adding a lot more function calls.

Another option might be to change Markup instances so that they don't hash/compare equal to the associated string (since they're not). I can create a patch for that if you concur.

comment:4 in reply to: ↑ 3 Changed 13 years ago by cboos

Replying to hodgestar:

Another option might be to change Markup instances so that they don't hash/compare equal to the associated string (since they're not). I can create a patch for that if you concur.

While this will probably work in this specific situation, I would be very cautious with such a change:

there might places which rely on equality between Markup strings and unicode strings (much the same way as 'string' == u'string')
those very basic operations would have to be done in Python, instead of directly in C (those methods are for now inherited from unicode); this might have a non negligible performance impact overall

If b. happens to not be a blocker, then this approach would at least need to wait for a major version change (0.7) because of a.

The simpler fix OTOH could make it for 0.6.1.

comment:5 Changed 13 years ago by cboos

Description modified (diff)

Note: See TracTickets for help on using tickets.

Download in other formats: