Edgewall Software

Opened 13 years ago

Last modified 13 years ago

#429 closed defect

Stream.render bug when strip_whitespace=False — at Version 5

Reported by: cboos Owned by: hodgestar
Priority: major Milestone: 0.6.1
Component: Serialization Version: 0.6
Keywords: cache Cc:

Description (last modified by cboos)

For the context, I noticed that TracHacks:WikiGoodiesPlugin's ShowEntities macro was broken, as it displayed things like:

… …

instead of the expected:

…

I traced that to the strip_whitespace parameter of Stream.render:

>>> from genshi.builder import tag
>>> from genshi.core import Markup
>>> table = tag.table(tag.tr(tag.th('…')), tag.td(Markup('…')))
>>> list(table.generate())
[('START', (QName('table'), Attrs()), (None, -1, -1)), ('START', (QName('tr'), Attrs()), (None, -1, -1)), ('START', (QName('th'), Attrs()), (None, -1, -1)), ('TEXT', '&hellip;', (None, -1, -1)), ('END', QName('th'), (None, -1, -1)), ('END', QName('tr'), (None, -1, -1)), ('START', (QName('td'), Attrs()), (None, -1, -1)), ('TEXT', <Markup u'&hellip;'>, (None, -1, -1)), ('END', QName('td'), (None, -1, -1)), ('END', QName('table'), (None, -1, -1))]
>>> table.generate().render('xml')
'<table><tr><th>&amp;hellip;</th></tr><td>&hellip;</td></table>'

All fine so far, but with strip_whitespace=False we get the unexpected:

>>> table.generate().render('xml', encoding=None, strip_whitespace=False)
u'<table><tr><th>&amp;hellip;</th></tr><td>&amp;hellip;</td></table>'

I didn't investigate further yet, what strip_whitespace=False does is to prevent the WhitespaceFilter to be added to the XMLSerializer filters. So the bug happens only when that filter is not present... strange.

Oh wait, this could be a problem with the string cache. I remember when reviewing [1038] that I was afraid that the cache wouldn't make a difference between escaped and unescaped strings, but I was not able to come up with a test case (maybe because I tested with the default strip_whitespace=True). To be continued...

Change History (6)

Changed 13 years ago by cboos

those tests pass on r1037 but test_cache_markup_strip_whitespace fails with r1038

comment:1 follow-up: Changed 13 years ago by cboos

A simple fix would be:

  • genshi/output.py

     
    212212        cache_get = cache.get
    213213        if self.cache:
    214214            def _emit(kind, input, output):
    215                 cache[kind, input] = output
     215                cache[kind, input, isinstance(input, Markup)] = output
    216216                return output
    217217        else:
    218218            def _emit(kind, input, output):
     
    221221        for filter_ in self.filters:
    222222            stream = filter_(stream)
    223223        for kind, data, pos in stream:
    224             cached = cache_get((kind, data))
     224            cached = cache_get((kind, data, isinstance(data, Markup)))
    225225            if cached is not None:
    226226                yield cached

But there's much code duplication with the cache so I'll probably try to refactor things first.

comment:2 Changed 13 years ago by cboos

  • Keywords cache added

Oh, and the reason the presence of the WhitespaceFilter makes the test pass is because in my test the two TEXT tokens are accumulated in a single <Markup u'&amp;hellip;&hellip;'> TEXT.

It's interesting to see what happens in the following test:

    def test_cache_markup(self):
        loc = (None, -1, -1)
        stream = Stream([(Stream.START, (QName('foo'), Attrs()), loc),
                         (Stream.TEXT, u'&hellip;', loc),
                         (Stream.END, QName('foo'), loc),
                         (Stream.START, (QName('bar'), Attrs()), loc),
                         (Stream.TEXT, Markup('&hellip;'), loc),
                         (Stream.END, QName('bar'), loc)])
        output = stream.render(XMLSerializer, encoding=None, 
                               strip_whitespace=False)
        self.assertEqual('<foo>&amp;hellip;</foo><bar>&hellip;</bar>', output)

Here, when strip_whitespace=False, the &hellip in <foo> will be seen as u'&hellip;' which hashes the same as the Markup('&hellip;') in <bar>. This explains why in <bar> we'll retrieve the previously cached output &amp;hellip;!

OTOH, when strip_whitespace=True, the u'&hellip; in <foo> will be seen as <Markup u'&amp;hellip;'>, and therefore we won't have a collision in the cache.

An alternative fix would be to use the escaped input as a key for the hash instead of the isinstance test, but it remains to be seen if this would be faster or not (escape also does a type(text) is cls).

comment:3 in reply to: ↑ 1 ; follow-up: Changed 13 years ago by hodgestar

  • Owner changed from cmlenz to hodgestar
  • Status changed from new to assigned

Replying to cboos:

A simple fix would be:

I'm happy to apply the simple fix (and tests) if you want but I'll wait for the potential refactor or a go-ahead from you.

But there's much code duplication with the cache so I'll probably try to refactor things first.

Wow. I see what you mean. On the other hand I don't see a simple way to refactor it without adding a lot more function calls.

Another option might be to change Markup instances so that they don't hash/compare equal to the associated string (since they're not). I can create a patch for that if you concur.

comment:4 in reply to: ↑ 3 Changed 13 years ago by cboos

Replying to hodgestar:

Another option might be to change Markup instances so that they don't hash/compare equal to the associated string (since they're not). I can create a patch for that if you concur.

While this will probably work in this specific situation, I would be very cautious with such a change:

  1. there might places which rely on equality between Markup strings and unicode strings (much the same way as 'string' == u'string')
  2. those very basic operations would have to be done in Python, instead of directly in C (those methods are for now inherited from unicode); this might have a non negligible performance impact overall

If b. happens to not be a blocker, then this approach would at least need to wait for a major version change (0.7) because of a.

The simpler fix OTOH could make it for 0.6.1.

comment:5 Changed 13 years ago by cboos

  • Description modified (diff)
Note: See TracTickets for help on using tickets.