Edgewall Software

Opened 13 years ago

Closed 13 years ago

Last modified 13 years ago

#429 closed defect (fixed)

Stream.render bug when strip_whitespace=False

Reported by: cboos Owned by: hodgestar
Priority: major Milestone: 0.6.1
Component: Serialization Version: 0.6
Keywords: cache Cc:

Description (last modified by cboos)

For the context, I noticed that TracHacks:WikiGoodiesPlugin's ShowEntities macro was broken, as it displayed things like:

… …

instead of the expected:

…

I traced that to the strip_whitespace parameter of Stream.render:

>>> from genshi.builder import tag
>>> from genshi.core import Markup
>>> table = tag.table(tag.tr(tag.th('…')), tag.td(Markup('…')))
>>> list(table.generate())
[('START', (QName('table'), Attrs()), (None, -1, -1)), ('START', (QName('tr'), Attrs()), (None, -1, -1)), ('START', (QName('th'), Attrs()), (None, -1, -1)), ('TEXT', '&hellip;', (None, -1, -1)), ('END', QName('th'), (None, -1, -1)), ('END', QName('tr'), (None, -1, -1)), ('START', (QName('td'), Attrs()), (None, -1, -1)), ('TEXT', <Markup u'&hellip;'>, (None, -1, -1)), ('END', QName('td'), (None, -1, -1)), ('END', QName('table'), (None, -1, -1))]
>>> table.generate().render('xml')
'<table><tr><th>&amp;hellip;</th></tr><td>&hellip;</td></table>'

All fine so far, but with strip_whitespace=False we get the unexpected:

>>> table.generate().render('xml', encoding=None, strip_whitespace=False)
u'<table><tr><th>&amp;hellip;</th></tr><td>&amp;hellip;</td></table>'

I didn't investigate further yet, what strip_whitespace=False does is to prevent the WhitespaceFilter to be added to the XMLSerializer filters. So the bug happens only when that filter is not present... strange.

Oh wait, this could be a problem with the string cache. I remember when reviewing [1038] that I was afraid that the cache wouldn't make a difference between escaped and unescaped strings, but I was not able to come up with a test case (maybe because I tested with the default strip_whitespace=True). To be continued...

Attachments (5)

t429-regression-test-r1162.patch (2.0 KB) - added by cboos 13 years ago.
those tests pass on r1037 but test_cache_markup_strip_whitespace fails with r1038
t429-refactor-r1038.patch (5.9 KB) - added by cboos 13 years ago.
first refactor r1038... (applies on r1162)
t429-fix.patch (2.5 KB) - added by cboos 13 years ago.
... then fix this issue (applies on t429-refactor-r1038.patch)
t429-refactor-r1038.2.patch (5.8 KB) - added by cboos 13 years ago.
second version, more efficient (no speed penalty after the refactoring)
t429-fix.2.patch (3.8 KB) - added by cboos 13 years ago.
second version, test inlined

Download all attachments as: .zip

Change History (22)

Changed 13 years ago by cboos

those tests pass on r1037 but test_cache_markup_strip_whitespace fails with r1038

comment:1 follow-up: Changed 13 years ago by cboos

A simple fix would be:

  • genshi/output.py

     
    212212        cache_get = cache.get
    213213        if self.cache:
    214214            def _emit(kind, input, output):
    215                 cache[kind, input] = output
     215                cache[kind, input, isinstance(input, Markup)] = output
    216216                return output
    217217        else:
    218218            def _emit(kind, input, output):
     
    221221        for filter_ in self.filters:
    222222            stream = filter_(stream)
    223223        for kind, data, pos in stream:
    224             cached = cache_get((kind, data))
     224            cached = cache_get((kind, data, isinstance(data, Markup)))
    225225            if cached is not None:
    226226                yield cached

But there's much code duplication with the cache so I'll probably try to refactor things first.

comment:2 Changed 13 years ago by cboos

  • Keywords cache added

Oh, and the reason the presence of the WhitespaceFilter makes the test pass is because in my test the two TEXT tokens are accumulated in a single <Markup u'&amp;hellip;&hellip;'> TEXT.

It's interesting to see what happens in the following test:

    def test_cache_markup(self):
        loc = (None, -1, -1)
        stream = Stream([(Stream.START, (QName('foo'), Attrs()), loc),
                         (Stream.TEXT, u'&hellip;', loc),
                         (Stream.END, QName('foo'), loc),
                         (Stream.START, (QName('bar'), Attrs()), loc),
                         (Stream.TEXT, Markup('&hellip;'), loc),
                         (Stream.END, QName('bar'), loc)])
        output = stream.render(XMLSerializer, encoding=None, 
                               strip_whitespace=False)
        self.assertEqual('<foo>&amp;hellip;</foo><bar>&hellip;</bar>', output)

Here, when strip_whitespace=False, the &hellip in <foo> will be seen as u'&hellip;' which hashes the same as the Markup('&hellip;') in <bar>. This explains why in <bar> we'll retrieve the previously cached output &amp;hellip;!

OTOH, when strip_whitespace=True, the u'&hellip; in <foo> will be seen as <Markup u'&amp;hellip;'>, and therefore we won't have a collision in the cache.

An alternative fix would be to use the escaped input as a key for the hash instead of the isinstance test, but it remains to be seen if this would be faster or not (escape also does a type(text) is cls).

comment:3 in reply to: ↑ 1 ; follow-up: Changed 13 years ago by hodgestar

  • Owner changed from cmlenz to hodgestar
  • Status changed from new to assigned

Replying to cboos:

A simple fix would be:

I'm happy to apply the simple fix (and tests) if you want but I'll wait for the potential refactor or a go-ahead from you.

But there's much code duplication with the cache so I'll probably try to refactor things first.

Wow. I see what you mean. On the other hand I don't see a simple way to refactor it without adding a lot more function calls.

Another option might be to change Markup instances so that they don't hash/compare equal to the associated string (since they're not). I can create a patch for that if you concur.

comment:4 in reply to: ↑ 3 Changed 13 years ago by cboos

Replying to hodgestar:

Another option might be to change Markup instances so that they don't hash/compare equal to the associated string (since they're not). I can create a patch for that if you concur.

While this will probably work in this specific situation, I would be very cautious with such a change:

  1. there might places which rely on equality between Markup strings and unicode strings (much the same way as 'string' == u'string')
  2. those very basic operations would have to be done in Python, instead of directly in C (those methods are for now inherited from unicode); this might have a non negligible performance impact overall

If b. happens to not be a blocker, then this approach would at least need to wait for a major version change (0.7) because of a.

The simpler fix OTOH could make it for 0.6.1.

comment:5 Changed 13 years ago by cboos

  • Description modified (diff)

comment:6 Changed 13 years ago by cboos

Wait a minute... in the case of TEXT tokens, we always have output == input for Markup instances, don't we?

In this case, the fix becomes:

  • output.py

     
    221221        for filter_ in self.filters:
    222222            stream = filter_(stream)
    223223        for kind, data, pos in stream:
     224            if kind is TEXT and isinstance(data, Markup):
     225                yield data
     226                continue
    224227            cached = cache_get((kind, data))
    225228            if cached is not None:
    226229                yield cached

comment:7 Changed 13 years ago by cboos

oops, excuse the Trac 0.12ism ;-)

  • output.py

     
    221221        for filter_ in self.filters:
    222222            stream = filter_(stream)
    223223        for kind, data, pos in stream:
     224            if kind is TEXT and isinstance(data, Markup):
     225                yield data
     226                continue
    224227            cached = cache_get((kind, data))
    225228            if cached is not None:
    226229                yield cached

Changed 13 years ago by cboos

first refactor r1038... (applies on r1162)

Changed 13 years ago by cboos

... then fix this issue (applies on t429-refactor-r1038.patch)

comment:8 Changed 13 years ago by cboos

Please review the two patches above.

In my limited performance testing (min time for running the tests), the timing didn't change much, it even seemed a little bit faster with the patches... I would be interested in your numbers.

comment:9 follow-ups: Changed 13 years ago by hodgestar

The two patches look good to me and tests pass without any noticeable performance hit (I also see perhaps a slight performance gain although it's below the noise threshold). I tested on 2.5, 2.6, 2.7 and 3.2. I'm happy for the patches to be commit and back-ported to 0.6.x.

Do you have commit access or shall I commit them?

comment:10 in reply to: ↑ 9 Changed 13 years ago by hodgestar

P.S. What is test_ignorable_space for? More tests are obviously good but is it related to this issue? Also, should we add a comment referencing this ticket to the test in case someone wants to refer back to the issue?

comment:11 Changed 13 years ago by cboos

When I initially was looking at the WhitespaceFilter, I noticed there was no tests for ignorable space (i.e. space that would be removed by the filter), the related tests only focused on preserving the whitespace when appropriate. So, it's indeed unrelated, and I didn't intend to leave it in but simply didn't bother to repost when I saw it was still there in the patch ;-)

So feel free to commit the second patch without that test, and maybe add it separately in a third commit (no, I don't have commit access and I'm not yet familiar enough with some keys aspects of the code base to apply for).

comment:12 in reply to: ↑ 9 ; follow-up: Changed 13 years ago by cboos

Replying to hodgestar:

I tested on ... 3.2.

Good to know I didn't break the py3k compatibility! Forgot to test this, will do in the future.

I'm happy for the patches to be commit and back-ported to 0.6.x.

Well, I could indeed have based them on the 0.6 branch. Forward porting should make things easier svn merge wise. I would suggest you switch to this model rather than the "backport" still prevalent so far, as this really makes maintenance easier.

comment:13 in reply to: ↑ 9 Changed 13 years ago by cboos

Replying to hodgestar:

tests pass without any noticeable performance hit (I also see perhaps a slight performance gain although it's below the noise threshold)

I think it should actually be faster now, but that will probably be more visible in Trac, especially when there are many Markup fragments, as we now output them directly instead of hashing / storing / retrieving them from that cache. That should largely compensate the cost of the refactoring (if there was any).

I will follow-up with some numbers...

comment:14 in reply to: ↑ 12 Changed 13 years ago by hodgestar

Replying to cboos:

Well, I could indeed have based them on the 0.6 branch. Forward porting should make things easier svn merge wise. I would suggest you switch to this model rather than the "backport" still prevalent so far, as this really makes maintenance easier.

I think having merges going in both directions would be worse so I plan to stick to the de facto backporting standard for the 0.6.x branch. I'll consider forward porting for 0.7.x.

comment:15 Changed 13 years ago by cboos

Performance tuning in Python is ... surprising at times. The refactoring patch had a performance hit, not compensated by the little savings of the next patch.

If I rewrite the refactoring in a slightly different way (t429-refactor-r1038.2.patch), then it doesn't introduce a slowdown.

For the present issue and the fix, it makes a big difference if the test is present in the _get method (like in t429-fix.patch) or inlined in the loop. I therefore provide a second patch (t429-fix.2.patch) for the faster version.

Note that the fix does provide a small speed-up, but it's only apparent when lots of data is serialized, i.e. it's not apparent when timing the tests alone.

Changed 13 years ago by cboos

second version, more efficient (no speed penalty after the refactoring)

Changed 13 years ago by cboos

second version, test inlined

comment:16 Changed 13 years ago by hodgestar

  • Resolution set to fixed
  • Status changed from assigned to closed

Thanks for all the patches! Refactoring patch committed to trunk in r1163 and fix in r1164. Both backported to the 0.6.x branch in r1165.

comment:17 Changed 13 years ago by cboos

Great, thanks for applying!

Note: See TracTickets for help on using tickets.