Opened 14 years ago
Last modified 8 years ago
#423 assigned defect
Genshi should not deduce plural/singular by sending garbage strings to ngettext
Reported by: | anonymous | Owned by: | hodgestar |
---|---|---|---|
Priority: | major | Milestone: | 0.9 |
Component: | Internationalization | Version: | 0.6 |
Keywords: | choose singular plural | Cc: | dfraser, palgarvio |
Description
Currently genshi i18n filter deduces whether a form is pluralized by sending garbage to ngettext. However, a GNU conforming ngettext uses only the singular form to do lookups from hash table. The ngettext interface is designed to be used for English only; in case of no translation found, ngettext is supposed to return the first string if and only if n == 1. To do anything otherwise is just b0rken.
Sending garbage strings also makes debugging real i18n problems harder.
Attachments (2)
Change History (29)
Changed 14 years ago by antti@…
comment:1 Changed 14 years ago by hodgestar
- Owner changed from cmlenz to hodgestar
- Status changed from new to assigned
comment:2 in reply to: ↑ description Changed 14 years ago by cboos
- Component changed from General to Internationalization
Replying to anonymous:
The ngettext interface is designed to be used for English only
Well, it's rather that claim which is bogus...
ngettext is supposed to return the first string if and only if n == 1.
Ditto. If testing for plural would be a simple matter of checking numeral != 1 as you do in your patch, life would be easier ;-)
Sending garbage strings also makes debugging real i18n problems harder.
I don't claim that doing things like they are done currently is the best possible solution, but oversimplifying the problem is not the way to go either.
comment:3 Changed 14 years ago by cboos
comment:4 Changed 14 years ago by dfraser
- Cc dfraser added
comment:5 follow-up: ↓ 6 Changed 14 years ago by anonymous
The original code does not test if a plural would be used in the target language, only whether ngettext would use msgid1 or msgid2 for the given plural. If the ngettext implementation is conforming it will return msgid1 when n == 1, msgid2 otherwise.
As the gettext documentation says, "This has the consequence that programs without language catalogs can display the correct strings only if the program itself is written using a Germanic language. This is a limitation but since the GNU C library (as well as the GNU gettext package) are written as part of the GNU package and the coding standards for the GNU project require program being written in English, this solution nevertheless fulfills its purpose."
comment:6 in reply to: ↑ 5 ; follow-up: ↓ 13 Changed 14 years ago by antti@…
Further clarification
Solaris:
"The ngettext(), dngettext(), and dcngettext() functions return the message string if the search succeeds. If the search fails, msgid1 is returned if n == 1. Otherwise msgid2 is returned."
Glibc:
"If a translation was found in one of the specified catalogs, the appropriate plural form is converted to the locale's codeset and returned. The resulting string is statically allocated and must not be modified or freed. Otherwise msgid or msgid_plural is returned, as described above."
As the original comment says, there should be no match for the given msgids in any catalogues, thus a conforming ngettext returns msgid2/msgid_plural exactly when n != 1, no matter which locale, translation settings, loaded message catalogues or so we are using.
comment:7 follow-up: ↓ 11 Changed 14 years ago by hodgestar
I think antti is correct. The current is_plural implementation sends in a singular and plural form designed to not be present in any catalogue. If the catalogue search fails conforming ngettext implementations should return msgid1 if n == 1 or msgid2 otherwise. Hence the old function and the new function are identical for conforming ngettext functions.
comment:8 Changed 14 years ago by hodgestar
One could claim that antti's initial problems were caused by having a broken ngettext but it seems a shame to do a search in the translation catalogue that is designed to fail when there is a much simpler route to the correct answer.
comment:9 Changed 14 years ago by hodgestar
There is also a question in my mind about whether non-conforming ngettext implementations are useful to some people? E.g. someone writing code in language which has two plural forms but which are used for different values of n to the English ones.
comment:10 Changed 14 years ago by cboos
Well, I still disagree: the original code does test if a plural was actually used in the target language.
In the following code:
479 def _is_plural(self, numeral, ngettext): 480 # XXX: should we test which form was chosen like this!?!?!? 481 # There should be no match in any catalogue for these singular and 482 # plural test strings 483 singular = u'O\x85\xbe\xa9\xa8az\xc3?\xe6\xa1\x02n\x84\x93' 484 plural = u'\xcc\xfb+\xd3Pn\x9d\tT\xec\x1d\xda\x1a\x88\x00' 485 return ngettext(singular, plural, numeral) == plural
What happens is the following:
- the proper plural form will be computed from the numeral value,
- when trying to retrieve the corresponding plural form from the catalog, none will be found, either singular or plural will be returned
- if it's the plural value, we know the plural form was not 0, and well, that _is_plural should return True
Note that this is different from the situation described here:
If no message catalog is found msgid1 is returned if n == 1, otherwise msgid2.
as in our case we (normally) have a message catalog, and therefore a "Plural-Forms:" entry.
Take the following interactive session as an example:
- we start in a Trac checkout where the compiled catalogs are below trac/locale
- we pick the Romanian locale, where we have "Plural-Forms: nplurals=3; plural=n==1 ? 0 : (n==0 || (n%100 > 0 && n%100 < 20)) ? 1 : 2;\n")
from babel.support import Translations ro = Translations.load('trac/locale', ['ro']) ro.ngettext(0, 'singular', 'plural') >>> ro.ngettext('singular', 'plural', 0) 'plural' >>> ro.ngettext('singular', 'plural', 1) 'singular' >>> ro.ngettext('singular', 'plural', 2) 'plural' >>>
Where I agree with you is that the current way ChooseDirective._is_plural does this is a kind of a hack, as acknowledged by the original comment. Maybe we could make available some kind of Translations.pluralform(n) method, but the advantage of the current method is that it will work with any implementation:
>>> import gettext >>> ro = gettext.translation('messages', 'trac/locale', ['ro']) >>> ro <gettext.GNUTranslations instance at 0x02BB4670> >>> [ro.ngettext('singular', 'plural', numeral) for numeral in (0, 1, 2)] ['plural', 'singular', 'plural']
comment:11 in reply to: ↑ 7 Changed 14 years ago by cboos
Replying to hodgestar:
... If the catalogue search fails conforming ngettext implementations should return msgid1 if n == 1 or msgid2 otherwise. Hence the old function and the new function are identical for conforming ngettext functions.
It all depends what you mean by if the catalogue search fails:
- if no catalog file is found, then sure, same thing. But that's not the usual situation.
- if msgid1 is not found in the catalog, then definitely no, see my examples in comment:10 (and if you know other implementations than the default gettext.GNUTranslations and Babel's babel.support.Translations let me know and we'll test those as well)
comment:13 in reply to: ↑ 6 Changed 14 years ago by cboos
Replying to antti@…:
... Glibc:
"If a translation was found in one of the specified catalogs, the appropriate plural form is converted to the locale's codeset and returned. The resulting string is statically allocated and must not be modified or freed. Otherwise msgid or msgid_plural is returned, as described above."
Ok, but glibc seems to actually behave like its Python counterpart as shown in the example from comment:10. Here using cygwin gettext tools (ngettext (GNU gettext-runtime) 0.17):
$ TEXTDOMAIN=trac/locale LANG=ro ngettext.exe singular plural 0 plural $ TEXTDOMAIN=trac/locale LANG=ro ngettext.exe singular plural 1 singular $ TEXTDOMAIN=trac/locale LANG=ro ngettext.exe singular plural 2 plural
(and no, no "singular" string to be found in that catalog)
I think that this behavior is OK, so I wouldn't call it a bug, but rather that the documentation should be updated.
comment:14 Changed 14 years ago by cboos
Ok ok, wait... how I manage to convince myself by picking the wrong example ;-)
comment:15 Changed 14 years ago by cboos
Redoing all of the above with a correct example (e.g. pt_BR which has "Plural-Forms: nplurals=2; plural=(n > 1)\n" and hence actually differs from the default (n != 1) for (0, 1, 2), contrary to the Romanian one), I actually confirm the reported non-intuitive behavior. Damn.
comment:16 Changed 14 years ago by hodgestar
If we're all in agreement I'll commit a version of antti's patch with a comment in the code summarising the discussion and linking to this ticket?
comment:17 follow-up: ↓ 18 Changed 14 years ago by cboos
Yes, the patch is definitely an improvement, as it makes clear what happens.
Maybe the i18n:choose directive documentation should be clarified accordingly (e.g. ... an integer which will allow to choose the plural/singular form. -> ... an integer which will allow to choose the plural/singular form according to the English language plural rule (i.e. singular form corresponds to the value 1 exclusively).)
... and apologies for the noise, I really should have been more careful in my tests.
comment:18 in reply to: ↑ 17 Changed 14 years ago by hodgestar
Replying to cboos:
... and apologies for the noise, I really should have been more careful in my tests.
I certainly didn't have the full picture at the start and every experiment should have a control. :)
comment:19 Changed 14 years ago by anonymous
"Maybe the i18n:choose directive documentation should be clarified accordingly (e.g. ... an integer which will allow to choose the plural/singular form. -> ... an integer which will allow to choose the plural/singular form according to the English language plural rule (i.e. singular form corresponds to the value 1 exclusively)."
Hmm, it should specifically say something like "If a translation for the message is provided in the catalogs, an appropriate grammatical form is chosen for the given number. However, if the message is not translated in message catalogs, the i18n:singular or i18n:plural form as provided in the template itself will be chosen, according to the English language plural rule (i.e. singular form corresponds to the value 1 exclusively)."
comment:20 Changed 14 years ago by cboos
- Keywords choose singular plural added
I think some further analysis is still needed.
The main problem with the ChooseDirective is that it has to combine the singular or plural msgid translation (the msgstr as produced by ngettext) with the singular or plural substreams (in msgbuf.translate(), see source:trunk/genshi/filters/i18n.py@1158:402-422#L376).
The initial code was written under the assumption that the selection between the i18n:singular or i18n:plural substreams would be done according to the same "Plural-Form:" rule of the target language than the one used to select between the msgstrs. Hence the complicated "fake" lookup done in _is_plural which I believe was expected to work the way I described it in comment:10.
But as antti discovered, it actually doesn't work that way, and in fact all what the code did was to select between singular and plural substreams according to the English rule (n != 1). Therefore antti was right in proposing his patch as a simpler equivalent replacement of the current behavior.
Now the question remains whether this is really the right thing to do, as I believe this can lead to combine a singular translated msgstr with the plural substream, or conversely a plural msgstr with the singular substream. Not necessarily always a problem, but I believe this can lead to subtle bugs.
(as I was already dead wrong once, take the above as my possibly biased reading of the code, until I come up with a test case)
comment:21 Changed 14 years ago by antti@…
[ sorry for posting some comments as anon, saved my prefs now ]
That is a good point. However, I have to point out, that if you i18n:choose between <i18n:singular>Exactly one file was removed</i18n:singular> and <i18n:plural><span>${number}</span> files were removed</i18n:plural>, you also will shoot yourself in the foot, as in, say French, you have to put the number in there; likewise, Chinese languages use only 1 form for everything; this would be considered "singular" by any test, so things would get ugly anyway...
Thus I propose that it could be documented that the i18n:singular substream is used for catalogue translations if they exist. Furthermore, it could work so, that if ngettext(msgid, msgid_plural, n) produces msgid, i18n:singular would be used directly, if it produces msgid_plural, you would then use i18n:plural contents as such, and if neither match, then map the returned message string to the contents of i18n:singular...
comment:22 follow-up: ↓ 23 Changed 14 years ago by cboos
Exactly. I think the original intent of the code (Pedro, I'd like you to confirm this ;-) ) was that Genshi should match singular msgstr with singular stream, plural msgstr with plural stream. But this leads to all sorts of problems if the streams don't have the same structure, as you mentioned above.
Therefore the documentation should make it clear that i18n:choose can't be (ab)used for trying to generate two different kinds of content in a one vs. many cases (like for example, a single element in the i18n:singular directive and a list or a table in the i18n:plural directive). Rather, it should simply be used for picking the correct grammatical form of a text which must have in both cases the same structure.
The rule you suggested should work just fine.
comment:23 in reply to: ↑ 22 Changed 14 years ago by palgarvio
Replying to cboos:
Exactly. I think the original intent of the code (Pedro, I'd like you to confirm this ;-) ) was that Genshi should match singular msgstr with singular stream, plural msgstr with plural stream. But this leads to all sorts of problems if the streams don't have the same structure, as you mentioned above.
Yes, that was the intention. Been away from it for quite a while, can't remember why that was necessary. Sorry, my time currently is scarce.
comment:24 Changed 14 years ago by hodgestar
I think we need to re-write the choose directive to avoid the use of is_plural. If no one volunteers I'll have a go at it (with the caveat that I'm very new to i18n / gettext).
comment:25 Changed 14 years ago by hodgestar
After discussion with cmlenz and s0undt3ch on #genshi I've created a patch that removes the use of is_plural from i18n:choose. This changes the behaviour of i18n:choose so that:
- The singular form defines the structure of the XML stream returned.
- The plural form is merely a convenience for constructing the fallback plural form passed to ngettext.
An unexpected consequence of this is that it is no longer possible to translate HTML attribute values even in the case of languages with English-like plural forms (it wasn't possible for other languages already). A good example is a title attribute. To fix this we'd need to work attribute values into the formatted text sent to ngettext somehow. This wouldn't make sense for all attribute values, only things like titles and alt texts, I think.
comment:26 Changed 14 years ago by hodgestar
I should mention that four tests fail. Two are cases where the singular form has py:strip attributes and the plural doesn't. Two are cases where title attributes are incorrectly translated.
comment:27 Changed 8 years ago by hodgestar
- Milestone changed from 0.7 to 0.9
Move to milestone 0.9.
fix for garbage strings problem...