Investigate translation issues with pluralised strings [15hr]
Closed, Resolved · Public · Spike

Description

In the translation step of our builds we get a number of error messages, usually about variables missing from pluralised strings.

See the latest master build, for example: https://travis-ci.com/github/WikipediaLibrary/TWLight/builds/226694645

This might be stopping the latest translations making it into production, per T283222.

Questions

  • Why are these errors common? Is this a problem with how the strings are presented in TranslateWiki, or a technical issue?
  • Are these errors the reason that the latest Chinese language translations aren't making it to production, or another issue?

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". May 24 2021, 1:26 PM
Restricted Application added subscribers: Sadads, Aklapper.
Samwalton9-WMF renamed this task from [SPIKE] Investigate translation issues with pluralised strings to Investigate translation issues with pluralised strings [2hr]. May 24 2021, 1:26 PM
Samwalton9-WMF triaged this task as Medium priority.

From a cursory glance at TranslateWiki, it looks like we might not be pulling through the pluralised version of some strings. Compare the zh-hans TranslateWiki entry with what was ultimately added to the django.po file:

  • TranslateWiki: {{PLURAL:GETTEXT|一个待处理的申请。|%(counter)s个待处理的申请。}}
  • django.po:
msgid "One pending application."
msgid_plural "%(counter)s pending applications."
msgstr[0] "一个待处理的申请。"

The second half of the translation seems to be missing, which prompts the error message?

https://translatewiki.net/w/i.php?title=Special:Translate&showMessage=wikipedia-library-7e8041-%3D7B%3D7BPLURAL%3AGETTEXT%3D7C%3D25%28approved_co&group=wikipedia-library-website&language=zh-hans&filter=&optional=1&action=translate

{{PLURAL:GETTEXT|%(approved_count)s approved applications.|%(counter)s approved applications.}}

zh-hans:

{{PLURAL:GETTEXT|%(approved_count)s个批准的申请。|%(counter)s个批准的申请。}}

One is "approved_count" and the other is "counter", is this correct?

That's correct, we should probably fix that because I'm pretty certain they're the same variable :)

In T283502#7107570, @Samwalton9 wrote:
msgid "One pending application."
msgid_plural "%(counter)s pending applications."
msgstr[0] "一个待处理的申请。"

The second half of the translation seems to be missing, which prompts the error message?

See https://translatewiki.net/wiki/Special:ExportTranslations?group=wikipedia-library&language=zh-hans&format=export-to-file

The po file manually exported from TranslateWiki is missing the second half of the translation.

Chinese doesn't have multiple plural forms. It's a configuration issue on our side that our translation interface is not flagging that as an error for Chinese.

It seems the underlying error is that the variable name does not match in the singular and plural form for English.
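To make the plural-form point concrete, here is a minimal Python sketch. It assumes the Plural-Forms headers commonly used for Chinese (`nplurals=1; plural=0;`) and English (`nplurals=2; plural=(n != 1);`); the helper names are hypothetical and are not part of TWLight or translatewiki. It only reads the `nplurals` count, which is the number of `msgstr[n]` entries a pluralised entry is expected to carry.

```python
import re

# Plural-Forms headers as they commonly appear in gettext catalogs (assumed values).
ZH_HEADER = "nplurals=1; plural=0;"
EN_HEADER = "nplurals=2; plural=(n != 1);"


def expected_plural_count(plural_forms):
    """Return the nplurals value declared in a Plural-Forms header."""
    match = re.search(r"nplurals\s*=\s*(\d+)", plural_forms)
    if match is None:
        raise ValueError("no nplurals declaration found")
    return int(match.group(1))


def has_expected_forms(plural_forms, msgstr_forms):
    """Check that a pluralised entry carries exactly nplurals translations."""
    return len(msgstr_forms) == expected_plural_count(plural_forms)


# The zh-hans entry quoted in this task has only msgstr[0].
zh_entry = ["一个待处理的申请。"]
print(has_expected_forms(ZH_HEADER, zh_entry))  # one form is enough for nplurals=1
print(has_expected_forms(EN_HEADER, zh_entry))  # a two-form language would need msgstr[1]
```

Under these assumptions, a single `msgstr[0]` is a complete translation for zh-hans, which fits the observation that the missing "second half" is not itself the error.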

Change 694485 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[translatewiki@master] Enhance validators for Wikipedia Library

https://gerrit.wikimedia.org/r/694485

Change 694485 merged by jenkins-bot:

[translatewiki@master] Enhance validators for Wikipedia Library

https://gerrit.wikimedia.org/r/694485

@Nikerabbit We got new translations merged earlier today but nothing substantial seems to have updated besides some new translations. Did you expect changes for pluralised strings in our repo based on your change, or is that a TranslateWiki interface update?

This is mostly on our side to prevent new problematic translation. No change expected for Chinese, as it doesn't have multiple plural forms.

Aside, the "empty" (no translations) po file for lv is causing some issues for us. Any idea why it got created?

Today some script generated another of these files without translations and with "fuzzy" in the header: https://github.com/WikipediaLibrary/TWLight/commit/9e6a2eb48c9040d694321c5fef5c1ec75954c6ef#diff-ddb974b473f1626643216959a21bf886e5bb4a0bf93af8b476e306979008ab3cR6

I need to drop Wikipedia Library from imports and exports until this is resolved.

Yes, those fuzzy flags are supposed to be getting dropped.
👀

@Nikerabbit it looks like we are dropping the fuzzy header from the header of all non-english po files. It looks like all our script did in your linked commit was split the messages that it thought were too long. I do see that we have some message strings marked as fuzzy, is that what you're talking about? Basically, I don't understand the problem well enough to make any changes to resolve it.

The file I linked (locale/it/LC_MESSAGES/django.po) is newly added in that commit. Apparently github anchors do not expand automatically, so you have to manually jump to that file.

I see what happened here. The it locale got added by translatewiki for the tag names using the json workflow and our script filled in the missing gettext files since the locale is now in use. That's definitely expected behavior on our end. Is the fuzzy header the blocking issue here? Right now our cleanup script just makes sure the flag doesn't get added on update, but I can update it to strip the fuzzy header from new gettext translation files too.

It is a problem for us insofar as our system tries to import the English texts as translations for those languages. I have a patch in review that should mitigate this issue, but there may be more complications to look into, such as an unexpected number of plural forms.

I can do some testing on our end to see what the site will do if I suppress the creation of those empty translation files. It sounds like if we can roll without them, that will be the simplest fix.

Okay, I've verified that we can toss those files and the site will still work. It's a fairly straightforward change, I'll just need to make sure that our script compiles the new translations when they get added.

Another issue I discovered: it looks like translatewiki hasn't been respecting our locale whitelist:
https://wikipedialibrary.wmflabs.org/i18n-whitelist
meaning that some localizations have been created that we can't display.
However, we've done numerous upgrades since all of that logic was implemented, so I decided to retest. It looks like we can now use locales as long as we have the files for them, even if they aren't in django core.
I'm going to update the platform to use those translations and drop that whitelist view altogether, since it is now obsolete.

@Nikerabbit I updated the script to prevent it from creating new locales in this situation and deleted the empty it gettext files. I also dropped that whitelist and moved to accept all locales that we get in the locale directory. I think we're good to re-enable?

I'll check tomorrow before regular updates.

Update

I've done a fair amount of playing with this in the qqq namespace, and at least a subset of the errors are happening when django thinks the message should have separate singular/plural strings, even when it doesn't make sense.
For example, this throws an error:

{% blocktrans trimmed %}
    Click 'confirm' to renew your application for {{ partner }}
{% endblocktrans %}

Execution of msgfmt failed: /app/locale/qqq/LC_MESSAGES/django.po:721: a format specification for argument 'partner' doesn't exist in 'msgstr'

whereas this does not:

{% blocktrans trimmed %}
    Click 'confirm' to renew your application for {{ partner }}
  {% plural %}
    Click 'confirm' to renew your application for {{ partner }}
{% endblocktrans %}

At this point, I'm not really clear on why django thinks we need plural formats on these messages when we're not even using the ngettext function, but here we are. There are other messages that only have singular values and work just fine.

Here's what I've done so far to make qqq compile:
https://github.com/WikipediaLibrary/TWLight/compare/jason-poc-T283502
To be clear, I don't think this is the correct solution for this kind of problem; I'm just showing my work.

jsn.sherman renamed this task from Investigate translation issues with pluralised strings [2hr] to Investigate translation issues with pluralised strings [6hr]. Jun 14 2021, 6:39 PM

Update:

In a fresh, empty django 3.1.12 project, this example problem template string does not cause any errors. I'm beginning to suspect that it may be a problem with some of the format specifications created by and for the translatewiki integration.

jsn.sherman renamed this task from Investigate translation issues with pluralised strings [6hr] to Investigate translation issues with pluralised strings [9hr]. Jul 13 2021, 6:15 PM

I'm going to add back in progressively more bits from our real project until I see the problem.

I believe I've got to the bottom of this. First of all, things are going to mostly just work in English because that's what all our message strings and the tools themselves are geared towards.

There are a few underlying reasons for the problems here:

  • The qqq locale is set up with a non-English pluralization scheme: Plural-Forms: nplurals=1; plural=0; which basically means there is no difference between singular and plural. There may be a good reason for this configuration, but it does expose the issue when running the translation steps for this namespace.
  • We're kind of misusing blocktranslate in our templates. It looks like we should not include variables in these blocks unless we want to feed in values to be translated; the fact that our current setup works is kind of emergent behavior on the django side. It's the variables that are causing the errors. We've talked about splitting up some of these blocks before, and I think that could be a good choice when the block begins or ends with a variable. E.g., changing
{% blocktrans trimmed %}
    Click 'confirm' to renew your application for {{ partner }}
{% endblocktrans %}

to

{% blocktrans trimmed %}
    Click 'confirm' to renew your application for
{% endblocktrans %}
{{ partner }}

or

{% trans "Click 'confirm' to renew your application for" %} {{ partner }}

resolves the issue. Basically, we'd want to figure out the best way to note the sentence structure in the translation comment, since the translator would only see Click 'confirm' to renew your application for as the message string. I believe dealing with this kind of issue will fix most of the errors.

  • We may also have some legitimate pluralization problems, e.g. places where the message might be singular or plural. In that case, adding the plural form of the impacted messages should resolve the issue.
jsn.sherman renamed this task from Investigate translation issues with pluralised strings [9hr] to Investigate translation issues with pluralised strings [12hr]. Jul 14 2021, 4:35 PM

We're kind of misusing blocktranslate in our templates.

I'm a bit confused by this one. I thought one of the purposes of blocktranslate was to contain variables? Additionally, we can't know if the variable will appear at the end of the sentence in another language. In this example, the literal translation in another language might be more like "To renew your application for ..., click 'confirm'". It's not clear to me if/how this would work.

Yeah, I realized last night that I still had more investigation to do, which is why I just moved it back to in progress. You have posed exactly the right questions:

I'm a bit confused by this one. I thought one of the purposes of blocktranslate was to contain variables?

The django docs spend so much time on how to feed values to those blocktranslate variables that I took that as the only valid use case; i.e. I became concerned that leaving them without some kind of placeholder was a problem. But the docs don't actually say that you shouldn't do what we're doing.
Avoiding our pattern does prevent the issue from arising, though. Basically, this was a misassessment on my part, but it did cause me to become much more familiar with the ugettext/ngettext switchable machinery.

Additionally, we can't know if the variable will appear at the end of the sentence in another language.

Yep, this dawned on me last night, which is why I realized that we were mostly not going to be able to use my proposed workaround.

jsn.sherman renamed this task from Investigate translation issues with pluralised strings [12hr] to Investigate translation issues with pluralised strings [15hr]. Jul 15 2021, 3:20 PM

The reason my various changes that forced different pluralization handling fixed the errors is that they caused the messages to no longer match. I can see that there are several kinds of problems that all lead to very similar error output.

Improper handling of percentage signs in translations. This is a translatewiki issue
from: locale/bcl/LC_MESSAGES/django.po

#: TWLight/resources/templates/resources/partner_detail.html:182
#, python-format
msgid ""
"%(object)s allows a maximum of %(excerpt_limit)s words or "
"%(excerpt_limit_percentage)s%% of an article be excerpted into a Wikipedia "
"article."
msgstr ""
"%(object)s nagtutugot hanggang sa %(excerpt_limit)s tataramon o "
"%%(excerpt_limit_percentage)s kan artikulo na pwedeng isipi sa artikulo kan "
"Wikipedia."

results in this error:

Execution of msgfmt failed: /app/locale/bcl/LC_MESSAGES/django.po:1060: a format specification for argument 'excerpt_limit_percentage' doesn't exist in 'msgstr'

%(excerpt_limit_percentage)s should be followed by %%, which renders as a literal % in the message:

#: TWLight/resources/templates/resources/partner_detail.html:182
#, python-format
msgid ""
"%(object)s allows a maximum of %(excerpt_limit)s words or "
"%(excerpt_limit_percentage)s%% of an article be excerpted into a Wikipedia "
"article."
msgstr ""
"%(object)s nagtutugot hanggang sa %(excerpt_limit)s tataramon o "
"%(excerpt_limit_percentage)s%% kan artikulo na pwedeng isipi sa artikulo kan "
"Wikipedia."

which resolved the error in this case. Basically the translation needs to be fixed in translatewiki; I've already fixed some of these on the translatewiki side while trying to figure out where the issue is.
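The difference between the two msgstr variants can be reproduced with plain Python %-formatting. The sketch below is only an approximation of the kind of check msgfmt performs for python-format entries, not its actual implementation, and the helper names are hypothetical. It collects `%(name)s` placeholders while treating `%%` as an escaped percent sign, then reports names present in msgid but absent from msgstr. The strings are shortened versions of the bcl entry above.

```python
import re

# Either an escaped percent sign (%%) or a named python-format placeholder.
TOKEN = re.compile(r"%%|%\((\w+)\)[a-zA-Z]")


def placeholder_names(text):
    """Collect the names of %(name)s placeholders, skipping escaped %% pairs."""
    return {m.group(1) for m in TOKEN.finditer(text) if m.group(1)}


def missing_in_translation(msgid, msgstr):
    """Return placeholder names used in msgid but absent from msgstr."""
    return placeholder_names(msgid) - placeholder_names(msgstr)


msgid = "%(excerpt_limit)s words or %(excerpt_limit_percentage)s%% of an article"
broken = "%(excerpt_limit)s tataramon o %%(excerpt_limit_percentage)s kan artikulo"
fixed = "%(excerpt_limit)s tataramon o %(excerpt_limit_percentage)s%% kan artikulo"

print(missing_in_translation(msgid, broken))  # the percentage placeholder is missing
print(missing_in_translation(msgid, fixed))   # nothing is missing

# Plain %-formatting shows why: %% collapses to a literal %, so the text
# after it is no longer a placeholder.
values = {"excerpt_limit": 50, "excerpt_limit_percentage": 10}
print(fixed % values)
print(broken % values)
```

In the broken variant the leading `%%` is consumed as an escaped percent sign, so `(excerpt_limit_percentage)s` is left as literal text and the placeholder counts as missing.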

Improper handling of counters in templates. This is a TWLight issue.
from: TWLight/emails/templates/emails/coordinator_reminder_notification-body-html.html
The singular and plural forms must use the same variable. This is what the translatewiki folks were trying to tell us a while back, but the other issue was masking this during my investigation.

{% blocktrans count counter=approved_count trimmed %}
  {{ approved_count }} approved applications.
{% plural %}
  {{ counter }} approved applications.
{% endblocktrans %}

should be:

{% blocktrans count counter=approved_count trimmed %}
  {{ counter }} approved applications.
{% plural %}
  {{ counter }} approved applications.
{% endblocktrans %}

These are very easy to find and fix on our side, so I'm happy to open up a fix pr on a separate phab issue.
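As a rough illustration of how such blocks could be found, here is a minimal Python sketch. It assumes the body of a blocktrans block has already been extracted as a string; the function name and the heuristic itself are hypothetical and not part of TWLight. It flags variables that appear in the singular branch but not in the plural branch, which is the pattern shown above. A singular branch with no variable at all, such as "One pending application.", is not flagged.

```python
import re

# {{ variable }} references and the {% plural %} separator in a blocktrans body.
VARIABLE = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")
PLURAL_TAG = re.compile(r"\{%\s*plural\s*%\}")


def singular_only_variables(block_body):
    """Return variables used before {% plural %} but not after it."""
    parts = PLURAL_TAG.split(block_body, maxsplit=1)
    if len(parts) != 2:
        raise ValueError("block has no {% plural %} branch")
    singular, plural = parts
    return set(VARIABLE.findall(singular)) - set(VARIABLE.findall(plural))


broken_body = """
  {{ approved_count }} approved applications.
{% plural %}
  {{ counter }} approved applications.
"""
fixed_body = """
  {{ counter }} approved applications.
{% plural %}
  {{ counter }} approved applications.
"""

print(singular_only_variables(broken_body))  # approved_count is only in the singular
print(singular_only_variables(fixed_body))   # both branches agree
```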

An aside:
We're going to continue to have errors on the qqq namespace because the way it's used looks exactly like the typo translation case. For example:

#. Translators: This message is displayed on the page where users can confirm their renewal. Please do not translate partner.
#: TWLight/applications/templates/applications/confirm_renewal.html:13
#, python-format
msgid "Click 'confirm' to renew your application for %(partner)s"
msgstr "\"Confirm\" is {{msg-wm|Wikipedia-library-04a212-Confirm}}"

Note that msgstr doesn't contain the %(partner)s variable. The django tooling treats this as a mistake, even though it doesn't necessarily make sense to expect equivalent content in qqq since it's used for documentation.
🤷

If it is not easy to exclude qqq from regular processing, and you don't need to update it via git, then we could just not export it and have it only managed in translatewiki.net.

Also, for the mistypes, our validation framework is able to prevent those from being created if it is properly configured.

In this case, it seems that sole % should be prevented if not followed by (. Is that right?

If it is not easy to exclude qqq from regular processing, and you don't need to update it via git, then we could just not export it and have it only managed in translatewiki.net.

I'd say let's just leave it for now. Our script is configured to continue on errors, so this just adds a little noise to our build logs. We can always silence errors for that namespace if it's bothersome in CI/CD.

In this case, it seems that sole % should be prevented if not followed by (. Is that right?

After some regex searching in our files that looks correct for messages. % should either be followed by ( or %.
The file headers are another story.
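The rule discussed here (a % must be followed by ( or %) can be sketched as a small Python check. This is only an illustration of the rule, not the translatewiki validator, and the function name is hypothetical. The scan consumes `%%` as a pair, so the second sign of an escaped percent is not misread as a stray one.

```python
import re

# A percent sign, optionally followed by the "%" or "(" that makes it valid.
PERCENT = re.compile(r"%([%(])?")


def stray_percent_positions(message):
    """Return the index of every % that is not followed by ( or %."""
    return [m.start() for m in PERCENT.finditer(message) if m.group(1) is None]


print(stray_percent_positions("%(excerpt_limit_percentage)s%% of an article"))
print(stray_percent_positions("50% of an article"))
print(stray_percent_positions("%%(excerpt_limit_percentage)s kan artikulo"))
```

Note that the misplaced `%%(excerpt_limit_percentage)s` from the bcl translation passes this check, because every % there is followed by % or (. Catching that case would need a comparison of placeholder names between msgid and msgstr as well.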

Per T286728#7220930 the initial errors with pluralised strings seem to be fixed. This has now uncovered a next level of errors in our translation files, which I believe I've summarised at T283222.

After some regex searching in our files that looks correct for messages. % should either be followed by ( or %.

@Nikerabbit Is this a validation you're able to add?

We're no longer seeing red errors (besides those in qqq - T287094) in our translation files.

Are these errors the reason that the latest Chinese language translations aren't making it to production, or another issue?

I've just tested zh-hans again as reported in T283222, and it seems that most content is still falling back to English despite a 100% translation. Is it obvious why this is?

After some regex searching in our files that looks correct for messages. % should either be followed by ( or %.

@Nikerabbit Is this a validation you're able to add?

It turns out that this needs additional development as none of the existing validators support this case, so it will take longer to implement.