
[GCI easy task] Find and fix syntax errors in translated messages
Open, Low, Public

Description

https://translatewiki.net/ contains source and translated messages for many projects, including MediaWiki and many extensions.

The source messages often contain syntax that must be preserved in the translated message, per localisation guidelines; checkers are designed to prevent mistakes.

Translators occasionally either omit or corrupt the syntax in the translation.

This task is to find and fix two translated messages in a Wikimedia repository that have incorrect syntax.

UI search

Manual search

To find syntax errors in translated messages, git clone the repository containing the translations.
Then use tools to look for problems. The simplest tools to use are generic text search programs like grep, or a Windows equivalent.

For example, git clone https://gerrit.wikimedia.org/r/p/mediawiki/core.git (for an anonymous checkout).
In the languages/i18n directory, read the en.json (English source) and qqq.json (message documentation) files to learn about each message. Look for syntax that a translator might break.

The variable syntax is $1, $2, etc. Sometimes translators add a space in the middle, producing "$ 1".

$ git grep '\$ 1'
azb.json:       "blockedtext": "' 'ایستیفاده<U+200C>چی آدی و یا آی پی عنوانینیز قاباغی باغلانیب دیر.'\n\nسیزی باغلایان$ 1. الیله اولوب دیر \nباغلاماق سببی:' $ 2.\n\n* باغلانمانین باشلانان زامانی: $ 8\n* باغلانمانین قورتولان زامانی: $ 6\n* باغلانما مدتی: $ 7\n\nگؤستریلن سببه گؤره ائنگئللئنمئنیزین اویغون اولمادیغینی دوشونورسونوزسه، $ 1 یا دا باشقا بیر [[{{MediaWiki:Grouppage-sysop}}|مدیر]]  ایله بو وضعیتی گؤروشه بیلرسینیز. [[Special:Preferences|ترجیح لرینیز]] قیسمینده اعتبارلی بیر ائ-پوچت اونوانی گیرمئدیسئنیز \"ایستیفاده<U+200C>چییه ائ-پوچت گؤندر\" خصوصیتینی ایستیفاده ائده، ترجیهلرینیز ایمیل عنوانینیزی علاوه ایمیل گؤندرمک حقوقونا صاحب اولاجاقسینیز.\nبو آنکی باغلانما عنوانینیز $ 3، ائنگئللئنمئ نؤمره<U+200C>نیز # $ 5.\nبیر ایداره<U+200C>چی<U+200C>لر وضعیتینیز حاقیندا معلومات آلماق ایستدیگینیزده و یا هر هانسی بیر سورگودا بو معلومات<U+200C>لار لازیم اولا<U+200C>جاق، خاهیش ائدیریک نوت ائدین.",
azb.json:       "autoblockedtext": "\n' 'ایستیفاده<U+200C>چی آدی و یا آی پی عنوانینیز قاباغی باغلانیب دیر.'\n\nسیزی باغلایان$ 1. الیله اولوب دیر \nباغلاماق سببی:' $ 2.\n\n* باغلانمانین باشلانان زامانی: $ 8\n* باغلانمانین قورتولان زامانی: $ 6\n* باغلانما مدتی: $ 7\n\nگؤستریلن سببه گؤره ائنگئللئنمئنیزین اویغون اولمادیغینی دوشونورسونوزسه، $ 1 یا دا باشقا بیر [[{{MediaWiki:Grouppage-sysop}}|مدیر]]  ایله بو وضعیتی گؤروشه بیلرسینیز. [[Special:Preferences|ترجیح لرینیز]] قیسمینده اعتبارلی بیر ائ-پوچت اونوانی گیرمئدیسئنیز \"ایستیفاده<U+200C>چییه ائ-پوچت گؤندر\" خصوصیتینی ایستیفاده ائده، ترجیهلرینیز ایمیل عنوانینیزی علاوه ایمیل گؤندرمک حقوقونا صاحب اولاجاقسینیز.\nبو آنکی باغلانما عنوانینیز $ 3، ائنگئللئنمئ نؤمره<U+200C>نیز # $ 5.\nبیر ایداره<U+200C>چی<U+200C>لر وضعیتینیز حاقیندا معلومات آلماق ایستدیگینیزده و یا هر هانسی بیر سورگودا بو معلومات<U+200C>لار لازیم اولا<U+200C>جاق، خاهیش ائدیریک نوت ائدین.",
azb.json:       "file-info-png-repeat": "$1 {{PLURAL:$ 1|دفعه| دفعه}} اویناتیلدی",
khw.json:       "databaseerror-function": "فنکشن: $ 1",
khw.json:       "databaseerror-error": "خرابی: $ 1",
luz.json:       "copyright": "مطلب دومن $ 1 هس نکه خلاف هونو ذکر وابی.",
ses.json:       "hiddencategories": "Moɲoo woo {{PLURAL:$1|dumi tugante$ 1}} no m'a may:",
sq.json:        "databaseerror-query": "\nPyetje: $ 1",
sq.json:        "no-null-revision": "I pamundur krijimi rishikimi  i ri për faqen bosh \"$ 1\"",
sw.json:        "apihelp-no-such-module": "Moduli \"$ 1\" haikupatikana.",
ur.json:        "databaseerror-function": "فنکشن: $ 1",
ur.json:        "databaseerror-error": "خرابی: $ 1",
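
The same search can be sketched in Python, which also catches variants such as a tab or several spaces after the dollar sign. This is only a minimal sketch, not an existing translatewiki.net checker: the names `find_spaced_variables` and `scan_directory` and the regular expression are assumptions made for illustration, and the code assumes each JSON file is a flat object of message keys (plus an optional "@metadata" entry).

```python
import json
import re
from pathlib import Path

# Assumption: a "broken" variable is a dollar sign, whitespace, then a digit.
SPACED_VAR = re.compile(r"\$\s+\d")


def find_spaced_variables(messages):
    """Return the keys of messages that contain a variable such as "$ 1"."""
    return sorted(
        key
        for key, text in messages.items()
        # Skip non-string values such as the "@metadata" block.
        if isinstance(text, str) and SPACED_VAR.search(text)
    )


def scan_directory(directory="."):
    """Map each JSON file name to the message keys that look broken."""
    report = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            keys = find_spaced_variables(json.load(handle))
        if keys:
            report[path.name] = keys
    return report
```

Calling `scan_directory("languages/i18n")` from a checkout returns a dictionary of file names and suspicious keys. A message that legitimately contains a dollar amount such as "$ 5" would be a false positive, so each hit still needs a human look.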

Another approach is to look for keywords.

For example, the source English message "viewcount" is "This page has been accessed {{PLURAL:$1|once|$1 times}}.". It uses the PLURAL magic word, which implements grammatical number. See also Plural.

Using grep, we can quickly see all of the translations that omit PLURAL:

$ cd languages/i18n
$ grep '"viewcount"' *.json | egrep -v '(qqq.json|PLURAL)'
cv.json:	"viewcount": "Ку страницăна $1 хут пăхнă.",
ff.json:	"viewcount": "Ngoo hello yillaama laabi $1.",
gan-hans.json:	"viewcount": "个页拖人眵嘞$1回。",
gan-hant.json:	"viewcount": "箇頁拕人眵哩$1回。",
gn.json:	"viewcount": "Esta página ha sido visitada $1 veces.",
hak.json:	"viewcount": "邇隻頁面已經分人瀏覽過$1次。",
kk-arab.json:	"viewcount": "بۇل بەت $1 رەت قاتىنالعان.",
kk-latn.json:	"viewcount": "Bul bet $1 ret qatınalğan.",
lzh.json:	"viewcount": "此頁$1閱矣",
nan.json:	"viewcount": "Chit ia̍h kàu taⁿ, hō͘ lâng khoàⁿ $1 pái.",
th.json:	"viewcount": "มีการเข้าถึงหน้านี้ $1 ครั้ง",
to.json:	"viewcount": "Naʻe laua he pēsí ni tuʻo $1.",
wuu.json:	"viewcount": "箇頁望過$1垡。",
yue.json:	"viewcount": "呢一頁已經有$1人次睇過。",
zh-hans.json:	"viewcount": "本页面已经被访问过$1次。",
zh-hant.json:	"viewcount": "此頁面已被檢視過 $1 次。",

Some of these languages may not need PLURAL support for this message. See http://www.unicode.org/cldr/charts/latest/supplemental/language_plural_rules.html for the official list of plural rules for all languages.
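
For anyone who prefers a script to the grep pipeline, the keyword search above could be sketched as follows. The function names and the `skip` parameter are hypothetical, and the sketch assumes every JSON file is a flat object of message keys. It reports only that a keyword is absent, not whether the language actually needs it.

```python
import json
from pathlib import Path


def lacks_keyword(messages, key, keyword="PLURAL"):
    """True if the message exists but does not mention the keyword."""
    text = messages.get(key)
    return isinstance(text, str) and keyword not in text


def files_lacking_keyword(directory, key, keyword="PLURAL", skip=("qqq.json",)):
    """List JSON files whose translation of `key` omits `keyword`."""
    hits = []
    for path in sorted(Path(directory).glob("*.json")):
        # qqq.json holds message documentation, not a translation.
        if path.name in skip:
            continue
        with path.open(encoding="utf-8") as handle:
            messages = json.load(handle)
        if lacks_keyword(messages, key, keyword):
            hits.append(path.name)
    return hits
```

For example, `files_lacking_keyword("languages/i18n", "viewcount")` would produce a list of file names comparable to the grep output above.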

Unofficial language information

More information about each language can be found on the Wikipedia article about the language.

If Wikipedia doesn't have the answer, you might look for information about languages on the Linguistics StackExchange, such as this discussion about South-East Asian plurals. If you can't find an existing question on Linguistics StackExchange relating to your problem, you can ask a new question on Linguistics StackExchange.

Examples

These are examples of this task being done:

List of tools that can help to find these problems:

Event Timeline

jayvdb created this task. Oct 14 2016, 6:23 AM
Restricted Application added a subscriber: Aklapper. Oct 14 2016, 6:23 AM

I think this would be a fun task. Very educational. I am happy to mentor it. We would need a few mentors who have broad experience in this area, in order to QA the corrections.

One aspect that needs to be carefully planned is avoiding bad quality fixes to messages, especially attempts to complete the task that result in changes to translations that are not submitted for review.
I suspect that the task should not encourage any live changes to TranslateWiki. Maybe the participant should create a task in Phabricator describing the correction?

This task would prepare participants for a more advanced task of improving the TranslateWiki rules to detect and prevent similar problems occurring again.

Nemo_bis added a subscriber: Nemo_bis.

The task description only mentions MediaWiki core messages, so I'm moving this task to that component. I still have no idea what this task is asking.

@jayvdb: Thanks for proposing this task!
It's still a bit unclear to me how exactly a contributor would "fix syntax errors". Doesn't this require being a nearly native speaker of the language to properly fix the grammar / translation, and then especially a change reviewer who also knows the language?

@jayvdb: Could you answer my last comment please?

Sorry @Aklapper, I have been rather busy. Syntax errors are syntax errors; not grammatical or other types of linguistic errors.

For example, see this syntax error detector that the wikipedia-ios team have built, and the accompanying fixer for one of those syntax problems.

Also you can see https://github.com/BesutKode/uni-task-1 for an example of uni students solving this type of problem by first finding some errors and then building their own tools to detect more problems. That uni task also allowed grammatical errors, as this is quite easy with existing tools and a decent amount of maturity... so grammatical errors may not be suitable for high school students.

I recently ran a similar task with high school students here in Indonesia, for only syntax errors, with much less guidance than the above university student task and the high school children did quite well at identifying the problems with only a little mentoring required to get them started.

A remarkable difference is that we run checkers before the fact.

Nemo_bis updated the task description. Nov 14 2016, 9:40 PM

I've tried to improve the description. I suggest not using the lack of PLURAL as an example: that's an eminently linguistic problem for translators, which has nothing to do with syntax. A syntax error would be to write {{PLURALE}} or {{PLURAL|uno=1}}.
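
A rough sketch of how such malformed uses might be flagged follows. It is not one of the existing translatewiki.net checkers, and it rests on a loudly stated assumption: a well-formed use looks like {{PLURAL:$1|...}}, so anything else starting with "{{PLURAL" is reported. The name `malformed_plurals` is made up for this example.

```python
import re

# Assumption: a well-formed use looks like {{PLURAL:$1|...}}; anything else
# that starts with "{{PLURAL" is reported as suspicious.
PLURAL_START = re.compile(r"\{\{\s*PLURAL", re.IGNORECASE)
PLURAL_VALID = re.compile(r"\{\{\s*PLURAL:\s*\$\d+\s*\|", re.IGNORECASE)


def malformed_plurals(text):
    """Return the start offsets of {{PLURAL... uses that look malformed."""
    return [
        match.start()
        for match in PLURAL_START.finditer(text)
        # Re-check the same position against the stricter pattern.
        if not PLURAL_VALID.match(text, match.start())
    ]
```

Messages that pass something other than a numbered variable to PLURAL would be reported too, so the result is a list of candidates to review rather than a list of confirmed errors.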

Nemo_bis triaged this task as Low priority. Nov 14 2016, 9:43 PM
Nemo_bis updated the task description. Nov 14 2016, 9:48 PM

One way to make this task less problematic is for it to be find only, i.e. no fixing. Many fixes require knowledge of the language, while detecting a bug can be done without that knowledge. The intention is that this task would be followed by a task to improve the TWN checkers.

Re PLURAL, happy to switch to a simpler example. Suggestions welcome.
For the sake of translation linting, it is better for PLURAL to be used even if the language doesn't need it. It is more common that it has been omitted because it is easier to omit it, than that the language doesn't require it.

Some errors introduced by Apertium might be easily avoided; that is, variables like $link and so on should not be translated. Could the Apertium instance be tweaked to avoid translating tokens which begin with a dollar?

@Psychoslave: That would be a topic for a different venue - this task is about finding and fixing existing syntax errors, as part of Google-Code-In-2016.

Ok, sorry for the misunderstanding.

A remarkable difference is that we run checkers before the fact.

Could you clarify this?
Most similar translation frameworks also have checkers in place (e.g. weblate and Pootle, and the translate-toolkit checks are also used by Mozilla's Pontoon), but the ability for humans to find new ways to create buggy translations that slip past these checks is amazing, far exceeding the time available for translation infrastructure programmers to build better rules and checkers. But maybe, we can induct some GCI participants into the role of creating better rules and checkers...

For example, see this syntax error detector that the wikipedia-ios team have built, and the accompanying fixer for one of those syntax problems.

@Nemo_bis, do we have checkers for that? (and where can I find a list of existing checkers?) If so, those shell scripts in the iOS app repo can be removed. If not, perhaps we should create a task to migrate those iOS app checkers into the TWN infrastructure, so that any further improvements to those rules benefit all projects instead of only benefiting the iOS app. I have a probable GCI participant who is interested in tackling this.

Restricted Application added a subscriber: pywikibot-bugs-list. Nov 17 2016, 6:48 AM
jayvdb added a subscriber: Xqt. Nov 17 2016, 8:12 AM

@Xqt, I believe you had some script to find problems in the pywikibot messages, and we have unittests in the Pywikibot-i18n repo run by travis

Xqt added a comment. Nov 17 2016, 9:20 PM

@jayvdb: I've started https://gerrit.wikimedia.org/r/#/c/221370/ some time ago. Might be I should continue there. In addition there was a suggester added to twn with https://gerrit.wikimedia.org/r/#/c/221610/ according to T98004.

do we have checkers for that? (and where can I find a list of existing checkers?)

In the translatewiki repository. It's certainly possible that wikipedia-ios folks are using some non-standard methods outside translatewiki.net, I don't know why they have those bash scripts.

I've read https://translatewiki.net/wiki/Localisation_guidelines. But there are 2 questions still bothering me.

  1. Exactly what do I need to do in this task?
  2. How am I gonna find errors? I mean is there any finder like thing?

Thanks in advance.

jayvdb updated the task description. Dec 31 2016, 5:57 AM
jayvdb updated the task description. Dec 31 2016, 6:00 AM

Hiya @PratyyaGhosh, the task is to fix a technical bug in the translated messages (i.e. not do translation). The task gives some examples on how to find them, and I have just added some specific cases where the messages need fixing.
But ideally you get creative in trying to find syntax errors.
And if you find a syntax problem which occurs frequently, we can improve our tools to prevent it occurring again.

jayvdb updated the task description. Dec 31 2016, 6:49 AM
MtDu added a subscriber: MtDu. Dec 31 2016, 6:51 AM

@PratyyaGhosh

  1. You need to find a syntax error in a message. To do this, first find a message in the en.json file that translators could get confused by when translating. For example, the 'tagline' message in core/languages/i18n/en.json has a {{SITENAME}} that should NOT be edited during translation. However, some languages have done so. You must then go to translatewiki.net, create an account, and fix this error. Some messages may have obvious syntax errors that you are unable to fix because you are not a native speaker of the language. That is ok. Comment on the issue you found on the phab task here. A sample edit is here https://translatewiki.net/w/i.php?title=MediaWiki:Aboutsite/luz&diff=prev&oldid=7189506
  2. In order to find errors, first look through the messages in the en.json files to see if there is any possible source of error. Then confirm that it is an error by reading the documentation for that message in the qqq.json file. Once you have done that, use git grep (https://git-scm.com/docs/git-grep) to search the i18n folder for languages that have a syntax error.

Let's take this for example:
grep '"viewcount"' *.json | egrep -v '(qqq.json|PLURAL)'
This command will look for all 'viewcount' messages in all json files in the folder (languages/i18n/..) but exclude (egrep -v) all finds that already have PLURAL in them, as well as the qqq.json file.
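
The idea of checking that tokens such as {{SITENAME}} or $1 survive translation can also be sketched in a few lines of Python. This is an illustration only: `missing_tokens` is a hypothetical helper, and the pattern assumes that the tokens worth comparing are $1-style variables and upper-case magic words written like {{SITENAME}}.

```python
import re

# Assumption: tokens worth comparing are $1-style variables and
# upper-case magic words written like {{SITENAME}}.
TOKEN = re.compile(r"\$\d+|\{\{[A-Z]+\}\}")


def missing_tokens(source, translation):
    """Return tokens present in the source message but absent from the translation."""
    wanted = set(TOKEN.findall(source))
    found = set(TOKEN.findall(translation))
    return sorted(wanted - found)
```

Feeding it the English text and a translation of the same key gives the tokens that were dropped or altered. A translation that deliberately leaves out a variable would also be reported, so read the qqq.json documentation before treating a hit as an error.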

If anything needs more clarification, feel free to ask me or @jayvdb or just post your question here.

Good luck!

GCI 2016 is over - either this should be turned into a specific good first task (with criteria that allow declaring it as "resolved" at some point), or the status should be set to declined?

Nemo_bis closed this task as Resolved. Jan 30 2017, 3:30 PM
Nemo_bis claimed this task.

I think we can close as resolved, since some patches were merged. Can always have more (ideally more specific) such tasks in the future.

jayvdb reopened this task as Open. Oct 23 2017, 6:26 AM

This task was ok last year, and it is good to keep improving the same template task each year.

Maybe @MtDu might like to help mentor it this year? ;-)

IIRC, the "plural" example was more problematic than helpful.
We should use examples of this type of task done in 2016 instead of the plural example.

MtDu added a comment. Oct 23 2017, 1:09 PM

Yes, I would be glad to! @jayvdb

Ebe123 added a subscriber: Ebe123.

I'd be willing to mentor. @MtDu?

Aklapper removed Nemo_bis as the assignee of this task. Oct 7 2018, 1:25 AM
Aklapper updated the task description. Oct 7 2018, 1:38 AM
Aklapper updated the task description. Oct 7 2018, 1:42 AM
rafidaslam renamed this task from [CGI template easy task] Find and fix syntax errors in translated messages to [GCI template easy task] Find and fix syntax errors in translated messages. Oct 29 2018, 4:29 PM
rafidaslam awarded a token.
Shreyasminocha renamed this task from [GCI template easy task] Find and fix syntax errors in translated messages to [GCI easy task] Find and fix syntax errors in translated messages. Dec 3 2018, 8:55 AM