
Tool for mass import of external translations
Closed, Resolved | Public | 4 Estimated Story Points | Feature

Description

Problem

We have content with translations that should be imported into a translatable page. Importing it manually, one translation at a time, takes too much time. It should be possible to import translations from a structured format.

Possible solution

Implement a maintenance script that takes a CSV file as input. Each column is a language, with the language code as its header. Each row holds the translations of one translation unit in the different languages, and the first column holds the translation unit id. To make it easier to match translations to translation unit ids, provide an option to export a template CSV file with the translation unit ids in the first column and the message definitions in the second column.
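A minimal sketch of what such a template export could look like, written in Python purely for illustration (the actual script would be a PHP maintenance script). The function name `build_template_csv`, the header labels, and the sample unit ids are assumptions, not part of the proposal.

```python
import csv
import io


def build_template_csv(units, languages):
    """Return CSV text with unit ids in column 1, definitions in column 2,
    and one empty column per requested language code."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Header labels are hypothetical; only the column order follows the proposal.
    writer.writerow(["Translation unit", "Message definition", *languages])
    for unit_id, definition in units.items():
        # Language cells are left empty for the translators to fill in.
        writer.writerow([unit_id, definition, *[""] * len(languages)])
    return buffer.getvalue()


template = build_template_csv(
    {"Main_Page/1": "Hello World", "Main_Page/2": "Welcome, reader"},
    ["es", "fr"],
)
print(template)
```

Using a CSV library rather than joining strings by hand matters here because message definitions can themselves contain commas, quotes, or line breaks.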

This script should work for any type of message group, but translatable pages are expected to be the most common use case.

When importing, "translations" for the source language are ignored. Empty values are ignored. Existing translations are not overwritten unless the --overwrite option is given.
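The three rules above can be sketched as a small filter. This is an illustration in Python under stated assumptions, not the script's code: `select_translations` and its parameters are hypothetical names, and treating whitespace-only cells as empty is an assumption the task does not spell out.

```python
def select_translations(row, source_language, existing, overwrite=False):
    """Pick the translations of one unit that should be imported.

    row: dict mapping language code -> text from the CSV
    existing: set of language codes that already have a translation
    """
    selected = {}
    for language, text in row.items():
        if language == source_language:
            continue  # "translations" for the source language are ignored
        if not text.strip():
            continue  # empty values are ignored (whitespace-only is an assumption)
        if language in existing and not overwrite:
            continue  # existing translations are kept unless overwrite is requested
        selected[language] = text
    return selected


row = {"en": "Hello", "fr": "Bonjour", "es": "", "it": "Ciao"}
print(select_translations(row, "en", existing={"it"}))
print(select_translations(row, "en", existing={"it"}, overwrite=True))
```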

The script must validate language codes and translation unit ids and not import anything if there are issues with them. The script should perform Unicode normalization on the input text.
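A rough sketch of the all-or-nothing validation followed by normalization, in Python for illustration only. The function name, its parameters, and the choice of the NFC normalization form are assumptions; the task only says "Unicode normalization" without naming a form.

```python
import unicodedata


def validate_and_normalize(rows, languages, known_languages, known_unit_ids):
    """Validate every language code and unit id first; only if all are valid,
    return the rows with their text normalized.

    rows: dict mapping unit id -> {language code: text}
    """
    errors = []
    for code in languages:
        if code not in known_languages:
            errors.append(f"Unknown language code: {code}")
    for unit_id in rows:
        if unit_id not in known_unit_ids:
            errors.append(f"Unknown translation unit id: {unit_id}")
    if errors:
        # Nothing is imported when any problem is found.
        raise ValueError("; ".join(errors))
    return {
        unit_id: {
            # NFC is an assumed normalization form.
            lang: unicodedata.normalize("NFC", text)
            for lang, text in translations.items()
        }
        for unit_id, translations in rows.items()
    }


rows = {"Page/1": {"fr": "cafe\u0301"}}  # "e" followed by a combining acute accent
clean = validate_and_normalize(rows, ["fr"], {"fr", "es"}, {"Page/1"})
print(clean["Page/1"]["fr"] == "caf\u00e9")
```

Collecting all errors before raising lets the operator fix every bad header or unit id in one pass instead of rerunning the script once per mistake.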

The script should have command line options for specifying the edit summary and the performing user. It might also default to a dry-run mode that validates the input and shows what would happen.
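A sketch of a command line interface that defaults to dry-run, written with Python's `argparse` for illustration only; the real tool is a MediaWiki maintenance script and does not use this code. The option names `--user`, `--summary`, `--really`, and `--overwrite` are the ones mentioned elsewhere in this task; the positional name `csv_file` and the file name `terms.csv` are hypothetical.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Import translations from a CSV file (illustrative sketch)."
    )
    parser.add_argument("csv_file", help="Path to the CSV file to import")
    parser.add_argument("--user", help="User name that performs the edits")
    parser.add_argument("--summary", default="", help="Edit summary for the import")
    parser.add_argument("--overwrite", action="store_true",
                        help="Replace existing translations")
    # Dry run is the default: nothing is written unless this flag is passed.
    parser.add_argument("--really", action="store_true",
                        help="Perform the import instead of only reporting")
    return parser


args = build_parser().parse_args(["terms.csv"])
print("importing" if args.really else "dry run")
```

Making the destructive mode opt-in means a mistyped or forgotten option leads to a harmless report rather than to unwanted edits.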

By using a command line script, we do not have to deal with the complexities of using a job queue. In case the number of such requests increases, a self-service web interface is an option for later.

Event Timeline

Movement Strategy and Governance is maintaining a termbase of about 600 wiki-specific lexemes and their translations in 16 languages, soon to be about 25 languages. This internal resource is supposed to be shared with the movement on Meta soon, but to make it editable through the translation extension and to enable future growth in any direction, we want to export it, which is not yet possible. This script would help a lot in sharing.

Addendum: plenty of teams at WMF, and even chapters, frequently work with translations from translation agencies and external translators and have to publish them on Meta. Importing these with such a script will probably save all of them a significant amount of time. I assume it would be very helpful to create that interface soon, as a second step.

Perhaps a silly question: why don't they just write the translations directly in our system?

That would require training the respective translators. I don't know whether translation agencies are open to this at all, given their internal workflows (it might depend on the agency), but some other points are:

  • Texts might be subject to a blocking period before they are published (which would require external translators to have access to collab, putting the work of moving the translations back on the team)
  • The translators working for a translation agency are not always the same people.
  • Texts translated by translation agencies are not always ready for publication, but need a review and some corrections first.

I like this idea if it would make the process less manual than it is currently. I've often found myself spending an entire work day copy/pasting translated text from a Google doc into the Translation interface on-wiki. Given that I'm a monoglot I'm surprised I haven't messed up entire pages (yet).

How would the translations get into the csv file? Would we be asking translators to provide their translations within a spreadsheet? Is that an easy thing for them to do? My immediate thought is that we might end up just shifting the difficult part of the translation workflow.

Change 802130 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/extensions/Translate@master] Allow CSV export for WikiPageMessageGroup

https://gerrit.wikimedia.org/r/802130

Here's the plan that we have:

CSV Export

We will allow users to download a CSV from the Special:ExportTranslations page in the following format:

| Translation Unit ID | Message Definition | <language code> |
| ... | ... | ... |

The <language code> will refer to the language in which the export has been requested. Under this column the translations to that specific language will be provided.

CSV Import

The exported file can be updated with the translations that need to be imported. More languages with translations can be added as columns:

| Translation unit | Message definition | es | fr | it |
| Main_Page_Translate/3 | Hello World | <Spanish translation> | <French translation> | <Italian translation> |
| Main_Page_Translate/4 | This is the first translation page | <Spanish translation> | <French translation> | <Italian translation> |
| Main_Page_Translate/6 | {{Template:Hello_World_1234}} | <Spanish translation> | <French translation> | <Italian translation> |
| Main_Page_Translate/7 | Replace my stuff ... | <Spanish translation> | <French translation> | <Italian translation> |

In the above table, the translations have been added in Spanish, French and Italian. The translation unit and message definition column values should not be changed.

This CSV file can then be imported via a command line script.
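To make the layout concrete, here is a minimal Python sketch of reading such a file into a per-unit mapping. It is only an illustration of the format, not the import script; `read_import_csv` is a hypothetical name and the sample translations are made up.

```python
import csv
import io


def read_import_csv(text):
    """Parse CSV text laid out as: unit id, message definition, then one
    column per language code. Returns (languages, {unit id: {code: text}})."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    languages = header[2:]  # everything after the two fixed columns
    units = {}
    for record in reader:
        if not record:
            continue  # skip blank lines
        # The message definition (column 2) is only there to help translators.
        units[record[0]] = dict(zip(languages, record[2:]))
    return languages, units


sample = (
    "Translation unit,Message definition,es,fr,it\n"
    "Main_Page_Translate/3,Hello World,Hola mundo,Bonjour le monde,Ciao mondo\n"
    "Main_Page_Translate/4,This is the first translation page,,Page un,\n"
)
languages, units = read_import_csv(sample)
print(languages)
print(units["Main_Page_Translate/4"])
```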

abi_ set the point value for this task to 4.

> How would the translations get into the csv file? Would we be asking translators to provide their translations within a spreadsheet? Is that an easy thing for them to do? My immediate thought is that we might end up just shifting the difficult part of the translation workflow.

This task doesn't solve that problem. A structured format is needed for mass imports, but it doesn't need to be CSV as long as conversion to CSV is easy. It really depends on the capabilities of the translators what format is best for them: the translation interface, Gettext format, a spreadsheet or something else.

Change 803294 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/extensions/Translate@master] Add script to export translations

https://gerrit.wikimedia.org/r/803294

Change 802130 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] Allow CSV export for WikiPageMessageGroup

https://gerrit.wikimedia.org/r/802130

Here's an example CSV that can be exported from Special:ExportTranslations for a translatable page in French:

| Translation message title | Message definition | fr |
| MediaWiki:t1 | bunny | bunny - fr 123 |
| MediaWiki:t3 | bunny | bunny - fr 123 |
| MediaWiki:t4 | fanny | |

That CSV can then be modified to update the French translations and to add translations in other languages as well:

| Translation message title | Message definition | fr | es | hi |
| MediaWiki:t1 | bunny | bunny - fr | bunny - es | bunny - hi |
| MediaWiki:t3 | bunny | bunny - fr | | bunny - hi |
| MediaWiki:t4 | fanny | | | fanny - hi |

This file can then be imported via a command line script on the server:

# to first see what will be imported
$ php extensions/Translate/scripts/importTranslationsFromCsv.php ~/Projects/html/mediawiki/groups/page-Main\ Page\ Translate_fr.csv

* 3 translation(s) to import for MediaWiki:t1
* 2 translation(s) to import for MediaWiki:t3
* 1 translation(s) to import for Main MediaWiki:t4

# then to actually perform the import, add the "--really" flag, along with the "--user" and "--summary" options
$ php extensions/Translate/scripts/importTranslationsFromCsv.php ~/Projects/html/mediawiki/groups/page-Main\ Page\ Translate_fr.csv --really --user Admin --summary "Testing import via CSV"

* 3 translation(s) to import for Translations:Main Page MediaWiki:t1
* 2 translation(s) to import for Translations:Main Page MediaWiki:t3
* 2 translation(s) to import for Translations:Main Page MediaWiki:t4

Proceeding with import...

(1/3) Imported translations for MediaWiki:t1 with 0 failure(s) and 3 successful import(s) ... 
(2/3) Imported translations for MediaWiki:t3 with 0 failure(s) and 2 successful import(s) ... 
(3/3) Imported translations for MediaWiki:t4 with 0 failure(s) and 1 successful import(s) ... 

Success: Import done

Change 809101 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/extensions/Translate@master] Output the source language titles in CSV export

https://gerrit.wikimedia.org/r/809101

Change 803294 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] Add script to import translations from CSV file

https://gerrit.wikimedia.org/r/803294

Change 809101 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] Output the source language titles in CSV export

https://gerrit.wikimedia.org/r/809101

Added documentation for this feature here: https://www.mediawiki.org/wiki/User:APatro_(WMF)/Import_Translations_via_CSV

Will move it to a more appropriate place later.

Tested on Translatewiki.net. Leaving open for a few days in case anyone wants to test this on Wikimedia servers.

For https://meta.wikimedia.org/wiki/Movement_Strategy_and_Governance/Termbase/Table, what is the desired workflow to request the import of translations?

(edit: Following the advice of @abi_ , I created a fresh task: T313051)

Change 814716 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/extensions/Translate@master] CSV Translation Import: Allow upper case language codes

https://gerrit.wikimedia.org/r/814716

Change 814716 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] CSV Translation Import: Allow upper case language codes

https://gerrit.wikimedia.org/r/814716
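One plausible way to tolerate upper-case language codes in the CSV header is to lower-case them before validation. The Python sketch below only illustrates that idea; it is an assumption about the approach, not the code from the patch, and `normalize_language_codes` is a hypothetical name.

```python
def normalize_language_codes(header_codes):
    """Trim and lower-case language codes taken from the CSV header."""
    return [code.strip().lower() for code in header_codes]


print(normalize_language_codes(["FR", "pt-BR", " hi "]))
```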

This tool was used to import a large CSV file (T313051: Mass import translations from CSV file for MSG termbase), and the process completed without errors. Two improvements were identified:

  • Reduce the amount of logs that are generated as a result of mass imports
  • Optimize the script to reduce the creation of TranslateRenderJobs

I'll be creating these as separate tasks.