Integrate multilingual dataset from the Translate extension into OpusMT
Closed, Resolved | Public | 2 Estimated Story Points

Description

Recent work to use MinT to provide initial translations for the Translate extension (T338131), including its use on translatewiki.net (T340544), can be complemented by exporting the final translations into a dataset and integrating it into the Opus project.

In this way, the corpus of multilingual text can be expanded with data from localization strings and translatable pages, resulting in more data to train the next version of the models.

As an initial step, we may want to generate some samples to coordinate with the Opus team and make sure the format provided is useful.


In a similar effort, published translations from Content translation are already integrated into Opus.

Event Timeline

We have the ExportTranslationsMaintenanceScript that can be used to export translations for projects on translatewiki.net:

php maintenance/run.php /srv/mediawiki/workdir/extensions/Translate/scripts/export.php --target /home/abi/tmp/export --group discordwikibot --lang "*" --offline-gettext-format "" --skip-group-sync-check

The above command exports messages from the discordwikibot project in all languages in the Gettext format.
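To check whether the exported files are usable as parallel data, the source/translation pairs can be pulled out of a PO file with a few lines of code. The following is a minimal sketch, not the parser OpusMT uses: it assumes the standard Gettext convention that `msgid` holds the source text and `msgstr` the translation, and it deliberately ignores comments, `msgctxt` and plural entries. The function name `read_po_pairs` is made up for illustration.

```python
import ast


def _unquote(token: str) -> str:
    """Decode one double-quoted PO string literal, e.g. '"Hello\\n"'."""
    # Assumption: the escapes in the file are ones Python also understands
    # (\n, \t, \", \\), which covers the common cases in PO files.
    return ast.literal_eval(token)


def read_po_pairs(text: str) -> list[tuple[str, str]]:
    """Return (msgid, msgstr) pairs that have a non-empty translation."""
    pairs = []
    fields = {"msgid": "", "msgstr": ""}
    current = None  # field that continuation lines belong to

    def flush():
        # The PO header has an empty msgid, so it is skipped here as well.
        if fields["msgid"] and fields["msgstr"]:
            pairs.append((fields["msgid"], fields["msgstr"]))
        fields["msgid"] = fields["msgstr"] = ""

    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("msgid "):
            flush()  # a new entry starts, so store the previous one
            current = "msgid"
            fields[current] = _unquote(line[len("msgid "):])
        elif line.startswith("msgstr "):
            current = "msgstr"
            fields[current] = _unquote(line[len("msgstr "):])
        elif line.startswith('"') and current:
            fields[current] += _unquote(line)  # multi-line string
        else:
            current = None  # blank line, comment, or unsupported keyword
    flush()
    return pairs
```

Reading one of the exported files, for example `discordwikibot/af.po` under the export target, with `encoding="utf-8"` and passing its text to `read_po_pairs` would give a list of pairs that can be inspected by hand before sending samples to the Opus team.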

I've attached two files as samples of how the output looks:

One file per project per language will be generated. We can gzip all of the files and provide a single file for OpusMT to download.

Multiple projects can be exported at once:

php maintenance/run.php /srv/mediawiki/workdir/extensions/Translate/scripts/export.php --target /home/abi/tmp/export --group anvesha,cita,discordwikibot --lang "*" --offline-gettext-format "" --skip-group-sync-check

The output directory then looks like:

.
|-- anvesha
|   |-- aa.po
|   |-- aae.po
|   |-- ab.po
|   |-- abs.po
|   |-- ace.po
|   |-- acf.po
|   |-- acm.po
|   |-- ada.po
|   |-- ady-cyrl.po
|   |-- ady.po
|   |-- aeb-arab.po
|   |-- ....
|   |-- ....
|   |-- ....
|   |-- zh-hk.po
|   |-- zh-min-nan.po
|   |-- zh-mo.po
|   |-- zh-my.po
|   |-- zh-sg.po
|   |-- zh-tw.po
|   |-- zh-yue.po
|   |-- zh.po
|   `-- zu.po
|-- cita
|   |-- aa.po
|   |-- aae.po
|   |-- ab.po
|   |-- abs.po
|   |-- ace.po
|   |-- acf.po
|   |-- acm.po
|   |-- ada.po
|   |-- ady-cyrl.po
|   |-- ady.po
|   |-- aeb-arab.po
|   |-- aeb-latn.po
|   |-- aeb.po
|   |-- af.po
|   |-- ahr.po
|   |-- ajg.po
|   |-- akz.po
|   |-- ....
|   |-- ....
|   |-- ....
|   |-- zh-hant.po
|   |-- zh-hk.po
|   |-- zh-min-nan.po
|   |-- zh-mo.po
|   |-- zh-my.po
|   |-- zh-sg.po
|   |-- zh-tw.po
|   |-- zh-yue.po
|   |-- zh.po
|   `-- zu.po
`-- discordwikibot
    |-- aa.po
    |-- aae.po
    |-- ab.po
    |-- abs.po
    |-- ace.po
    |-- acf.po
    |-- acm.po
    |-- ada.po
    |-- ady-cyrl.po
    |-- aeb-arab.po
    |-- aeb-latn.po
    |-- af.po
    |-- ahr.po
    |-- ajg.po
    |-- akz.po
    |-- ale-cyrl.po
    |-- ....
    |-- ....
    |-- ....
    |-- ....
    |-- xsy.po
    |-- yi.po
    |-- yo.po
    |-- yoi.po
    |-- yrk.po
    |-- yrl.po
    |-- yua.po
    |-- yue-hans.po
    |-- yue-hant.po
    |-- za.po
    |-- zea.po
    |-- zgh.po
    |-- zh-hans.po
    |-- zh-hant.po
    |-- zh-hk.po
    `-- zu.po

We can use --group "*" to export all the groups configured on translatewiki.net.

For OpusMT to grab these files, we can have these gzipped and available on the translatewiki.net server where they can be downloaded.

I've provided a dump of all the translations from translatewiki.net in the Gettext PO format to the OpusMT team. I'm waiting to hear from them whether the data is ingestible and useful.

The OpusMT team said the tarball containing all translations from translatewiki.net could be processed.

Regarding how frequent the updates need to be, this was the response:

I guess we would not need very frequent updates, since each would have to be a new version in OPUS. So yearly updates would probably be fine.

I think we can write a script that runs every 6 months and sends out an email. Since these tarballs can be huge (2GB each), I don't think we need to keep older tarballs, just the most recent one. The tarball can be placed under /www/translatewiki.net/docroot/static/ as twn-translation-dump.tar.gz. The tarball can include an INFO file that records when the tarball was created. Each time the script runs, it overwrites the previous tarball.

@Nikerabbit, @Wangombe - Thoughts?
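As a rough illustration of the proposal above, the packaging step could look like the sketch below. It is only one possible shape for such a script: the function name `build_dump`, the `Created:` line in the INFO file and the temporary `.tmp` suffix are all assumptions, and the email notification and scheduling are left out.

```python
import io
import os
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def build_dump(export_dir: Path, target: Path) -> None:
    """Pack export_dir plus an INFO file into target, replacing any old dump."""
    created = datetime.now(timezone.utc)
    # Hypothetical INFO layout: a single line with the creation time.
    info = f"Created: {created.isoformat(timespec='seconds')}\n".encode("utf-8")

    # Write under a temporary name first so that nobody downloads a
    # half-written tarball; os.replace() then swaps it in over the old one.
    tmp = target.with_name(target.name + ".tmp")
    with tarfile.open(tmp, "w:gz") as tar:
        member = tarfile.TarInfo("INFO")
        member.size = len(info)
        member.mtime = int(created.timestamp())
        tar.addfile(member, io.BytesIO(info))
        # One sub-directory per project, one .po file per language.
        for path in sorted(export_dir.iterdir()):
            tar.add(str(path), arcname=path.name)
    os.replace(tmp, target)
```

Called with the export directory and a target such as `twn-translation-dump.tar.gz` in the static docroot, each run would leave exactly one tarball in place, which matches the idea of not keeping older dumps.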

I think having an INFO file outside the tarball would provide a shorter path to reading the file without having to download a whole 2GB of data just to find out that it may not be useful. Both these files can sit inside a directory with the name twn-translation-dump. A simpler alternative would be to add the date to the name of the tarball. Something like twn-translation-dump-29-04-24.tar.gz.

I think the idea of having the INFO file outside makes sense. The INFO file could contain the name of the latest dump that users can download.
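The two naming ideas from this exchange (a dated tarball name and an INFO file that sits next to the tarball and names the latest dump) can be sketched as follows. The helper names `dump_filename` and `write_info` and the two-line INFO layout are hypothetical; only the `twn-translation-dump-DD-MM-YY.tar.gz` pattern comes from the suggestion above.

```python
from datetime import date
from pathlib import Path


def dump_filename(day: date) -> str:
    """Dated tarball name in the DD-MM-YY style suggested in the thread."""
    return f"twn-translation-dump-{day.strftime('%d-%m-%y')}.tar.gz"


def write_info(dump_dir: Path, day: date) -> Path:
    """Write an INFO file next to the tarball, naming the latest dump."""
    info_path = dump_dir / "INFO"
    # Hypothetical layout: which file to fetch, and when it was created.
    info_path.write_text(
        f"Latest dump: {dump_filename(day)}\nCreated: {day.isoformat()}\n",
        encoding="utf-8",
    )
    return info_path
```

With this split, a consumer would only need to fetch the small INFO file to decide whether the 2GB download is worth it. If dated names are adopted, a year-first format such as YYYY-MM-DD might be worth considering, since it sorts chronologically in a directory listing.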

Why not just enable directory listing of the dumps directory in nginx? Like things in https://dumps.wikimedia.org/ have.

Considering that the OpusMT team requires dumps only once a year, I'm inclined to mark this task as done, and build the script to automate the creation of the dump as part of T299493: Provide regular dumps of translations from translatewiki.net.

abi_ changed the point value for this task from 4 to 2. Thu, May 9, 5:03 AM