
Provide regular dumps of translations from translatewiki.net
Closed, Resolved | Public | 4 Estimated Story Points

Description

There is interest in having the translations available as dumps, so that they can be used efficiently in other translation memories and similar tools. The message collection Action API interface has performance concerns that make it unsuitable for this.

We can provide dumps as files that are fast to download without using CPU resources.

Event Timeline

Change 755356 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/Translate@master] Allow exporting AggregateMessageGroups in offline format via CLI

https://gerrit.wikimedia.org/r/755356

Change 755356 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] Allow exporting AggregateMessageGroups in offline format via CLI

https://gerrit.wikimedia.org/r/755356

Hello! Just for follow-up. Code was merged many months ago. Are dumps already happening? Thanks!

@Toniher I believe this stalled while waiting for feedback on: which message groups to export; whether to produce one combined file per language or one file per language per group; which format to use (is gettext okay?); and how frequent the updates need to be.

@Nikerabbit I think we already commented and Gettext would be OK, but let me ping @Txemaq so he can tell more about...

If possible, we would prefer periodic dumps as follows:

  • monthly dump
  • a resulting folder with a predictable name (e.g. a "language" directory named after the language code) containing files for all MediaWiki extensions translated into that language.
  • format for resulting files: gettext (po) format.

Thanks in advance!

Hi @Nikerabbit ! If there is anything else you think we could help on this, please let us know! Thanks!

Hello @Nikerabbit ! Happy New Year! Let us know if we could help on this! Thanks!

Nikerabbit moved this task from Backlog to System admin stuff on the translatewiki.net board.
abi_ subscribed.

We're planning to automate the generation of the dumps. This would involve the following:

  1. Write a script that creates a tarball containing the translations and generates an info.txt file containing the date the file was created. The script should also remove the old translation file.
  2. Automate the running of the script using systemd. Example: https://gerrit.wikimedia.org/r/c/translatewiki/+/1027522

I used this script to generate the tarball for the OpusMT team:

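#!/bin/bash

# Run from the MediaWiki working directory; stop if it is missing.
cd /srv/mediawiki/workdir || exit 1

# Expand the '*' group spec into the exportable message groups, one per line.
groups=$(php maintenance/run.php ./extensions/Translate/scripts/expand-groupspec.php --exportable '*')

# Export each group for all languages into the target directory.
while IFS= read -r group; do
  echo "Exporting $group .."
  php maintenance/run.php /srv/mediawiki/workdir/extensions/Translate/scripts/export.php --target /home/abi/tmp/export --group "$group" --lang "*" --offline-gettext-format "" --skip-group-sync-check
done <<< "$groups"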

The file can be created under /www/translatewiki.net/docroot/static/translation-dump/ with a name such as translations-2024-05-09.tar.gz. The info.txt file will always be present and will contain the name of the latest translation dump file, so automated scripts can read it to find the latest dump.
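As a rough illustration of the packaging step described above, here is a minimal sketch. It is not the script that was later merged. The function name make_dump, its arguments, and the temporary-file step are assumptions made for the example, and info.txt is assumed to hold only the file name of the latest dump.

```shell
#!/bin/bash
# Sketch only: package an export directory into a dated tarball and
# record its name in info.txt. make_dump and its arguments are hypothetical.

# Usage: make_dump EXPORT_DIR DUMP_DIR [DATE]
make_dump() {
  local export_dir=$1 dump_dir=$2
  # Default to today's date in the translations-YYYY-MM-DD.tar.gz pattern.
  local stamp=${3:-$(date +%Y-%m-%d)}
  local name="translations-${stamp}.tar.gz"

  mkdir -p "$dump_dir" || return 1

  # Build under a temporary name so downloaders never see a partial archive.
  tar -czf "$dump_dir/$name.tmp" -C "$export_dir" . || return 1
  mv "$dump_dir/$name.tmp" "$dump_dir/$name" || return 1

  # Assumption: info.txt contains just the name of the latest dump file.
  printf '%s\n' "$name" > "$dump_dir/info.txt"
}
```

It could be called as make_dump /home/abi/tmp/export /www/translatewiki.net/docroot/static/translation-dump once the export loop has finished. Writing info.txt last means it only ever points at a complete archive.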

abi_ changed the task status from Open to In Progress. May 14 2024, 4:10 AM

Change #1031444 had a related patch set uploaded (by Wangombe; author: Wangombe):

[translatewiki@master] Add script to automate export of translation dumps

https://gerrit.wikimedia.org/r/1031444

Change #1034837 had a related patch set uploaded (by Wangombe; author: Wangombe):

[translatewiki@master] Add automation script to export translation dumps every 6 months

https://gerrit.wikimedia.org/r/1034837

Change #1031444 merged by jenkins-bot:

[translatewiki@master] Add script to automate export of translation dumps

https://gerrit.wikimedia.org/r/1031444

Change #1034837 merged by jenkins-bot:

[translatewiki@master] Add automation script to export translation dumps every 6 months

https://gerrit.wikimedia.org/r/1034837

Ran a test build of the script that generated an archive: https://translatewiki.net/static/translation-dump/info.txt
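For consumers of the dumps, a hedged sketch of how an automated script might turn the contents of info.txt into a download URL. The function latest_dump_url is hypothetical, and the sketch assumes that the first line of info.txt is the bare file name of the latest dump; adjust the parsing if the real file has a different layout.

```shell
#!/bin/bash
# Sketch only: resolve the latest dump URL from the contents of info.txt.
# latest_dump_url is a hypothetical helper, not part of translatewiki.net.

# Usage: latest_dump_url BASE_URL < info.txt
latest_dump_url() {
  local base=${1%/} name

  # Read the first line; accept a final line without a trailing newline.
  IFS= read -r name || [ -n "$name" ] || return 1
  name=${name%$'\r'}

  # Refuse anything that is not a plain file name (empty, path, or spaces).
  case $name in
    ''|*/*|*[[:space:]]*) return 1 ;;
  esac

  printf '%s/%s\n' "$base" "$name"
}
```

With curl available, something like curl -fsS https://translatewiki.net/static/translation-dump/info.txt | latest_dump_url https://translatewiki.net/static/translation-dump would print the URL to fetch. Keeping the network call outside the function makes the parsing easy to check offline.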

Deployed the script and puppet changes on translatewiki.net

abi_ changed the point value for this task from 2 to 4. May 29 2024, 12:21 PM

Change #1038250 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[translatewiki@master] Add dir listing for translation dumps

https://gerrit.wikimedia.org/r/1038250

I think we also need some automated clean-up to avoid accumulating too many dumps.
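One possible shape for such a clean-up, as a sketch only and not necessarily what was deployed: keep the newest few dumps and delete the rest. The function prune_dumps and the retention count are hypothetical; the sketch relies on the translations-YYYY-MM-DD.tar.gz naming pattern, so that sorted file names are also in chronological order.

```shell
#!/bin/bash
# Sketch only: keep the newest KEEP dumps in a directory and delete the rest.
# prune_dumps is a hypothetical helper.

# Usage: prune_dumps DUMP_DIR KEEP
prune_dumps() {
  local dump_dir=$1 keep=$2
  # Glob results are sorted, and date-stamped names sort oldest first.
  local files=("$dump_dir"/translations-*.tar.gz)

  # With no match the array holds the literal pattern; nothing to delete.
  [ -e "${files[0]}" ] || return 0

  local excess=$(( ${#files[@]} - keep ))
  local i
  # Delete the oldest entries until only KEEP files remain.
  for (( i = 0; i < excess; i++ )); do
    rm -f -- "${files[i]}"
  done
}
```

Calling prune_dumps /www/translatewiki.net/docroot/static/translation-dump 2 after a new dump is written would leave the two most recent archives in place. Other files in the directory, such as info.txt, are not touched because they do not match the pattern.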

Change #1038250 merged by jenkins-bot:

[translatewiki@master] Add dir listing for translation dumps

https://gerrit.wikimedia.org/r/1038250

Screenshot of the extracted tarball (attachment: image.png).

The documentation is available at https://translatewiki.net/wiki/Dumps

Change #1047438 had a related patch set uploaded (by Wangombe; author: Wangombe):

[translatewiki@master] Remove existing tarball if any before saving a new translation dump.

https://gerrit.wikimedia.org/r/1047438

Change #1047438 merged by jenkins-bot:

[translatewiki@master] Remove existing tarball if any before saving a new translation dump.

https://gerrit.wikimedia.org/r/1047438