We need to create Content Translation Parallel Corpora dumps (see T122042) and make them available to the public at https://dumps.wikimedia.org/
- Script to provide dumps: scripts/dump-corpora.php
- Frequency: Weekly
Status | Assigned | Task
---|---|---
Resolved | santhosh | T95886: For ContentTranslation MT, store information about source content, machine-translated content and user-edited content
Resolved | santhosh | T111905: Design the technical infrastructure for parallel corpora storage and api (tracking)
Resolved | ArielGlenn | T127793: Create Content Translation Parallel Corpora dumps
Resolved | Nikerabbit | T133006: Add compression support to scripts/dump-corpora.php
Resolved | Nikerabbit | T133007: Add --output option to scripts/dump-corpora.php
@ArielGlenn Did you test the updated script? It is in production now (although only on testwiki/group1 at the time of writing) and available for testing.
Not yet, but I have been getting the calling script ready for testing (and for production runs). Until recently we did all these sorts of dumps using a variety of one-off scripts; I've started converting them so that they all use the same mechanism. A few additional changes were needed, though, and have to be tested before the corpora dumps can run. Once that's good to go, I'll do a full run manually using that script and relay the results here (times, memory usage, etc.).
Thanks for the update, @ArielGlenn. Let us know about any changes needed in the ContentTranslation script.
I'm currently running on snapshot1007 the following:
php5 /srv/mediawiki/multiversion/MWScript.php extensions/ContentTranslation/scripts/dump-corpora.php --wiki cawiki --split-at 10000 --outputdir /mnt/data/temp/ --compression gzip
Waiting to see what issues we encounter and/or what output is produced.
It produced one file:
-rw-rw-r-- 1 datasets datasets 98796066 Jul 27 09:15 cx-corpora._2_.html.json.gz
Is this expected?
Run time: about ten minutes.
I don't know how good the compression ratio is, but 100 MB compressed sounds a bit on the small side; it could still be correct, though. The surprise is that it created only one file: I would expect some language pairs to cross the 10000 threshold and get files of their own. If that is really the case, we should consider lowering the threshold.
Perhaps I should inspect the file manually to see if it is missing some content.
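For a quick sanity check, `gzip -l` reports compressed and uncompressed sizes (and the ratio) without decompressing, and counting `"id"` fields gives a rough record count. A sketch along those lines, using the file from the run above:

```shell
# Compressed size, uncompressed size and ratio, read from the gzip
# trailer (fast even for large files).
gzip -l /mnt/data/temp/cx-corpora._2_.html.json.gz

# Rough record count: occurrences of the "id" field, counted with -o
# because the JSON may well sit on a single line.
zgrep -o '"id":' /mnt/data/temp/cx-corpora._2_.html.json.gz | wc -l
```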
I had thought that running on one wiki generates all files; is that not true? At any rate, let me know a host where you have access and I'll put a copy of the file there.
After discussion on IRC, rerunning with:
date; php5 /srv/mediawiki/multiversion/MWScript.php extensions/ContentTranslation/scripts/dump-corpora.php --wiki cawiki --split-at 500 --outputdir /mnt/data/temp/ --compression gzip ; date
start date/time is: Wed Jul 27 12:20:42 UTC 2016
I'll also try to debug this myself with a bigger dataset, as I suspect the script might be using excessive memory.
This time it gave me a fatal error instead of silently exiting:
...
/mnt/data/tempcx-corpora.es2ca.html.json.gz
/mnt/data/tempcx-corpora.en2ca.html.json.gz
/mnt/data/tempcx-corpora.fr2ca.html.json.gz
/mnt/data/tempcx-corpora.no2nn.html.json.gz
/mnt/data/tempcx-corpora.en2pa.html.json.gz
/mnt/data/tempcx-corpora.en2nb.html.json.gz
/mnt/data/tempcx-corpora.nn2nb.html.json.gz
/mnt/data/tempcx-corpora.ru2uk.html.json.gz
/mnt/data/tempcx-corpora.en2uk.html.json.gz
/mnt/data/tempcx-corpora.en2fr.html.json.gz
/mnt/data/tempcx-corpora.es2fr.html.json.gz
/mnt/data/tempcx-corpora.en2vi.html.json.gz
/mnt/data/tempcx-corpora.ru2ba.html.json.gz
/mnt/data/tempcx-corpora.es2gl.html.json.gz
/mnt/data/tempcx-corpora.es2ast.html.json.gz
/mnt/data/tempcx-corpora.ru2kk.html.json.gz
/mnt/data/tempcx-corpora.en2el.html.json.gz
/mnt/data/tempcx-corpora.en2cs.html.json.gz
/mnt/data/tempcx-corpora.en2sq.html.json.gz
/mnt/data/tempcx-corpora.en2tr.html.json.gz
/mnt/data/tempcx-corpora.en2tl.html.json.gz
/mnt/data/tempcx-corpora.en2pl.html.json.gz
/mnt/data/tempcx-corpora.en2sr.html.json.gz
/mnt/data/tempcx-corpora.en2nl.html.json.gz
/mnt/data/tempcx-corpora.en2ro.html.json.gz
/mnt/data/tempcx-corpora.en2bn.html.json.gz
/mnt/data/tempcx-corpora.en2th.html.json.gz
/mnt/data/tempcx-corpora.en2ta.html.json.gz
/mnt/data/tempcx-corpora.en2ru.html.json.gz
/mnt/data/tempcx-corpora.uk2ru.html.json.gz
/mnt/data/tempcx-corpora.en2de.html.json.gz
/mnt/data/tempcx-corpora.en2ja.html.json.gz
/mnt/data/tempcx-corpora.en2ko.html.json.gz
/mnt/data/tempcx-corpora.en2it.html.json.gz
/mnt/data/tempcx-corpora.en2he.html.json.gz
/mnt/data/tempcx-corpora.en2zh.html.json.gz
/mnt/data/tempcx-corpora.en2fa.html.json.gz
/mnt/data/tempcx-corpora.en2ar.html.json.gz
/mnt/data/tempcx-corpora.zh2hak.html.json.gz
/mnt/data/tempcx-corpora._2hak.html.json.gz
/mnt/data/tempcx-corpora.fr2en.html.json.gz
/mnt/data/tempcx-corpora.es2en.html.json.gz
Fatal error: Out of memory (allocated 9417523200) (tried to allocate 18446744071930920519 bytes) in /srv/mediawiki/php-1.28.0-wmf.11/includes/json/FormatJson.php on line 152
Wed Jul 27 12:27:10 UTC 2016
Lines in question:
```
149		if ( $pretty !== false ) {
150			// Workaround for <https://bugs.php.net/bug.php?id=66021>
151			if ( $bug66021 ) {
152				$json = preg_replace( self::WS_CLEANUP_REGEX, '', $json );
153			}
```
Doing a preg_replace on a string many megabytes in size is probably not the most memory-efficient approach. According to https://github.com/php/php-src/commit/82a4f1a1a287d9dbf01156bc14ceb13ccbf16d7a, the underlying bug has been fixed since PHP 5.5.12 in the 5.5.x branch, but the script is being run with PHP 5.5.9-1ubuntu4.17.
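Since the fix only landed in 5.5.12, a quick shell check along these lines can tell whether the running interpreter predates it (`version_lt` is a hypothetical helper, and the fallback version string is just for illustration when php5 is not on PATH):

```shell
# Compare two version strings using GNU sort's version ordering (-V).
# Succeeds when $1 is strictly older than $2.
version_lt() {
	[ "$1" != "$2" ] &&
		[ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# Ask the interpreter for its version; fall back to a placeholder if
# php5 is unavailable (e.g. when trying this snippet elsewhere).
PHP_VER=$( { php5 -r 'echo PHP_VERSION;'; } 2>/dev/null || echo '5.5.9-1ubuntu4.17' )

if version_lt "${PHP_VER%%-*}" 5.5.12; then
	echo "PHP $PHP_VER predates the bug 66021 fix; skip pretty-printing"
fi
```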
After more conversation on IRC, trying:
date; php5 /srv/mediawiki/multiversion/MWScript.php extensions/ContentTranslation/scripts/dump-corpora.php --wiki cawiki --split-at 500 --outputdir /mnt/data/temp/ --compression gzip --format tmx; date
to see if it completes.
Ended after a bit over ten minutes with this:
Wed Jul 27 12:42:38 UTC 2016
TMX output format is only supported with plaintext
Wed Jul 27 12:53:21 UTC 2016
Running again as:
date; php5 /srv/mediawiki/multiversion/MWScript.php extensions/ContentTranslation/scripts/dump-corpora.php --wiki cawiki --split-at 500 --outputdir /mnt/data/temp/ --compression gzip --format tmx --plaintext ; date
Output:
Wed Jul 27 13:16:51 UTC 2016
/mnt/data/tempcx-corpora.es2pt.text.tmx.gz
/mnt/data/tempcx-corpora.en2pt.text.tmx.gz
/mnt/data/tempcx-corpora.en2id.text.tmx.gz
/mnt/data/tempcx-corpora.ca2es.text.tmx.gz
/mnt/data/tempcx-corpora.en2es.text.tmx.gz
/mnt/data/tempcx-corpora.pt2es.text.tmx.gz
/mnt/data/tempcx-corpora.fr2es.text.tmx.gz
/mnt/data/tempcx-corpora.es2ca.text.tmx.gz
/mnt/data/tempcx-corpora.en2ca.text.tmx.gz
/mnt/data/tempcx-corpora.fr2ca.text.tmx.gz
/mnt/data/tempcx-corpora.no2nn.text.tmx.gz
/mnt/data/tempcx-corpora.en2pa.text.tmx.gz
/mnt/data/tempcx-corpora.en2nb.text.tmx.gz
/mnt/data/tempcx-corpora.nn2nb.text.tmx.gz
/mnt/data/tempcx-corpora.ru2uk.text.tmx.gz
/mnt/data/tempcx-corpora.en2uk.text.tmx.gz
/mnt/data/tempcx-corpora.en2fr.text.tmx.gz
/mnt/data/tempcx-corpora.es2fr.text.tmx.gz
/mnt/data/tempcx-corpora.en2vi.text.tmx.gz
/mnt/data/tempcx-corpora.ru2ba.text.tmx.gz
/mnt/data/tempcx-corpora.es2gl.text.tmx.gz
/mnt/data/tempcx-corpora.es2ast.text.tmx.gz
/mnt/data/tempcx-corpora.ru2kk.text.tmx.gz
/mnt/data/tempcx-corpora.en2el.text.tmx.gz
/mnt/data/tempcx-corpora.en2cs.text.tmx.gz
/mnt/data/tempcx-corpora.en2sq.text.tmx.gz
/mnt/data/tempcx-corpora.en2tr.text.tmx.gz
/mnt/data/tempcx-corpora.en2tl.text.tmx.gz
/mnt/data/tempcx-corpora.en2pl.text.tmx.gz
/mnt/data/tempcx-corpora.en2sr.text.tmx.gz
/mnt/data/tempcx-corpora.en2nl.text.tmx.gz
/mnt/data/tempcx-corpora.en2ro.text.tmx.gz
/mnt/data/tempcx-corpora.en2bn.text.tmx.gz
/mnt/data/tempcx-corpora.en2th.text.tmx.gz
/mnt/data/tempcx-corpora.en2ta.text.tmx.gz
/mnt/data/tempcx-corpora.en2ru.text.tmx.gz
/mnt/data/tempcx-corpora.uk2ru.text.tmx.gz
/mnt/data/tempcx-corpora.en2de.text.tmx.gz
/mnt/data/tempcx-corpora.en2ja.text.tmx.gz
/mnt/data/tempcx-corpora.en2ko.text.tmx.gz
/mnt/data/tempcx-corpora.en2it.text.tmx.gz
/mnt/data/tempcx-corpora.en2he.text.tmx.gz
/mnt/data/tempcx-corpora.en2zh.text.tmx.gz
/mnt/data/tempcx-corpora.en2fa.text.tmx.gz
/mnt/data/tempcx-corpora.en2ar.text.tmx.gz
/mnt/data/tempcx-corpora.zh2hak.text.tmx.gz
/mnt/data/tempcx-corpora._2hak.text.tmx.gz
/mnt/data/tempcx-corpora.fr2en.text.tmx.gz
/mnt/data/tempcx-corpora.es2en.text.tmx.gz
/mnt/data/tempcx-corpora._2_.text.tmx.gz
Wed Jul 27 13:36:00 UTC 2016
Two questions: what is the last file? And, are all the contents present in the dump?
Change 301548 had a related patch set uploaded (by Nikerabbit):
DumpCorpora: Skip JSON formatting for old PHP versions
The last file contains everything else that did not get a file of its own. I am checking to see whether all expected content is in these files.
Change 301548 merged by jenkins-bot:
DumpCorpora: Skip JSON formatting for old PHP versions
Change 301601 had a related patch set uploaded (by KartikMistry):
DumpCorpora: Skip JSON formatting for old PHP versions
Change 301602 had a related patch set uploaded (by KartikMistry):
DumpCorpora: Skip JSON formatting for old PHP versions
Change 301602 merged by jenkins-bot:
DumpCorpora: Skip JSON formatting for old PHP versions
Change 301601 merged by jenkins-bot:
DumpCorpora: Skip JSON formatting for old PHP versions
datasets@snapshot1007:~$ date; php5 /srv/mediawiki/multiversion/MWScript.php extensions/ContentTranslation/scripts/dump-corpora.php --wiki cawiki --split-at 500 --outputdir /mnt/data/temp/ --compression gzip ; date
Thu Jul 28 15:27:53 UTC 2016
/mnt/data/tempcx-corpora.es2pt.html.json.gz
/mnt/data/tempcx-corpora.en2pt.html.json.gz
/mnt/data/tempcx-corpora.en2id.html.json.gz
/mnt/data/tempcx-corpora.ca2es.html.json.gz
/mnt/data/tempcx-corpora.en2es.html.json.gz
/mnt/data/tempcx-corpora.pt2es.html.json.gz
/mnt/data/tempcx-corpora.fr2es.html.json.gz
/mnt/data/tempcx-corpora.es2ca.html.json.gz
/mnt/data/tempcx-corpora.en2ca.html.json.gz
/mnt/data/tempcx-corpora.fr2ca.html.json.gz
/mnt/data/tempcx-corpora.no2nn.html.json.gz
/mnt/data/tempcx-corpora.en2pa.html.json.gz
/mnt/data/tempcx-corpora.en2nb.html.json.gz
/mnt/data/tempcx-corpora.nn2nb.html.json.gz
/mnt/data/tempcx-corpora.ru2uk.html.json.gz
/mnt/data/tempcx-corpora.en2uk.html.json.gz
/mnt/data/tempcx-corpora.en2fr.html.json.gz
/mnt/data/tempcx-corpora.es2fr.html.json.gz
/mnt/data/tempcx-corpora.en2vi.html.json.gz
/mnt/data/tempcx-corpora.ru2ba.html.json.gz
/mnt/data/tempcx-corpora.es2gl.html.json.gz
/mnt/data/tempcx-corpora.es2ast.html.json.gz
/mnt/data/tempcx-corpora.ru2kk.html.json.gz
/mnt/data/tempcx-corpora.en2el.html.json.gz
/mnt/data/tempcx-corpora.en2cs.html.json.gz
/mnt/data/tempcx-corpora.en2sq.html.json.gz
/mnt/data/tempcx-corpora.en2tr.html.json.gz
/mnt/data/tempcx-corpora.en2tl.html.json.gz
/mnt/data/tempcx-corpora.en2pl.html.json.gz
/mnt/data/tempcx-corpora.en2sr.html.json.gz
/mnt/data/tempcx-corpora.en2nl.html.json.gz
/mnt/data/tempcx-corpora.en2ro.html.json.gz
/mnt/data/tempcx-corpora.en2bn.html.json.gz
/mnt/data/tempcx-corpora.en2th.html.json.gz
/mnt/data/tempcx-corpora.en2ta.html.json.gz
/mnt/data/tempcx-corpora.en2ru.html.json.gz
/mnt/data/tempcx-corpora.uk2ru.html.json.gz
/mnt/data/tempcx-corpora.en2de.html.json.gz
/mnt/data/tempcx-corpora.en2ja.html.json.gz
/mnt/data/tempcx-corpora.en2ko.html.json.gz
/mnt/data/tempcx-corpora.en2it.html.json.gz
/mnt/data/tempcx-corpora.en2he.html.json.gz
/mnt/data/tempcx-corpora.en2zh.html.json.gz
/mnt/data/tempcx-corpora.en2fa.html.json.gz
/mnt/data/tempcx-corpora.en2ar.html.json.gz
/mnt/data/tempcx-corpora.zh2hak.html.json.gz
/mnt/data/tempcx-corpora._2hak.html.json.gz
/mnt/data/tempcx-corpora.fr2en.html.json.gz
/mnt/data/tempcx-corpora.es2en.html.json.gz
/mnt/data/tempcx-corpora._2_.html.json.gz
Thu Jul 28 15:35:00 UTC 2016
datasets@snapshot1007:~$
@Nikerabbit can you verify that all the contents are there? Same place as always, /mnt/data/temp.
Number of unique drafts in the dump files: 51032 [1]
Number of published drafts in the database: 52921 [2]
About 2000 new items are published per week, so that does not quite explain the difference. My theory is that some of the drafts are actually empty (i.e. they contain no user content at all, only machine translation or similar), which we filter out:
```
// Some general cleanup
foreach ( $translation['corpora'] as $id => $unit ) {
	if ( !isset( $unit['user'] ) ) {
		unset( $translation['corpora'][$id] );
		continue;
	}
```
The same check for en2de gives 304 and 306; the two missing drafts are new, given that they have larger translation_id values.
In addition, [3] indicates that no file was truncated before compression. So as far as I can see, these files are valid.
[1] zgrep -E '"id":"([^"]+)"' -o cx-corpora.*2*.html.json.gz | grep -E -o '[0-9]+/' | sort | uniq | wc -l
[2] select count(*) from cx_translations, cx_corpora where (translation_status = 'published' or translation_target_url is not null) and translation_id = cxc_translation_id group by translation_id;
[3] zgrep -E -L '\}\}\]$' *.html.json*
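The truncation check in [3] relies on every intact file ending in `}}]` before compression. Here is a minimal reproduction of how `zgrep -L` singles out a damaged file (the file names and contents below are made up):

```shell
cd "$(mktemp -d)"

# One complete fragment and one cut off mid-record.
printf '[{"id":"1/a","source":{"content":"x"}}]' | gzip > ok.html.json.gz
printf '[{"id":"1/a","source":{"con' | gzip > truncated.html.json.gz

# -L lists files with NO line matching the end-of-document pattern,
# so only the damaged file is reported.
bad=$(zgrep -E -L '\}\}\]$' *.html.json.gz) || true
echo "possibly truncated: $bad"
```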
@ArielGlenn Do you see any more blockers to scheduling the json html, json plaintext and tmx plaintext dumps and making them available on dumps.wikimedia.org?
Change 301773 had a related patch set uploaded (by ArielGlenn):
add cron job for Content Translation dumps
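For reference, a weekly schedule could look roughly like the crontab entry below. This is only a sketch: the real schedule, user and invocation live in the puppet patch above, and the day/time here are made up.

```
# m  h  dom mon dow  command (run weekly, e.g. as the datasets user)
15   9  *   *   5    php5 /srv/mediawiki/multiversion/MWScript.php extensions/ContentTranslation/scripts/dump-corpora.php --wiki cawiki --split-at 500 --outputdir /mnt/data/temp/ --compression gzip --quiet
```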
Can you folks add a -q flag to your maintenance script so that it doesn't print out the names of the files when it runs properly? That should be the last thing I need; the standalone script seems to run OK.
The standard -q/--quiet flag implemented by the Maintenance class should work. Let me know if this is not the case.
That did the trick, thanks! Please have a look at //gerrit.wikimedia.org/r/301773 and see if there's anything there that you don't like. I expect to merge some version of it within the next couple of days.