
Create Content Translation Parallel Corpora dumps
Closed, Resolved (Public)

Description

We need to create Content Translation Parallel Corpora dumps (see: T122042) and make them available to the public at https://dumps.wikimedia.org/

  • Script to provide dumps: scripts/dump-corpora.php
  • Frequency: Weekly


Event Timeline


The patches for the blocking tasks (T133006, T133007) are merged, so the updated script is available in master for testing before it hits production tomorrow.

@ArielGlenn Did you test the updated script? It is in production now (though only on testwiki/group1 at the time of writing) and available for testing.

Not yet, but I have been getting the calling script ready for testing (and for production runs). Until recently we've done all these sorts of dumps using a variety of one-off scripts; I've started converting them so that they all use the same mechanism. A few additional changes were needed and have to be tested before the corpora dumps can run. Once that's good to go, I'll do a full run manually using that script and relay the results here (times, memory usage, etc.).

Thanks for the update, @ArielGlenn. Let us know of any changes needed in the ContentTranslation script.

akosiaris changed the task status from Open to Stalled. Jun 7 2016, 8:39 AM
akosiaris subscribed.

Stalling for a while.

I'm currently running on snapshot1007 the following:

php5 /srv/mediawiki/multiversion/MWScript.php extensions/ContentTranslation/scripts/dump-corpora.php --wiki cawiki --split-at 10000 --outputdir /mnt/data/temp/ --compression gzip

Waiting to see what issues we encounter and/or what output is produced.

It produced one file:

-rw-rw-r-- 1 datasets datasets 98796066 Jul 27 09:15 cx-corpora._2_.html.json.gz

Is this expected?

Run time: about ten minutes.

I don't know how good the compression ratio is; 100 MB compressed sounds a bit on the small side, but it could be correct. The surprise is that it created only one file: I would expect some language pairs to cross the 10000 threshold and get files of their own. If this is really so, we should consider making the threshold smaller.

Perhaps I should inspect the file manually to see if it is missing some content.
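For example, something along these lines should be enough to count how many translation units one file actually contains (a rough sketch for manual inspection only, not part of the dump script; it assumes each dump file is a single JSON array of translation units):

// Rough inspection sketch: count entries in one compressed dump file.
$json = file_get_contents( 'compress.zlib:///mnt/data/temp/cx-corpora._2_.html.json.gz' );
$data = json_decode( $json, true );
if ( is_array( $data ) ) {
	echo count( $data ) . " entries\n";
} else {
	echo "Could not decode JSON\n";
}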

I had thought that running on one wiki generates all files; is that not true? At any rate, let me know a host you have access to and I'll put a copy of the file there.

Verified that the file is visible on stat1003 in /mnt/data/temp.

After discussion on IRC, rerunning with:

date; php5 /srv/mediawiki/multiversion/MWScript.php extensions/ContentTranslation/scripts/dump-corpora.php --wiki cawiki --split-at 500 --outputdir /mnt/data/temp/ --compression gzip ; date

start date/time is: Wed Jul 27 12:20:42 UTC 2016

I'll also try to debug it myself with a bigger dataset, as I suspect the script might be using excessive memory.

This time it gave me a fatal error instead of silently exiting:

...

/mnt/data/temp/cx-corpora.es2ca.html.json.gz
/mnt/data/temp/cx-corpora.en2ca.html.json.gz
/mnt/data/temp/cx-corpora.fr2ca.html.json.gz
/mnt/data/temp/cx-corpora.no2nn.html.json.gz
/mnt/data/temp/cx-corpora.en2pa.html.json.gz
/mnt/data/temp/cx-corpora.en2nb.html.json.gz
/mnt/data/temp/cx-corpora.nn2nb.html.json.gz
/mnt/data/temp/cx-corpora.ru2uk.html.json.gz
/mnt/data/temp/cx-corpora.en2uk.html.json.gz
/mnt/data/temp/cx-corpora.en2fr.html.json.gz
/mnt/data/temp/cx-corpora.es2fr.html.json.gz
/mnt/data/temp/cx-corpora.en2vi.html.json.gz
/mnt/data/temp/cx-corpora.ru2ba.html.json.gz
/mnt/data/temp/cx-corpora.es2gl.html.json.gz
/mnt/data/temp/cx-corpora.es2ast.html.json.gz
/mnt/data/temp/cx-corpora.ru2kk.html.json.gz
/mnt/data/temp/cx-corpora.en2el.html.json.gz
/mnt/data/temp/cx-corpora.en2cs.html.json.gz
/mnt/data/temp/cx-corpora.en2sq.html.json.gz
/mnt/data/temp/cx-corpora.en2tr.html.json.gz
/mnt/data/temp/cx-corpora.en2tl.html.json.gz
/mnt/data/temp/cx-corpora.en2pl.html.json.gz
/mnt/data/temp/cx-corpora.en2sr.html.json.gz
/mnt/data/temp/cx-corpora.en2nl.html.json.gz
/mnt/data/temp/cx-corpora.en2ro.html.json.gz
/mnt/data/temp/cx-corpora.en2bn.html.json.gz
/mnt/data/temp/cx-corpora.en2th.html.json.gz
/mnt/data/temp/cx-corpora.en2ta.html.json.gz
/mnt/data/temp/cx-corpora.en2ru.html.json.gz
/mnt/data/temp/cx-corpora.uk2ru.html.json.gz
/mnt/data/temp/cx-corpora.en2de.html.json.gz
/mnt/data/temp/cx-corpora.en2ja.html.json.gz
/mnt/data/temp/cx-corpora.en2ko.html.json.gz
/mnt/data/temp/cx-corpora.en2it.html.json.gz
/mnt/data/temp/cx-corpora.en2he.html.json.gz
/mnt/data/temp/cx-corpora.en2zh.html.json.gz
/mnt/data/temp/cx-corpora.en2fa.html.json.gz
/mnt/data/temp/cx-corpora.en2ar.html.json.gz
/mnt/data/temp/cx-corpora.zh2hak.html.json.gz
/mnt/data/temp/cx-corpora._2hak.html.json.gz
/mnt/data/temp/cx-corpora.fr2en.html.json.gz
/mnt/data/temp/cx-corpora.es2en.html.json.gz
Fatal error: Out of memory (allocated 9417523200) (tried to allocate 18446744071930920519 bytes) in /srv/mediawiki/php-1.28.0-wmf.11/includes/json/FormatJson.php on line 152
Wed Jul 27 12:27:10 UTC 2016

Lines in question:

149                 if ( $pretty !== false ) {
150                         // Workaround for <https://bugs.php.net/bug.php?id=66021>
151                         if ( $bug66021 ) {
152                                 $json = preg_replace( self::WS_CLEANUP_REGEX, '', $json );
153                         }

Doing a preg_replace on a string many megabytes in size is probably not the most memory-efficient thing to do. According to https://github.com/php/php-src/commit/82a4f1a1a287d9dbf01156bc14ceb13ccbf16d7a it has been fixed since PHP 5.5.12 in the PHP 5.5.x branch. The script is being run with PHP 5.5.9-1ubuntu4.17.
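One way to avoid that code path would be to skip pretty-printing on affected PHP versions, assuming the script serialises the dump via FormatJson::encode(). A minimal sketch of the idea (illustrative only, not an actual patch; the version cut-off is simplified and $output stands in for the collected dump data):

// Sketch: only pretty-print the JSON on PHP versions where bug 66021 is
// fixed, so FormatJson never runs the large preg_replace workaround on
// the multi-megabyte dump string.
$pretty = version_compare( PHP_VERSION, '5.5.12', '>=' );
$json = FormatJson::encode( $output, $pretty, FormatJson::ALL_OK );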

After more conversation on IRC, trying:

date; php5 /srv/mediawiki/multiversion/MWScript.php extensions/ContentTranslation/scripts/dump-corpora.php --wiki cawiki --split-at 500 --outputdir /mnt/data/temp/ --compression gzip --format tmx; date

to see if it completes.

Ended after a bit over ten minutes with this:

Wed Jul 27 12:42:38 UTC 2016
TMX output format is only supported with plaintext

Wed Jul 27 12:53:21 UTC 2016

Running again as:

date; php5 /srv/mediawiki/multiversion/MWScript.php extensions/ContentTranslation/scripts/dump-corpora.php --wiki cawiki --split-at 500 --outputdir /mnt/data/temp/ --compression gzip --format tmx --plaintext ; date

Output:

Wed Jul 27 13:16:51 UTC 2016
/mnt/data/temp/cx-corpora.es2pt.text.tmx.gz
/mnt/data/temp/cx-corpora.en2pt.text.tmx.gz
/mnt/data/temp/cx-corpora.en2id.text.tmx.gz
/mnt/data/temp/cx-corpora.ca2es.text.tmx.gz
/mnt/data/temp/cx-corpora.en2es.text.tmx.gz
/mnt/data/temp/cx-corpora.pt2es.text.tmx.gz
/mnt/data/temp/cx-corpora.fr2es.text.tmx.gz
/mnt/data/temp/cx-corpora.es2ca.text.tmx.gz
/mnt/data/temp/cx-corpora.en2ca.text.tmx.gz
/mnt/data/temp/cx-corpora.fr2ca.text.tmx.gz
/mnt/data/temp/cx-corpora.no2nn.text.tmx.gz
/mnt/data/temp/cx-corpora.en2pa.text.tmx.gz
/mnt/data/temp/cx-corpora.en2nb.text.tmx.gz
/mnt/data/temp/cx-corpora.nn2nb.text.tmx.gz
/mnt/data/temp/cx-corpora.ru2uk.text.tmx.gz
/mnt/data/temp/cx-corpora.en2uk.text.tmx.gz
/mnt/data/temp/cx-corpora.en2fr.text.tmx.gz
/mnt/data/temp/cx-corpora.es2fr.text.tmx.gz
/mnt/data/temp/cx-corpora.en2vi.text.tmx.gz
/mnt/data/temp/cx-corpora.ru2ba.text.tmx.gz
/mnt/data/temp/cx-corpora.es2gl.text.tmx.gz
/mnt/data/temp/cx-corpora.es2ast.text.tmx.gz
/mnt/data/temp/cx-corpora.ru2kk.text.tmx.gz
/mnt/data/temp/cx-corpora.en2el.text.tmx.gz
/mnt/data/temp/cx-corpora.en2cs.text.tmx.gz
/mnt/data/temp/cx-corpora.en2sq.text.tmx.gz
/mnt/data/temp/cx-corpora.en2tr.text.tmx.gz
/mnt/data/temp/cx-corpora.en2tl.text.tmx.gz
/mnt/data/temp/cx-corpora.en2pl.text.tmx.gz
/mnt/data/temp/cx-corpora.en2sr.text.tmx.gz
/mnt/data/temp/cx-corpora.en2nl.text.tmx.gz
/mnt/data/temp/cx-corpora.en2ro.text.tmx.gz
/mnt/data/temp/cx-corpora.en2bn.text.tmx.gz
/mnt/data/temp/cx-corpora.en2th.text.tmx.gz
/mnt/data/temp/cx-corpora.en2ta.text.tmx.gz
/mnt/data/temp/cx-corpora.en2ru.text.tmx.gz
/mnt/data/temp/cx-corpora.uk2ru.text.tmx.gz
/mnt/data/temp/cx-corpora.en2de.text.tmx.gz
/mnt/data/temp/cx-corpora.en2ja.text.tmx.gz
/mnt/data/temp/cx-corpora.en2ko.text.tmx.gz
/mnt/data/temp/cx-corpora.en2it.text.tmx.gz
/mnt/data/temp/cx-corpora.en2he.text.tmx.gz
/mnt/data/temp/cx-corpora.en2zh.text.tmx.gz
/mnt/data/temp/cx-corpora.en2fa.text.tmx.gz
/mnt/data/temp/cx-corpora.en2ar.text.tmx.gz
/mnt/data/temp/cx-corpora.zh2hak.text.tmx.gz
/mnt/data/temp/cx-corpora._2hak.text.tmx.gz
/mnt/data/temp/cx-corpora.fr2en.text.tmx.gz
/mnt/data/temp/cx-corpora.es2en.text.tmx.gz
/mnt/data/temp/cx-corpora._2_.text.tmx.gz
Wed Jul 27 13:36:00 UTC 2016

Two questions: what is the last file? And, are all the contents present in the dump?

Change 301548 had a related patch set uploaded (by Nikerabbit):
DumpCorpora: Skip JSON formatting for old PHP versions

https://gerrit.wikimedia.org/r/301548

The last file contains "everything else that did not get its own file already". I am checking to see if all expected content is in these files.
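In other words, language pairs with fewer drafts than the --split-at threshold are folded into the catch-all cx-corpora._2_ file instead of getting a file of their own. Roughly (a hypothetical illustration of the bucketing only, with made-up variable names, not the actual dump-corpora.php code):

// Hypothetical illustration of the --split-at bucketing.
$bucket = ( $draftCount >= $splitAt )
	? "{$sourceLanguage}2{$targetLanguage}"
	: '_2_';
$filename = "cx-corpora.$bucket.$suffix"; // e.g. cx-corpora._2_.html.json.gz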

Nikerabbit changed the task status from Stalled to Open. Jul 28 2016, 7:16 AM

Change 301548 merged by jenkins-bot:
DumpCorpora: Skip JSON formatting for old PHP versions

https://gerrit.wikimedia.org/r/301548

Change 301601 had a related patch set uploaded (by KartikMistry):
DumpCorpora: Skip JSON formatting for old PHP versions

https://gerrit.wikimedia.org/r/301601

Change 301602 had a related patch set uploaded (by KartikMistry):
DumpCorpora: Skip JSON formatting for old PHP versions

https://gerrit.wikimedia.org/r/301602

Change 301602 merged by jenkins-bot:
DumpCorpora: Skip JSON formatting for old PHP versions

https://gerrit.wikimedia.org/r/301602

Change 301601 merged by jenkins-bot:
DumpCorpora: Skip JSON formatting for old PHP versions

https://gerrit.wikimedia.org/r/301601

datasets@snapshot1007:~$ date; php5 /srv/mediawiki/multiversion/MWScript.php extensions/ContentTranslation/scripts/dump-corpora.php --wiki cawiki --split-at 500 --outputdir /mnt/data/temp/ --compression gzip ; date
Thu Jul 28 15:27:53 UTC 2016
/mnt/data/temp/cx-corpora.es2pt.html.json.gz
/mnt/data/temp/cx-corpora.en2pt.html.json.gz
/mnt/data/temp/cx-corpora.en2id.html.json.gz
/mnt/data/temp/cx-corpora.ca2es.html.json.gz
/mnt/data/temp/cx-corpora.en2es.html.json.gz
/mnt/data/temp/cx-corpora.pt2es.html.json.gz
/mnt/data/temp/cx-corpora.fr2es.html.json.gz
/mnt/data/temp/cx-corpora.es2ca.html.json.gz
/mnt/data/temp/cx-corpora.en2ca.html.json.gz
/mnt/data/temp/cx-corpora.fr2ca.html.json.gz
/mnt/data/temp/cx-corpora.no2nn.html.json.gz
/mnt/data/temp/cx-corpora.en2pa.html.json.gz
/mnt/data/temp/cx-corpora.en2nb.html.json.gz
/mnt/data/temp/cx-corpora.nn2nb.html.json.gz
/mnt/data/temp/cx-corpora.ru2uk.html.json.gz
/mnt/data/temp/cx-corpora.en2uk.html.json.gz
/mnt/data/temp/cx-corpora.en2fr.html.json.gz
/mnt/data/temp/cx-corpora.es2fr.html.json.gz
/mnt/data/temp/cx-corpora.en2vi.html.json.gz
/mnt/data/temp/cx-corpora.ru2ba.html.json.gz
/mnt/data/temp/cx-corpora.es2gl.html.json.gz
/mnt/data/temp/cx-corpora.es2ast.html.json.gz
/mnt/data/temp/cx-corpora.ru2kk.html.json.gz
/mnt/data/temp/cx-corpora.en2el.html.json.gz
/mnt/data/temp/cx-corpora.en2cs.html.json.gz
/mnt/data/temp/cx-corpora.en2sq.html.json.gz
/mnt/data/temp/cx-corpora.en2tr.html.json.gz
/mnt/data/temp/cx-corpora.en2tl.html.json.gz
/mnt/data/temp/cx-corpora.en2pl.html.json.gz
/mnt/data/temp/cx-corpora.en2sr.html.json.gz
/mnt/data/temp/cx-corpora.en2nl.html.json.gz
/mnt/data/temp/cx-corpora.en2ro.html.json.gz
/mnt/data/temp/cx-corpora.en2bn.html.json.gz
/mnt/data/temp/cx-corpora.en2th.html.json.gz
/mnt/data/temp/cx-corpora.en2ta.html.json.gz
/mnt/data/temp/cx-corpora.en2ru.html.json.gz
/mnt/data/temp/cx-corpora.uk2ru.html.json.gz
/mnt/data/temp/cx-corpora.en2de.html.json.gz
/mnt/data/temp/cx-corpora.en2ja.html.json.gz
/mnt/data/temp/cx-corpora.en2ko.html.json.gz
/mnt/data/temp/cx-corpora.en2it.html.json.gz
/mnt/data/temp/cx-corpora.en2he.html.json.gz
/mnt/data/temp/cx-corpora.en2zh.html.json.gz
/mnt/data/temp/cx-corpora.en2fa.html.json.gz
/mnt/data/temp/cx-corpora.en2ar.html.json.gz
/mnt/data/temp/cx-corpora.zh2hak.html.json.gz
/mnt/data/temp/cx-corpora._2hak.html.json.gz
/mnt/data/temp/cx-corpora.fr2en.html.json.gz
/mnt/data/temp/cx-corpora.es2en.html.json.gz
/mnt/data/temp/cx-corpora._2_.html.json.gz
Thu Jul 28 15:35:00 UTC 2016
datasets@snapshot1007:~$

@Nikerabbit Can you verify that all the contents are there? Same place as always: /mnt/data/temp

Number of unique drafts in the dump files: 51032 [1]
Number of published drafts in the database: 52921 [2]

About 2000 new translations are published per week, so that does not quite explain the difference. My theory is that some of the drafts are actually empty (i.e. no user content at all, only machine translation or similar), which we filter out:

// Some general cleanup
foreach ( $translation['corpora'] as $id => $unit ) {
	if ( !isset( $unit['user'] ) ) {
		unset( $translation['corpora'][$id] );
		continue;
	}
	// ... (rest of the loop omitted)
}

The same check for en2de gives 304 and 306; the two missing ones are new, given that they have bigger translation_ids.

In addition, [3] indicates that no file was truncated before compression. Thus, as far as I can see, these files are valid.

[1] zgrep -E '"id":"([^"]+)"' -o cx-corpora.*2*.html.json.gz | grep -E -o '[0-9]+/' | sort | uniq | wc -l
[2] select count(*) from cx_translations, cx_corpora where (translation_status = 'published' or translation_target_url is not null) and translation_id = cxc_translation_id group by translation_id;
[3] zgrep -E -L '\}\}\]$' *.html.json*

@ArielGlenn Do you see any more blockers to scheduling the JSON HTML, JSON plaintext, and TMX plaintext dumps and making them available on dumps.wikimedia.org?

No, this looks great. I'll get going on that shortly.

Change 301773 had a related patch set uploaded (by ArielGlenn):
add cron job for Content Translation dumps

https://gerrit.wikimedia.org/r/301773

Can you folks add a -q flag to your maintenance script so it doesn't print out the names of the files when it runs properly? That should be the last thing I need; the standalone script seems to run OK.

The standard -q/--quiet flag implemented by the Maintenance class should work. Let me know if this is not the case.
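For reference, a minimal sketch of how that works in a Maintenance script (class and variable names here are illustrative; the real dump-corpora.php may differ): anything printed via $this->output() is suppressed by the Maintenance base class when -q/--quiet is passed, so the script needs no extra flag handling of its own.

class DumpCorpora extends Maintenance {
	public function execute() {
		// ... write the dump file to $path ...
		// Printed via output(), so the standard -q/--quiet flag
		// provided by the Maintenance base class silences it.
		$this->output( "$path\n" );
	}
}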

ohhh nice, let me try that right now.

That did the trick, thanks! Please have a look at https://gerrit.wikimedia.org/r/301773 and see if there's anything there that you don't like. I expect to merge some version of it within the next couple of days.

Change 301773 merged by ArielGlenn:
add cron job for Content Translation dumps

https://gerrit.wikimedia.org/r/301773

Merged. Set to run on Friday morning. I'll be watching.

Forgot to say it, oops! The run completed and looks OK, so I'm closing this ticket.