We need to create Content Translation Parallel Corpora dumps (see T122042) and make them available to the public at https://dumps.wikimedia.org/
- Script to provide dumps: scripts/dump-corpora.php
- Frequency: Weekly
Status | Assigned | Task
---|---|---
Resolved | santhosh | T95886: For ContentTranslation MT, store information about source content, machine-translated content and user-edited content
Resolved | santhosh | T111905: Design the technical infrastructure for parallel corpora storage and api (tracking)
Resolved | ArielGlenn | T127793: Create Content Translation Parallel Corpora dumps
Resolved | Nikerabbit | T133006: Add compression support to scripts/dump-corpora.php
Resolved | Nikerabbit | T133007: Add --output option to scripts/dump-corpora.php
@ArielGlenn Did you test the updated script? It is in production now (although only on testwiki/group1 at the time of writing) and available for testing.
Not yet, but I have been getting the calling script ready for testing (and for production runs). Until recently we did all these sorts of dumps using a variety of one-off scripts; I've started converting them so that they all use the same mechanism. A few additional changes were needed, though, and have to be tested before the corpora dumps can run. Once that's good to go, I'll do a full run manually using that script and relay the results here (times, memory usage, etc.).
Thanks for the update, @ArielGlenn. Let us know about any changes needed in the ContentTranslation script.
I'm currently running on snapshot1007 the following:
php5 /srv/mediawiki/multiversion/MWScript.php extensions/ContentTranslation/scripts/dump-corpora.php --wiki cawiki --split-at 10000 --outputdir /mnt/data/temp/ --compression gzip
Waiting to see what issues we encounter and/or what output is produced.
It produced one file:
-rw-rw-r-- 1 datasets datasets 98796066 Jul 27 09:15 cx-corpora._2_.html.json.gz
Is this expected?
Run time: about ten minutes.
I don't know how good the compression ratio is, but 100 MB compressed sounds a bit on the small side; it could still be correct, though. The surprise is that it created only one file: I would expect some language pairs to cross the 10000 threshold and get files of their own. If that is really the case, we should consider lowering the threshold.
Perhaps I should inspect the file manually to see if it is missing some content.
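For a quick sanity check, `gzip -l` reports compressed and uncompressed sizes (and the ratio) without decompressing, and counting `"id"` fields gives a rough record count. A sketch along those lines, using the file from the run above:

```shell
# Compressed size, uncompressed size and ratio, read from the gzip
# trailer (fast even for large files).
gzip -l /mnt/data/temp/cx-corpora._2_.html.json.gz

# Rough record count: occurrences of the "id" field, counted with -o
# because the JSON may well sit on a single line.
zgrep -o '"id":' /mnt/data/temp/cx-corpora._2_.html.json.gz | wc -l
```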
I had thought that running on one wiki generates all files; is that not true? At any rate, let me know a host where you have access and I'll put a copy of the file there.
After discussion on IRC, rerunning with:
date; php5 /srv/mediawiki/multiversion/MWScript.php extensions/ContentTranslation/scripts/dump-corpora.php --wiki cawiki --split-at 500 --outputdir /mnt/data/temp/ --compression gzip ; date
start date/time is: Wed Jul 27 12:20:42 UTC 2016
I'll also try to debug this myself with a bigger dataset, as I suspect the script might be using excessive memory.
This time it gave me a fatal error instead of silently exiting:
...
/mnt/data/tempcx-corpora.es2ca.html.json.gz
/mnt/data/tempcx-corpora.en2ca.html.json.gz
/mnt/data/tempcx-corpora.fr2ca.html.json.gz
/mnt/data/tempcx-corpora.no2nn.html.json.gz
/mnt/data/tempcx-corpora.en2pa.html.json.gz
/mnt/data/tempcx-corpora.en2nb.html.json.gz
/mnt/data/tempcx-corpora.nn2nb.html.json.gz
/mnt/data/tempcx-corpora.ru2uk.html.json.gz
/mnt/data/tempcx-corpora.en2uk.html.json.gz
/mnt/data/tempcx-corpora.en2fr.html.json.gz
/mnt/data/tempcx-corpora.es2fr.html.json.gz
/mnt/data/tempcx-corpora.en2vi.html.json.gz
/mnt/data/tempcx-corpora.ru2ba.html.json.gz
/mnt/data/tempcx-corpora.es2gl.html.json.gz
/mnt/data/tempcx-corpora.es2ast.html.json.gz
/mnt/data/tempcx-corpora.ru2kk.html.json.gz
/mnt/data/tempcx-corpora.en2el.html.json.gz
/mnt/data/tempcx-corpora.en2cs.html.json.gz
/mnt/data/tempcx-corpora.en2sq.html.json.gz
/mnt/data/tempcx-corpora.en2tr.html.json.gz
/mnt/data/tempcx-corpora.en2tl.html.json.gz
/mnt/data/tempcx-corpora.en2pl.html.json.gz
/mnt/data/tempcx-corpora.en2sr.html.json.gz
/mnt/data/tempcx-corpora.en2nl.html.json.gz
/mnt/data/tempcx-corpora.en2ro.html.json.gz
/mnt/data/tempcx-corpora.en2bn.html.json.gz
/mnt/data/tempcx-corpora.en2th.html.json.gz
/mnt/data/tempcx-corpora.en2ta.html.json.gz
/mnt/data/tempcx-corpora.en2ru.html.json.gz
/mnt/data/tempcx-corpora.uk2ru.html.json.gz
/mnt/data/tempcx-corpora.en2de.html.json.gz
/mnt/data/tempcx-corpora.en2ja.html.json.gz
/mnt/data/tempcx-corpora.en2ko.html.json.gz
/mnt/data/tempcx-corpora.en2it.html.json.gz
/mnt/data/tempcx-corpora.en2he.html.json.gz
/mnt/data/tempcx-corpora.en2zh.html.json.gz
/mnt/data/tempcx-corpora.en2fa.html.json.gz
/mnt/data/tempcx-corpora.en2ar.html.json.gz
/mnt/data/tempcx-corpora.zh2hak.html.json.gz
/mnt/data/tempcx-corpora._2hak.html.json.gz
/mnt/data/tempcx-corpora.fr2en.html.json.gz
/mnt/data/tempcx-corpora.es2en.html.json.gz
Fatal error: Out of memory (allocated 9417523200) (tried to allocate 18446744071930920519 bytes) in /srv/mediawiki/php-1.28.0-wmf.11/includes/json/FormatJson.php on line 152
Wed Jul 27 12:27:10 UTC 2016
Lines in question:
```
149		if ( $pretty !== false ) {
150			// Workaround for <https://bugs.php.net/bug.php?id=66021>
151			if ( $bug66021 ) {
152				$json = preg_replace( self::WS_CLEANUP_REGEX, '', $json );
153			}
```
Doing a preg_replace on a string many megabytes in size is probably not the most memory-efficient approach. According to https://github.com/php/php-src/commit/82a4f1a1a287d9dbf01156bc14ceb13ccbf16d7a, the underlying bug has been fixed since PHP 5.5.12 in the 5.5.x branch, but the script is being run with PHP 5.5.9-1ubuntu4.17.
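Since the fix only landed in 5.5.12, a quick shell check along these lines can tell whether the running interpreter predates it (`version_lt` is a hypothetical helper, and the fallback version string is just for illustration when php5 is not on PATH):

```shell
# Compare two version strings using GNU sort's version ordering (-V).
# Succeeds when $1 is strictly older than $2.
version_lt() {
	[ "$1" != "$2" ] &&
		[ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# Ask the interpreter for its version; fall back to a placeholder if
# php5 is unavailable (e.g. when trying this snippet elsewhere).
PHP_VER=$( { php5 -r 'echo PHP_VERSION;'; } 2>/dev/null || echo '5.5.9-1ubuntu4.17' )

if version_lt "${PHP_VER%%-*}" 5.5.12; then
	echo "PHP $PHP_VER predates the bug 66021 fix; skip pretty-printing"
fi
```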
After more conversation on IRC, trying:
date; php5 /srv/mediawiki/multiversion/MWScript.php extensions/ContentTranslation/scripts/dump-corpora.php --wiki cawiki --split-at 500 --outputdir /mnt/data/temp/ --compression gzip --format tmx; date
to see if it completes.
Ended after a bit over ten minutes with this:
Wed Jul 27 12:42:38 UTC 2016
TMX output format is only supported with plaintext
Wed Jul 27 12:53:21 UTC 2016
Running again as:
date; php5 /srv/mediawiki/multiversion/MWScript.php extensions/ContentTranslation/scripts/dump-corpora.php --wiki cawiki --split-at 500 --outputdir /mnt/data/temp/ --compression gzip --format tmx --plaintext ; date
Output:
Wed Jul 27 13:16:51 UTC 2016
/mnt/data/tempcx-corpora.es2pt.text.tmx.gz
/mnt/data/tempcx-corpora.en2pt.text.tmx.gz
/mnt/data/tempcx-corpora.en2id.text.tmx.gz
/mnt/data/tempcx-corpora.ca2es.text.tmx.gz
/mnt/data/tempcx-corpora.en2es.text.tmx.gz
/mnt/data/tempcx-corpora.pt2es.text.tmx.gz
/mnt/data/tempcx-corpora.fr2es.text.tmx.gz
/mnt/data/tempcx-corpora.es2ca.text.tmx.gz
/mnt/data/tempcx-corpora.en2ca.text.tmx.gz
/mnt/data/tempcx-corpora.fr2ca.text.tmx.gz
/mnt/data/tempcx-corpora.no2nn.text.tmx.gz
/mnt/data/tempcx-corpora.en2pa.text.tmx.gz
/mnt/data/tempcx-corpora.en2nb.text.tmx.gz
/mnt/data/tempcx-corpora.nn2nb.text.tmx.gz
/mnt/data/tempcx-corpora.ru2uk.text.tmx.gz
/mnt/data/tempcx-corpora.en2uk.text.tmx.gz
/mnt/data/tempcx-corpora.en2fr.text.tmx.gz
/mnt/data/tempcx-corpora.es2fr.text.tmx.gz
/mnt/data/tempcx-corpora.en2vi.text.tmx.gz
/mnt/data/tempcx-corpora.ru2ba.text.tmx.gz
/mnt/data/tempcx-corpora.es2gl.text.tmx.gz
/mnt/data/tempcx-corpora.es2ast.text.tmx.gz
/mnt/data/tempcx-corpora.ru2kk.text.tmx.gz
/mnt/data/tempcx-corpora.en2el.text.tmx.gz
/mnt/data/tempcx-corpora.en2cs.text.tmx.gz
/mnt/data/tempcx-corpora.en2sq.text.tmx.gz
/mnt/data/tempcx-corpora.en2tr.text.tmx.gz
/mnt/data/tempcx-corpora.en2tl.text.tmx.gz
/mnt/data/tempcx-corpora.en2pl.text.tmx.gz
/mnt/data/tempcx-corpora.en2sr.text.tmx.gz
/mnt/data/tempcx-corpora.en2nl.text.tmx.gz
/mnt/data/tempcx-corpora.en2ro.text.tmx.gz
/mnt/data/tempcx-corpora.en2bn.text.tmx.gz
/mnt/data/tempcx-corpora.en2th.text.tmx.gz
/mnt/data/tempcx-corpora.en2ta.text.tmx.gz
/mnt/data/tempcx-corpora.en2ru.text.tmx.gz
/mnt/data/tempcx-corpora.uk2ru.text.tmx.gz
/mnt/data/tempcx-corpora.en2de.text.tmx.gz
/mnt/data/tempcx-corpora.en2ja.text.tmx.gz
/mnt/data/tempcx-corpora.en2ko.text.tmx.gz
/mnt/data/tempcx-corpora.en2it.text.tmx.gz
/mnt/data/tempcx-corpora.en2he.text.tmx.gz
/mnt/data/tempcx-corpora.en2zh.text.tmx.gz
/mnt/data/tempcx-corpora.en2fa.text.tmx.gz
/mnt/data/tempcx-corpora.en2ar.text.tmx.gz
/mnt/data/tempcx-corpora.zh2hak.text.tmx.gz
/mnt/data/tempcx-corpora._2hak.text.tmx.gz
/mnt/data/tempcx-corpora.fr2en.text.tmx.gz
/mnt/data/tempcx-corpora.es2en.text.tmx.gz
/mnt/data/tempcx-corpora._2_.text.tmx.gz
Wed Jul 27 13:36:00 UTC 2016
Two questions: what is the last file? And, are all the contents present in the dump?
Change 301548 had a related patch set uploaded (by Nikerabbit):
DumpCorpora: Skip JSON formatting for old PHP versions
The last file contains everything else that did not get a file of its own. I am checking to see whether all expected content is in these files.
Change 301548 merged by jenkins-bot:
DumpCorpora: Skip JSON formatting for old PHP versions
Change 301601 had a related patch set uploaded (by KartikMistry):
DumpCorpora: Skip JSON formatting for old PHP versions
Change 301602 had a related patch set uploaded (by KartikMistry):
DumpCorpora: Skip JSON formatting for old PHP versions
Change 301602 merged by jenkins-bot:
DumpCorpora: Skip JSON formatting for old PHP versions
Change 301601 merged by jenkins-bot:
DumpCorpora: Skip JSON formatting for old PHP versions
datasets@snapshot1007:~$ date; php5 /srv/mediawiki/multiversion/MWScript.php extensions/ContentTranslation/scripts/dump-corpora.php --wiki cawiki --split-at 500 --outputdir /mnt/data/temp/ --compression gzip ; date
Thu Jul 28 15:27:53 UTC 2016
/mnt/data/tempcx-corpora.es2pt.html.json.gz
/mnt/data/tempcx-corpora.en2pt.html.json.gz
/mnt/data/tempcx-corpora.en2id.html.json.gz
/mnt/data/tempcx-corpora.ca2es.html.json.gz
/mnt/data/tempcx-corpora.en2es.html.json.gz
/mnt/data/tempcx-corpora.pt2es.html.json.gz
/mnt/data/tempcx-corpora.fr2es.html.json.gz
/mnt/data/tempcx-corpora.es2ca.html.json.gz
/mnt/data/tempcx-corpora.en2ca.html.json.gz
/mnt/data/tempcx-corpora.fr2ca.html.json.gz
/mnt/data/tempcx-corpora.no2nn.html.json.gz
/mnt/data/tempcx-corpora.en2pa.html.json.gz
/mnt/data/tempcx-corpora.en2nb.html.json.gz
/mnt/data/tempcx-corpora.nn2nb.html.json.gz
/mnt/data/tempcx-corpora.ru2uk.html.json.gz
/mnt/data/tempcx-corpora.en2uk.html.json.gz
/mnt/data/tempcx-corpora.en2fr.html.json.gz
/mnt/data/tempcx-corpora.es2fr.html.json.gz
/mnt/data/tempcx-corpora.en2vi.html.json.gz
/mnt/data/tempcx-corpora.ru2ba.html.json.gz
/mnt/data/tempcx-corpora.es2gl.html.json.gz
/mnt/data/tempcx-corpora.es2ast.html.json.gz
/mnt/data/tempcx-corpora.ru2kk.html.json.gz
/mnt/data/tempcx-corpora.en2el.html.json.gz
/mnt/data/tempcx-corpora.en2cs.html.json.gz
/mnt/data/tempcx-corpora.en2sq.html.json.gz
/mnt/data/tempcx-corpora.en2tr.html.json.gz
/mnt/data/tempcx-corpora.en2tl.html.json.gz
/mnt/data/tempcx-corpora.en2pl.html.json.gz
/mnt/data/tempcx-corpora.en2sr.html.json.gz
/mnt/data/tempcx-corpora.en2nl.html.json.gz
/mnt/data/tempcx-corpora.en2ro.html.json.gz
/mnt/data/tempcx-corpora.en2bn.html.json.gz
/mnt/data/tempcx-corpora.en2th.html.json.gz
/mnt/data/tempcx-corpora.en2ta.html.json.gz
/mnt/data/tempcx-corpora.en2ru.html.json.gz
/mnt/data/tempcx-corpora.uk2ru.html.json.gz
/mnt/data/tempcx-corpora.en2de.html.json.gz
/mnt/data/tempcx-corpora.en2ja.html.json.gz
/mnt/data/tempcx-corpora.en2ko.html.json.gz
/mnt/data/tempcx-corpora.en2it.html.json.gz
/mnt/data/tempcx-corpora.en2he.html.json.gz
/mnt/data/tempcx-corpora.en2zh.html.json.gz
/mnt/data/tempcx-corpora.en2fa.html.json.gz
/mnt/data/tempcx-corpora.en2ar.html.json.gz
/mnt/data/tempcx-corpora.zh2hak.html.json.gz
/mnt/data/tempcx-corpora._2hak.html.json.gz
/mnt/data/tempcx-corpora.fr2en.html.json.gz
/mnt/data/tempcx-corpora.es2en.html.json.gz
/mnt/data/tempcx-corpora._2_.html.json.gz
Thu Jul 28 15:35:00 UTC 2016
datasets@snapshot1007:~$
@Nikerabbit can you verify that all the contents are there? Same place as always, /mnt/data/temp.
Number of unique drafts in the dump files: 51032 [1]
Number of published drafts in the database: 52921 [2]
About 2000 new items are published per week, so that does not quite explain the difference. My theory is that some of the drafts are actually empty (i.e. they contain no user content at all, only machine translation or similar), which we filter out:
```
// Some general cleanup
foreach ( $translation['corpora'] as $id => $unit ) {
	if ( !isset( $unit['user'] ) ) {
		unset( $translation['corpora'][$id] );
		continue;
	}
```
The same check for en2de gives 304 and 306; the two missing drafts are new, given that they have larger translation_id values.
In addition, [3] indicates that no file was truncated before compression. So as far as I can see, these files are valid.
[1] zgrep -E '"id":"([^"]+)"' -o cx-corpora.*2*.html.json.gz | grep -E -o '[0-9]+/' | sort | uniq | wc -l
[2] select count(*) from cx_translations, cx_corpora where (translation_status = 'published' or translation_target_url is not null) and translation_id = cxc_translation_id group by translation_id;
[3] zgrep -E -L '\}\}\]$' *.html.json*
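The truncation check in [3] relies on every intact file ending in `}}]` before compression. Here is a minimal reproduction of how `zgrep -L` singles out a damaged file (the file names and contents below are made up):

```shell
cd "$(mktemp -d)"

# One complete fragment and one cut off mid-record.
printf '[{"id":"1/a","source":{"content":"x"}}]' | gzip > ok.html.json.gz
printf '[{"id":"1/a","source":{"con' | gzip > truncated.html.json.gz

# -L lists files with NO line matching the end-of-document pattern,
# so only the damaged file is reported.
bad=$(zgrep -E -L '\}\}\]$' *.html.json.gz) || true
echo "possibly truncated: $bad"
```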
@ArielGlenn Do you see any more blockers to scheduling the json html, json plaintext and tmx plaintext dumps and making them available on dumps.wikimedia.org?
Change 301773 had a related patch set uploaded (by ArielGlenn):
add cron job for Content Translation dumps
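For reference, a weekly schedule could look roughly like the crontab entry below. This is only a sketch: the real schedule, user and invocation live in the puppet patch above, and the day/time here are made up.

```
# m  h  dom mon dow  command (run weekly, e.g. as the datasets user)
15   9  *   *   5    php5 /srv/mediawiki/multiversion/MWScript.php extensions/ContentTranslation/scripts/dump-corpora.php --wiki cawiki --split-at 500 --outputdir /mnt/data/temp/ --compression gzip --quiet
```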
Can you folks add a -q flag to your maintenance script so that it doesn't print out the names of the files when it runs properly? That should be the last thing I need; the standalone script seems to run OK.
The standard -q/--quiet flag implemented by the Maintenance class should work. Let me know if this is not the case.
That did the trick, thanks! Please have a look at //gerrit.wikimedia.org/r/301773 and see if there's anything there that you don't like. I expect to merge some version of it within the next couple of days.