
Content translation dumps failing due to excessive memory usage
Closed, Resolved · Public

Description

I just received cron mail:

Warning: rename(/tmp/conf-1.31.0-wmf.7-enwiki8QEKIz,/tmp/mw-cache-1.31.0-wmf.7/conf-enwiki): Permission denied in /srv/mediawiki/wmf-config/CommonSettings.php on line 220
Warning: proc_open(): fork failed - Cannot allocate memory in /srv/mediawiki/php-1.31.0-wmf.7/includes/export/DumpPipeOutput.php on line 68
Warning: fputs() expects parameter 1 to be resource, null given in /srv/mediawiki/php-1.31.0-wmf.7/includes/export/DumpFileOutput.php on line 55
Warning: proc_open(): fork failed - Cannot allocate memory in /srv/mediawiki/php-1.31.0-wmf.7/includes/export/DumpPipeOutput.php on line 68

Fatal error: Out of memory (allocated 32825933824) (tried to allocate 18446744072219475148 bytes) in /srv/mediawiki/php-1.31.0-wmf.7/vendor/monolog/monolog/src/Monolog/Processor/PsrLogMessageProcessor.php on line 44

Warning: rename(/tmp/conf-1.31.0-wmf.7-enwikijknz1n,/tmp/mw-cache-1.31.0-wmf.7/conf-enwiki): Permission denied in /srv/mediawiki/wmf-config/CommonSettings.php on line 220
Warning: rename(/tmp/conf-1.31.0-wmf.7-enwikiYKstnc,/tmp/mw-cache-1.31.0-wmf.7/conf-enwiki): Permission denied in /srv/mediawiki/wmf-config/CommonSettings.php on line 220

The warnings can be ignored; the fatal error cannot.

Event Timeline

Was new code deployed since the previous cron run? This job still runs on the dataset1001 host as the same old user, so nothing's changed as far as that goes.

Tagging @Nikerabbit, who helped set up the cron job; if you're the wrong person, feel free to remove yourself and point me at someone more appropriate.

There have been no recent changes to the code. 32825933824 bytes is about 30 GB, and the latter number is just huge (suspiciously close to 2^64, so it looks like a negative value reinterpreted as unsigned). It might just be regular growth exposing a memory inefficiency that has been present since the beginning.

Can we split the job into separate pieces that use less memory?

I have to check how it was done. I suppose serializing everything into one big JSON (or XML) document is the problem. We could use a streaming approach and/or split the output into multiple files.
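
As an illustration only (this is not the actual dump-corpora.php code, and the function and variable names are made up), a streaming approach could encode each record separately and write it straight to a gzipped output, so only one record is ever held in memory:

<?php
// Hypothetical sketch: write records into a gzipped JSON array one at a time,
// instead of building one huge array and calling json_encode() on all of it.
function dumpCorporaStreaming( iterable $records, string $path ): void {
	$out = gzopen( $path, 'wb' );
	gzwrite( $out, "[\n" );
	$first = true;
	foreach ( $records as $record ) {
		if ( !$first ) {
			gzwrite( $out, ",\n" );
		}
		// Only the current record is encoded, so memory usage stays flat.
		gzwrite( $out, json_encode( $record ) );
		$first = false;
	}
	gzwrite( $out, "\n]\n" );
	gzclose( $out );
}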

@ArielGlenn Is it possible to check which of the export commands this is? For example, if it is the XML version, then maybe I shouldn't spend time looking for streaming JSON encoders.

The cron job produces only the output pasted above. I can tell you that the last file updated in the contenttranslation directory is cx-corpora._2_.text.tmx.gz

Also, here's a count of the file types in there, in case you can see what's missing:

$ ls *gz | grep html.json | wc -l
76
$ ls *gz | grep text.json | wc -l
78
$ ls *gz | grep text.tmx | wc -l
78

$ ls *gz | egrep -v '(html.json|text.tmx|text.json)' | wc -l
0

Maybe I should add some output to the script.

So far my guess is that cx-corpora._2_.html.json.gz and one other file are missing. This makes sense at least because those are the largest files:

https://dumps.wikimedia.org/other/contenttranslation/20171027/

126490349 27-Oct-2017 10:03 cx-corpora.en2fr.html.json.gz
157617537 27-Oct-2017 10:01 cx-corpora.en2es.html.json.gz
310747808 27-Oct-2017 10:10 cx-corpora._2_.html.json.gz

This indeed seems to be an issue with json_encode. One streaming encoder library I found is https://github.com/violet-php/streaming-json-encoder
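
Going by that library's README, usage would be roughly as follows; the class name and iterator behaviour are taken from its documentation, and $recordGenerator is a made-up placeholder, so treat this as a sketch rather than verified code:

use Violet\StreamingJsonEncoder\BufferJsonEncoder;

// The encoder yields the JSON output in small chunks, so the complete
// document never has to exist in memory at once.
$encoder = new BufferJsonEncoder( $recordGenerator );
$out = gzopen( 'cx-corpora.json.gz', 'wb' );
foreach ( $encoder as $chunk ) {
	gzwrite( $out, $chunk );
}
gzclose( $out );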

Do you happen to know if anyone else is producing JSON-formatted dumps and whether they are using anything other than json_encode?

The Wikidata folks dump JSON, but they seem to use json_encode as well, as far as I can see from Wikibase/repo/includes/Dumpers/JsonDumpGenerator.php (https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/master/repo/includes/Dumpers/JsonDumpGenerator.php).

Change 391001 had a related patch set uploaded (by Nikerabbit; owner: Nikerabbit):
[mediawiki/extensions/ContentTranslation@master] Tweak dump-corpora.php

https://gerrit.wikimedia.org/r/391001

I am not sure my fix is enough. What is the best way to test it? Get it merged and SWAT deployed, and wait for the next run?

That seems best to me, as long as you know it's not introducing a bug. Are we sure there are no literal newlines anywhere in the strings you are processing, for example?

There are none in my small test set. In addition, the JSON standard mandates that newlines inside strings must be escaped, so this should never happen with valid JSON.

In addition, an extra tab inside the actual strings wouldn't be a big deal in this context.
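
As a quick illustration of why escaped control characters matter here (a made-up example, not code from the patch): json_encode() escapes newlines and tabs inside strings, so a record written on a single line of output can never contain a literal line break.

$record = [ 'source' => "first line\nsecond line\twith a tab" ];
echo json_encode( $record ) . "\n";
// Prints: {"source":"first line\nsecond line\twith a tab"}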

Change 391001 merged by jenkins-bot:
[mediawiki/extensions/ContentTranslation@master] Tweak dump-corpora.php

https://gerrit.wikimedia.org/r/391001

The next automatic dump will happen on Friday. I'll assume @ArielGlenn will post an update here after that.

The results of today's run are in:

Warning: rename(/tmp/conf-1.31.0-wmf.7-enwiki7YuPVF,/tmp/mw-cache-1.31.0-wmf.7/conf-enwiki): Permission denied in /srv/mediawiki/wmf-config/CommonSettings.php on line 220
Warning: proc_open(): fork failed - Cannot allocate memory in /srv/mediawiki/php-1.31.0-wmf.7/includes/export/DumpPipeOutput.php on line 68
Fatal error: Out of memory (allocated 33141030912) (tried to allocate 18446744072244403560 bytes) in /srv/mediawiki/php-1.31.0-wmf.7/vendor/monolog/monolog/src/Monolog/Processor/PsrLogMessageProcessor.php on line 44
Warning: rename(/tmp/conf-1.31.0-wmf.7-enwikiZNZBql,/tmp/mw-cache-1.31.0-wmf.7/conf-enwiki): Permission denied in /srv/mediawiki/wmf-config/CommonSettings.php on line 220
Warning: rename(/tmp/conf-1.31.0-wmf.7-enwikiIdI8zL,/tmp/mw-cache-1.31.0-wmf.7/conf-enwiki): Permission denied in /srv/mediawiki/wmf-config/CommonSettings.php on line 220

I was expecting at least some kind of change, but it seems as if nothing changed.

Change 393212 had a related patch set uploaded (by Nikerabbit; owner: Nikerabbit):
[mediawiki/extensions/ContentTranslation@master] Make dump-corpora.php streaming to reduce memory usage

https://gerrit.wikimedia.org/r/393212

Nikerabbit renamed this task from "Content translation dump broken, investigate" to "Content translation dumps failing due to excessive memory usage". Nov 24 2017, 3:33 PM

Change 393212 merged by jenkins-bot:
[mediawiki/extensions/ContentTranslation@master] Make dump-corpora.php streaming to reduce memory usage

https://gerrit.wikimedia.org/r/393212

Results after deployment of the above fix:

/data/xmldatadumps/public/other/contenttranslation/20171201$ ls | wc -l
314

-rw-rw-r-- 1 datasets datasets  48936597 Dec  1 20:55 cx-corpora._2_.text.tmx.gz      <-- last file written

There are 209 JSON files in there along with everything else. No error email from cron, so it looks to me like this problem has been resolved. Do you want to have a look at https://dumps.wikimedia.org/other/contenttranslation/20171201/ and see if everything you expect is there?

cx-corpora.de2en.html.json.gz seems to be missing, which is weird, given that the text.json version exists.

There are no related messages in the logs anywhere, so that is weird.

I propose creating a new task to monitor the dumps for a week or two to see whether this is a consistent issue. We no longer seem to have memory problems and are able to produce dumps, so I would call this fixed.

Fine by me; any idea about the missing file?

Nikerabbit moved this task from QA to Done on the Language-2017-Oct-Dec board.