
OOM problems with snapshot1007 due to wikidata dumpRdf jobs taking all 60GB+ of main memory
Closed, Resolved · Public

Description

I was looking into why a number of cirrus dumps this week came out with 0 size (well, 20 bytes, but that's just the compression header). From snapshot1007.eqiad.wmnet:/var/log/cirrusdump/cirrusdump-alswiki-20170327-cirrussearch-content.log:

could not allocate 218103807 bytes for translation cache
[Mon Mar 27 16:16:46 2017] [hphp] [24518:7f1008f43100:0:000001] [] Lost parent, LightProcess exiting
[Mon Mar 27 16:16:46 2017] [hphp] [24515:7f1008f43100:0:000001] [] Lost parent, LightProcess exiting

dmesg doesn't have that specific entry, but I see dmesg logs for hhvm, php5, and python being killed throughout the day.
Pastie of one such event: P5145

The dmesg output unfortunately only records the program name, not the command-line flags the script was run with. Looking at the machine right now, a few processes are using what seems like an outsized amount of memory:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                     
10045 datasets  20   0 8191576 4.729g   3948 R  91.9  7.5  57:09.34 php                                                                                                                                                         
10038 datasets  20   0 8162064 4.726g   3960 R  90.1  7.5  57:19.45 php                                                                                                                                                         
10047 datasets  20   0 8146932 4.714g   3960 R  86.4  7.5  56:59.44 php                                                                                                                                                         
10036 datasets  20   0 8144216 4.714g   3948 S  93.8  7.5  57:33.15 php                                                                                                                                                         
10042 datasets  20   0 8081348 4.645g   3952 S  80.9  7.4  57:15.63 php

Those all belong to this process tree:

root     10021  0.0  0.0  59760   460 ?        S    Mar27   0:00  \_ CRON
datasets 10022  0.0  0.0   4440   548 ?        Ss   Mar27   0:00      \_ /bin/sh -c /usr/local/bin/dumpwikidatattl.sh
datasets 10023  0.0  0.0  12436   920 ?        S    Mar27   0:00          \_ /bin/bash /usr/local/bin/dumpwikidatattl.sh
datasets 10034  0.0  0.0  12432   320 ?        S    Mar27   0:00              \_ /bin/bash /usr/local/bin/dumpwikidatattl.sh
datasets 10038 87.0  7.7 8334096 5127396 ?     R    Mar27  58:41              |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 0 --sharding-factor 5 --format ttl
datasets 10056  0.0  0.0 381892 43760 ?        Ss   Mar27   0:00              |   |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 0 --sharding-factor 5 --format ttl
datasets 10063  0.0  0.0 381892 43760 ?        Ss   Mar27   0:00              |   |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 0 --sharding-factor 5 --format ttl
datasets 10069  0.0  0.0 381892 43760 ?        Ss   Mar27   0:00              |   |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 0 --sharding-factor 5 --format ttl
datasets 10074  0.0  0.0 381892 43760 ?        Ss   Mar27   0:00              |   |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 0 --sharding-factor 5 --format ttl
datasets 10078  0.0  0.0 381892 43760 ?        Ss   Mar27   0:00              |   |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 0 --sharding-factor 5 --format ttl
datasets 10040  3.6  0.0   4596   788 ?        S    Mar27   2:29              |   \_ gzip
datasets 10035  0.0  0.0  12432   320 ?        S    Mar27   0:00              \_ /bin/bash /usr/local/bin/dumpwikidatattl.sh
datasets 10036 87.4  7.7 8316248 5115320 ?     R    Mar27  58:55              |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 1 --sharding-factor 5 --format ttl
datasets 10054  0.0  0.0 382448 43764 ?        Ss   Mar27   0:00              |   |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 1 --sharding-factor 5 --format ttl
datasets 10060  0.0  0.0 382448 43764 ?        Ss   Mar27   0:00              |   |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 1 --sharding-factor 5 --format ttl
datasets 10065  0.0  0.0 382448 43764 ?        Ss   Mar27   0:00              |   |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 1 --sharding-factor 5 --format ttl
datasets 10071  0.0  0.0 382448 43764 ?        Ss   Mar27   0:00              |   |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 1 --sharding-factor 5 --format ttl
datasets 10077  0.0  0.0 382448 43764 ?        Ss   Mar27   0:00              |   |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 1 --sharding-factor 5 --format ttl
datasets 10039  3.7  0.0   4596   788 ?        S    Mar27   2:29              |   \_ gzip
datasets 10037  0.0  0.0  12432   320 ?        S    Mar27   0:00              \_ /bin/bash /usr/local/bin/dumpwikidatattl.sh
datasets 10042 86.9  7.6 8253380 5043028 ?     R    Mar27  58:38              |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 2 --sharding-factor 5 --format ttl
datasets 10058  0.0  0.0 381024 43760 ?        Ss   Mar27   0:00              |   |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 2 --sharding-factor 5 --format ttl
datasets 10062  0.0  0.0 381024 43760 ?        Ss   Mar27   0:00              |   |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 2 --sharding-factor 5 --format ttl
datasets 10066  0.0  0.0 381024 43760 ?        Ss   Mar27   0:00              |   |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 2 --sharding-factor 5 --format ttl
datasets 10072  0.0  0.0 381024 43760 ?        Ss   Mar27   0:00              |   |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 2 --sharding-factor 5 --format ttl
datasets 10075  0.0  0.0 381024 43760 ?        Ss   Mar27   0:00              |   |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 2 --sharding-factor 5 --format ttl
datasets 10043  3.6  0.0   4596   784 ?        S    Mar27   2:29              |   \_ gzip
datasets 10041  0.0  0.0  12432   320 ?        S    Mar27   0:00              \_ /bin/bash /usr/local/bin/dumpwikidatattl.sh
datasets 10047 86.5  7.7 8314868 5111292 ?     R    Mar27  58:20              |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 3 --sharding-factor 5 --format ttl
datasets 10055  0.0  0.0 381068 43764 ?        Ss   Mar27   0:00              |   |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 3 --sharding-factor 5 --format ttl
datasets 10059  0.0  0.0 381068 43764 ?        Ss   Mar27   0:00              |   |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 3 --sharding-factor 5 --format ttl
datasets 10064  0.0  0.0 381068 43764 ?        Ss   Mar27   0:00              |   |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 3 --sharding-factor 5 --format ttl
datasets 10068  0.0  0.0 381068 43764 ?        Ss   Mar27   0:00              |   |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 3 --sharding-factor 5 --format ttl
datasets 10070  0.0  0.0 381068 43764 ?        Ss   Mar27   0:00              |   |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 3 --sharding-factor 5 --format ttl
datasets 10048  3.7  0.0   4596   788 ?        S    Mar27   2:29              |   \_ gzip
datasets 10044  0.0  0.0  12432   320 ?        S    Mar27   0:00              \_ /bin/bash /usr/local/bin/dumpwikidatattl.sh
datasets 10045 86.8  7.7 8359512 5129128 ?     R    Mar27  58:30                  \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 4 --sharding-factor 5 --format ttl
datasets 10057  0.0  0.0 380660 41780 ?        Ss   Mar27   0:00                  |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 4 --sharding-factor 5 --format ttl
datasets 10061  0.0  0.0 380660 41780 ?        Ss   Mar27   0:00                  |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 4 --sharding-factor 5 --format ttl
datasets 10067  0.0  0.0 380660 41780 ?        Ss   Mar27   0:00                  |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 4 --sharding-factor 5 --format ttl
datasets 10073  0.0  0.0 380660 41780 ?        Ss   Mar27   0:00                  |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 4 --sharding-factor 5 --format ttl
datasets 10076  0.0  0.0 380660 41780 ?        Ss   Mar27   0:00                  |   \_ php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 4 --sharding-factor 5 --format ttl
datasets 10046  3.7  0.0   4596   784 ?        S    Mar27   2:30                  \_ gzip

I don't know for sure that it's the RDF dumping causing this, but it seems a likely culprit.

Event Timeline

I don't know if it's part of the plan for the new dumps infrastructure, but some way to isolate dumps would be nice, so that one dump can't take up all the resources and break other dumps.

EBernhardson added a comment (edited). Mar 28 2017, 12:34 AM

It looks like long-running cirrussearch dumps, such as enwiki, might have a similar problem, although not as pronounced. Monitoring over about 20 minutes, the CirrusSearch dump job for enwiki grew from 1.1G to 1.2G, while the wikidata dumps grew from 4.7G to 7.3G. Randomly guessing, this could be logs and stats generation that isn't being regularly flushed, because that code was designed for web requests that end after a few hundred ms.

I'm almost certain now there is a memory leak in at least the wikidata dumpRdf script. There might also be a general MediaWiki maintenance-script leak; not sure.
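A quick way to confirm steady RSS growth like this (a generic sketch, not a command that was actually run here; the PID, interval, and sample count are placeholders) is to sample /proc/&lt;pid&gt;/status periodically:

```shell
# leak_watch: print RSS samples for a PID at a fixed interval.
# A steadily increasing rss_kb across samples suggests a leak.
leak_watch() {
    pid=${1:-$$}       # default: this shell, just for demonstration
    interval=${2:-60}  # seconds between samples
    n=${3:-5}          # number of samples
    i=0
    while [ "$i" -lt "$n" ]; do
        rss_kb=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status")
        echo "sample=$i rss_kb=$rss_kb"
        sleep "$interval"
        i=$((i + 1))
    done
}

# Example: three quick samples of this shell itself.
leak_watch "$$" 0 3
```

Pointing this at one of the dumpRdf PIDs from the process tree above would make the growth rate easy to eyeball.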

@hoo Not sure if you are the right person to ping, but you are marked as the last person to change the cron script that runs these, so this seems like a good place to start. Could you look into this, or pass it on to someone at WMDE who could?

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
40100 datasets  20   0 38.657g 0.028t   4128 R  64.3 45.0 410:36.42 php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 0 --sharding-factor 5 --format ttl
40108 datasets  20   0 38.850g 0.030t   3960 R  55.4 48.5 410:56.87 php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 3 --sharding-factor 5 --format ttl

EBernhardson renamed this task from OOM(?) problems with snapshot1007 to OOM problems with snapshot1007 due to wikidata dumpRdf jobs taking all 60GB+ of main memory. Mar 28 2017, 2:21 PM

Change 345170 had a related patch set uploaded (by EBernhardson):
[operations/puppet@production] Prevent wikidata dumps from taking all memory on snapshot1007

https://gerrit.wikimedia.org/r/345170

The provided patch is a plausible way to prevent this one dump from causing cascading failures in other dumps. It's a rather incomplete solution, though, as the memory limit is per process and this dump spawns 30 php processes. I think cgroups might be a more complete solution, but I'd like feedback before working one up.
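For context, a cgroup caps the combined memory of an entire process tree, which a per-process ulimit can't do. A hypothetical cgroup v1 sketch (the group name and the 25G figure are made up for illustration; this is not the deployed config, and it requires root plus the cgroup-tools package):

```shell
# Illustrative only: create a memory cgroup, cap it, and run the dump
# script inside it. Every forked php child inherits the group, so the
# limit applies to the whole process tree rather than per process.
cgcreate -g memory:wikidata-dumps
echo 25G > /sys/fs/cgroup/memory/wikidata-dumps/memory.limit_in_bytes
cgexec -g memory:wikidata-dumps /usr/local/bin/dumpwikidatattl.sh
```

When the group hits the limit, the kernel's OOM killer targets processes inside it instead of unrelated jobs on the host.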

hoo added a comment. Mar 29 2017, 9:14 AM

> this dump spawns 30 php processes

It shouldn't and probably doesn't: it spawns 5 processes, though those processes might have multiple threads each (this is HHVM, in the end).

To rule out a memory leak in HHVM, we might also want to run the script with Zend once.

In addition, snapshot1007 is used for other things simultaneously (e.g. xml/sql dump generation, like the other snapshot hosts). We leave some cores and memory unused for cron jobs like the wikidata dumps, but 30 processes at once is not manageable; 5 or 6 is what we'd like to see. In the meantime I'll peek at the dumpRdf script and see if anything leaps out, though I hope those who know more will jump in soon.

hoo added a comment. Mar 29 2017, 9:20 AM

I just started one RDF shard dump (to /dev/null) with Zend and one with HHVM. Let's see how that goes.

Sorry, @hoo, I didn't see your replies until after I posted my comment. Let's see what your test run shows.

hoo added a comment. Mar 29 2017, 10:50 AM

HHVM on snapshot1007:
22079 datasets 20 0 10.522g 9.948g 22216 R 88.5 15.8 78:49.10 php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --sharding-factor 5 --shard 0 --snippet
Processed 424796 entities.

Zend on terbium:
16514 www-data 20 0 494840 217436 10144 R 86.9 0.7 82:10.50 php5 /srv/mediawiki-staging/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --sharding-factor 5 --shard 0 --snippet
Processed 313032 entities.

Both processes were started at about the same time; they might have used different database slaves, though (that could affect the speed, but not the memory leak).

Change 345303 had a related patch set uploaded (by Hoo man):
[operations/puppet@production] Use Zend php to create the Wikidata entity dumps

https://gerrit.wikimedia.org/r/345303

Change 345303 merged by ArielGlenn:
[operations/puppet@production] Use Zend php to create the Wikidata entity dumps

https://gerrit.wikimedia.org/r/345303

hoo added a comment. Mar 29 2017, 11:01 AM

Snapshot1007 currently has:

hoo@snapshot1007:~$ hhvm --version
HipHop VM 3.12.7 (rel)
Compiler: 3.12.7+dfsg-1+wmf1~trusty1
Repo schema: a28132f015144d07eea2af17c422c8d422b7111e
hoo closed this task as Resolved. Apr 4 2017, 11:53 AM
hoo claimed this task.

The immediate issue in question has been resolved by switching our dumpers back to Zend php.

For fixing the actual memory leak, please see T161695: Wikidata dump maintenance scripts cause HHVM to leak memory heavily, not doing GC.

Change 345170 abandoned by EBernhardson:
Prevent wikidata dumps from taking all memory on snapshot1007

https://gerrit.wikimedia.org/r/345170

ArielGlenn moved this task from Backlog to Done on the Dumps-Generation board. Jun 19 2017, 9:37 AM