Wikidata entity dumpers stuck with 100% CPU on snapshot1007
Closed, Resolved · Public

Description

On snapshot1007:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
60977 datasets  20   0 7195284 6.499g   3684 R 100.0 10.3   5777:51 php5 /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 1 --sharding-factor 6 --batch-size 1500 --format nt --flavor truthy-dump --no-cache
60983 datasets  20   0 7199524 6.502g   3676 R 100.0 10.4   5809:59 php5 /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 2 --sharding-factor 6 --batch-size 1500 --format nt --flavor truthy-dump --no-cache
60971 datasets  20   0 7199644 6.504g   3684 R  94.9 10.4   5821:09 php5 /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 0 --sharding-factor 6 --batch-size 1500 --format nt --flavor truthy-dump --no-cache
60980 datasets  20   0 7189028 6.491g   3684 R  94.9 10.3   5798:56 php5 /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 3 --sharding-factor 6 --batch-size 1500 --format nt --flavor truthy-dump --no-cache
60986 datasets  20   0 7198460 6.501g   3684 R  94.9 10.4   5784:49 php5 /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 4 --sharding-factor 6 --batch-size 1500 --format nt --flavor truthy-dump --no-cache
60989 datasets  20   0 7190612 6.493g   3684 R  94.9 10.3   5766:51 php5 /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 5 --sharding-factor 6 --batch-size 1500 --format nt --flavor truthy-dump --no-cache

They got pretty far before getting stuck:

$ tail -n1 dumpwikidatant-wikidata-20171122-truthy-BETA-*
==> dumpwikidatant-wikidata-20171122-truthy-BETA-0.log <==
Processed 6776250 entities.

==> dumpwikidatant-wikidata-20171122-truthy-BETA-1.log <==
Processed 6772414 entities.

==> dumpwikidatant-wikidata-20171122-truthy-BETA-2.log <==
Processed 6775991 entities.

==> dumpwikidatant-wikidata-20171122-truthy-BETA-3.log <==
Processed 6768160 entities.

==> dumpwikidatant-wikidata-20171122-truthy-BETA-4.log <==
Processed 6774085 entities.

==> dumpwikidatant-wikidata-20171122-truthy-BETA-5.log <==
Processed 6766093 entities.

Event Timeline

hoo created this task. · Nov 27 2017, 11:37 AM

Here are three perf top screenshots in order, with a few minutes between the first and last one.

Mentioned in SAL (#wikimedia-operations) [2017-11-27T18:08:26Z] <hoo> Killed truthy nt dumpers stuck at 100% CPU (from last week) on snapshot1007 (T181385)

hoo added a comment. · Nov 27 2017, 6:10 PM

Not a single entity has been dumped since I created this ticket more than 7 hours ago, so I have now killed all related processes.
Given that we will have a new dump this week anyway, I won't bother restarting it manually.

hoo added a comment. · Nov 27 2017, 6:28 PM

I briefly tested this locally by dumping my wiki repeatedly (up to 100 times), but I couldn't see any kind of memory leak (then again, I was dumping the same entities over and over).

Tomorrow I'll test this with actual data.

hoo added a comment. · Nov 28 2017, 11:57 PM

So I just ran one JSON dumper with --no-cache and one without on mwdebug1001 (with HHVM)… the results are baffling:

$ head -n10 dumpJson*.stderr 
==> dumpJson-no-cache.stderr <==
Dumping shard 0/4
memory_get_usage(): 18874368
Processed 23 entities.
memory_get_usage(): 25165824
Processed 41 entities.
memory_get_usage(): 33554432
Processed 60 entities.
memory_get_usage(): 35651584
Processed 81 entities.
memory_get_usage(): 37748736

==> dumpJson.stderr <==
Dumping shard 0/4
memory_get_usage(): 18874368
Processed 23 entities.
memory_get_usage(): 23068672
Processed 41 entities.
memory_get_usage(): 27262976
Processed 60 entities.
memory_get_usage(): 29360128
Processed 81 entities.
memory_get_usage(): 29360128

$ tail -n10 dumpJson*.stderr 
==> dumpJson-no-cache.stderr <==
memory_get_usage(): 553648128
Processed 19902 entities.
memory_get_usage(): 553648128
Processed 19924 entities.
memory_get_usage(): 553648128
Processed 19947 entities.
memory_get_usage(): 555745280
Processed 19977 entities.
memory_get_usage(): 555745280
Processed 20000 entities.

==> dumpJson.stderr <==
memory_get_usage(): 62914560
Processed 19902 entities.
memory_get_usage(): 62914560
Processed 19924 entities.
memory_get_usage(): 62914560
Processed 19947 entities.
memory_get_usage(): 62914560
Processed 19977 entities.
memory_get_usage(): 62914560
Processed 20000 entities.

I'll look into this more tomorrow… I have no idea how this could be happening.

Mentioned in SAL (#wikimedia-operations) [2017-11-28T23:58:16Z] <hoo> Ran scap pull on mwdebug1001 after T181385 related testing

Mentioned in SAL (#wikimedia-operations) [2017-12-01T00:04:47Z] <hoo> Killed all remaining Wikidata JSON/RDF dumpers, due to T181385. This means no dumps this week!

hoo raised the priority of this task from High to Unbreak Now! · Dec 1 2017, 12:05 AM
Restricted Application added subscribers: Liuxinyu970226, Jay8g, TerraCodes. · Dec 1 2017, 12:05 AM
hoo added a comment. · Dec 1 2017, 12:48 AM

When changing CacheRetrievingEntityRevisionLookup to always use its underlying EntityRevisionLookup ($this->lookup) on mwdebug1001, both the call with --no-cache and the one without it show very similar memory usage behavior.

hoo added a comment. · Dec 1 2017, 12:52 AM

Reverting 795350da2e5c49efa66c1950bd034f46aeb3768a also doesn't seem to make any difference.

hoo added a comment. · Dec 1 2017, 1:31 AM

When changing CacheRetrievingEntityRevisionLookup to always use its underlying EntityRevisionLookup ($this->lookup) on mwdebug1001, both the call with --no-cache and the one without it show very similar memory usage behavior.

This suggests that removing --no-cache won't have any effect on Zend at all (as the hit rate on Zend is 0 anyway).

I tried digging into this some more, but couldn't find anything useful yet. I skimmed the whole wmf.6 -> wmf.8 Wikidata and MediaWiki diff and couldn't see anything related; where something looked related, I manually reverted those bits on mwdebug1001, without success.

Mentioned in SAL (#wikimedia-operations) [2017-12-01T01:46:52Z] <hoo> Ran scap pull on mwdebug1001 after T181385 testing

hoo renamed this task from "Wikidata truthy nt dumpers stuck with 100% CPU on snapshot1007" to "Wikidata entity dumpers stuck with 100% CPU on snapshot1007". · Dec 1 2017, 2:47 PM
hoo added a comment. · Dec 2 2017, 3:10 AM

So I finally managed to track this down:

In Revision::getRevisionText, getWithSetCallback on the main WANObjectCache is used to retrieve revision text.

As of 4432e898be078eea73fbf072f5a383ec1ded0d67, WANObjectCache uses a StatsdDataFactoryInterface for certain kinds of (often very high-volume) logging. Given that MediaWiki uses a BufferingStatsdDataFactory by default, we effectively accumulate all of these stat hits until we eventually run out of memory.

Just commenting out $this->buffer[] = $entity; in BufferingStatsdDataFactory::produceStatsdData (locally on snapshot1001) immediately fixed this.
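
To illustrate the mechanism, here is a minimal sketch with simplified, hypothetical class and field names (the real BufferingStatsdDataFactory buffers StatsdData objects): web requests flush the buffer when the request ends, but a CLI dumper that processes millions of entities never reaches such a flush point, so the buffer grows without bound.

<?php
// Minimal sketch of the leak (hypothetical, simplified class). Each
// recorded metric appends one entry to $buffer, and nothing ever drains
// it in a long-running CLI script.
class BufferingStatsSketch {
	/** @var array grows by one entry per recorded metric */
	private $buffer = array();

	public function produceStatsdData( $key, $value = 1 ) {
		$entity = array( 'key' => $key, 'value' => $value );
		// The append hoo commented out corresponds to this line:
		$this->buffer[] = $entity;
		return $entity;
	}

	public function getBufferSize() {
		return count( $this->buffer );
	}
}

$stats = new BufferingStatsSketch();
// Simulate a dump shard: at least one metric per entity, millions of entities.
for ( $i = 0; $i < 1000000; $i++ ) {
	$stats->produceStatsdData( 'wanobjectcache.revisiontext.hit' );
}
echo $stats->getBufferSize() . " buffered entries and counting\n";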

hoo assigned this task to aaron. · Dec 2 2017, 3:37 AM
hoo added subscribers: demon, Krinkle, Addshore, Legoktm.

I don't see a clear way to fix this, so I'm reassigning it to Aaron.

  • We could set a maximum buffer size (even an excessively high one) in BufferingStatsdDataFactory (a rough sketch follows below).
  • We could remove the tracking from WANObjectCache :/
  • We could make sure the buffer in BufferingStatsdDataFactory is sent out more often… this would probably require dirty changes to BufferingStatsdDataFactory, or a decorator that takes care of it.
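
A rough sketch of the first option, with a hypothetical cap and drop-on-overflow policy (the real produceStatsdData signature differs in detail):

<?php
// Hypothetical capped buffer (option 1 above). The constant and the
// drop-on-overflow behavior are illustrative choices only.
class CappedBufferingStatsSketch {
	const MAX_BUFFER_SIZE = 100000; // generous, but finite

	private $buffer = array();

	public function produceStatsdData( $key, $value = 1 ) {
		if ( count( $this->buffer ) >= self::MAX_BUFFER_SIZE ) {
			// Drop further entries once the cap is hit (losing some stats)
			// so a days-long CLI script cannot exhaust memory.
			return;
		}
		$this->buffer[] = array( 'key' => $key, 'value' => $value );
	}
}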

Mentioned in SAL (#wikimedia-operations) [2017-12-02T03:51:21Z] <hoo> Ran "scap pull" on snapshot1001, after final T181385 tests

Change 394758 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] Wikidata weekly json and rdf dumps disabled temporarily

https://gerrit.wikimedia.org/r/394758

Change 394758 merged by ArielGlenn:
[operations/puppet@production] Wikidata weekly json and rdf dumps disabled temporarily

https://gerrit.wikimedia.org/r/394758

aaron added a comment. · Dec 2 2017, 8:18 PM

How long do these run? The sample rate in config is set to be extremely low. So perhaps:

  • The buffering class buffers things that won't even be saved
  • The buffering could be disabled in CLI mode (sketched below)
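
For the second idea, the wiring could look roughly like this. NullStatsdDataFactory and MediaWikiServices::getStatsdDataFactory() exist in MediaWiki, but the exact placement and the CLI check shown are assumptions, not the merged change:

<?php
// Sketch: hand the WAN cache a null stats sink on the command line so
// nothing gets buffered for it in the first place. The 'stats' parameter
// wiring is illustrative; see the actual patch for the real change.
use MediaWiki\MediaWikiServices;

$wanParams['stats'] = ( PHP_SAPI === 'cli' )
	? new NullStatsdDataFactory() // discards every metric immediately
	: MediaWikiServices::getInstance()->getStatsdDataFactory();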

Change 394772 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Only send statsd data for WAN cache in non-CLI mode

https://gerrit.wikimedia.org/r/394772

Change 394779 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Try to opportunistically flush statsd data in maintenance scripts

https://gerrit.wikimedia.org/r/394779

aaron added a comment. · Dec 2 2017, 9:20 PM

Change 394779 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Try to opportunistically flush statsd data in maintenance scripts
https://gerrit.wikimedia.org/r/394779

This seems more general and preferable to the other patch, but I'm leaving both here to look at.
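
Very roughly, the idea behind that second patch looks like this (hypothetical helper and class names; the real change hooks into the Maintenance infrastructure):

<?php
// Sketch of opportunistic flushing from a long-running maintenance loop:
// periodically send and clear the buffered stats instead of holding them
// until the script exits. flushBufferedStats() is a hypothetical stand-in
// for whatever the real patch wires up.
class DumpLoopSketch {
	const BATCH_SIZE = 1000;

	public function run( array $entityIds ) {
		$done = 0;
		foreach ( $entityIds as $id ) {
			$this->dumpEntity( $id );
			if ( ++$done % self::BATCH_SIZE === 0 ) {
				$this->flushBufferedStats(); // keeps memory bounded
			}
		}
		$this->flushBufferedStats(); // final flush on completion
	}

	private function dumpEntity( $id ) {
		// ... dump one entity (omitted) ...
	}

	private function flushBufferedStats() {
		// In MediaWiki terms: drain the BufferingStatsdDataFactory buffer
		// through the statsd client; shown here only as a placeholder.
	}
}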

hoo added a comment. · Dec 2 2017, 9:41 PM

How long do these run? The sample rate in config is set to be extremely low. So perhaps:

  • The buffering class buffers things that won't even be saved

That's definitely the case here, given that the sample rate is only applied later on.
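
A toy illustration of that (not the actual SamplingStatsdClient code): every event is buffered in full, and the configured sample rate only discards events at send time, so almost nothing that was held in memory actually gets sent.

<?php
// Toy model: buffer everything, sample only when sending. Even with a
// 1-in-10,000 sample rate, all 100k events sit in memory until the flush.
$sampleRate = 0.0001;
$buffer = array();

for ( $i = 0; $i < 100000; $i++ ) {
	$buffer[] = 'wanobjectcache.hit:1|c'; // buffered regardless of the rate
}

// Sampling is applied only here, at send time:
$sent = 0;
foreach ( $buffer as $event ) {
	if ( mt_rand() / mt_getrandmax() < $sampleRate ) {
		$sent++; // would be sent as "$event|@$sampleRate"
	}
}
printf( "buffered %d events, sent %d\n", count( $buffer ), $sent );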

Envlh added a subscriber: Envlh. · Dec 4 2017, 4:42 PM

Change 394772 merged by jenkins-bot:
[mediawiki/core@master] Only send statsd data for WAN cache in non-CLI mode

https://gerrit.wikimedia.org/r/394772

hoo added a comment. · Dec 4 2017, 6:01 PM

So I merged the patch that makes sure we only actually collect this data when not in CLI mode.

I also tried putting MediaWikiServices::getInstance()->getStatsdDataFactory()->setEnabled( false ); in DumpScript::__construct, but this only had a limited effect. For that to work properly, BufferingStatsdDataFactory::produceStatsdData needs to normalize and set the entity's key before aborting. Any idea why that is, @aaron?
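
A sketch of the pitfall being described (the method shape is simplified and partly guessed): if the disabled check returned before the key were normalized and set, callers that use the returned entity would get a malformed one.

<?php
// Simplified, partly hypothetical shape of produceStatsdData(): normalize
// and set the entity's key *before* honoring the disabled flag, so the
// returned entity is well-formed even when nothing is buffered.
class StatsdEntitySketch {
	public $key;
	public $value;
}

class ToggleableBufferingStatsSketch {
	private $enabled = true;
	private $buffer = array();

	public function setEnabled( $enabled ) {
		$this->enabled = $enabled;
	}

	public function produceStatsdData( $key, $value = 1 ) {
		$entity = new StatsdEntitySketch();
		// Key normalization must happen before the early return below;
		// otherwise disabled mode would hand back an entity with no key.
		$entity->key = str_replace( array( ' ', '/' ), array( '_', '.' ), $key );
		$entity->value = $value;
		if ( !$this->enabled ) {
			return $entity; // skip only the buffering
		}
		$this->buffer[] = $entity;
		return $entity;
	}
}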

Change 395055 had a related patch set uploaded (by Hoo man; owner: Aaron Schulz):
[mediawiki/core@wmf/1.31.0-wmf.10] Only send statsd data for WAN cache in non-CLI mode

https://gerrit.wikimedia.org/r/395055

Change 395055 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.10] Only send statsd data for WAN cache in non-CLI mode

https://gerrit.wikimedia.org/r/395055

Mentioned in SAL (#wikimedia-operations) [2017-12-04T18:35:16Z] <hoo@tin> Synchronized php-1.31.0-wmf.10/includes/objectcache/ObjectCache.php: Only send statsd data for WAN cache in non-CLI mode (T181385) (duration: 00m 44s)

Mentioned in SAL (#wikimedia-operations) [2017-12-04T18:40:40Z] <hoo> Ran scap pull on snapshot1001 (T181385)

Mentioned in SAL (#wikimedia-operations) [2017-12-04T18:46:40Z] <hoo> Started dumpwikidatajson.sh on snapshot1007 (T181385)

hoo lowered the priority of this task from Unbreak Now! to High. · Dec 4 2017, 6:48 PM

Downgrading this, as it seems to be working again.

Keeping this open until we have re-enabled the cron and seen that this actually works.

aaron moved this task from Inbox to Doing on the Performance-Team board. · Dec 4 2017, 9:00 PM

We had similar problems with long-running scripts and statsd here: T181716: forceSearchIndex.php hangs at the end of the process when running on large wikis. I think the combination of statsd and long-running maintenance scripts is not working well with some recent change there. Hopefully limiting it to non-CLI mode will fix it.

Envlh awarded a token. · Dec 6 2017, 8:50 AM
hoo closed this task as Resolved. · Dec 8 2017, 12:47 AM
hoo claimed this task.
hoo removed a project: Patch-For-Review.
hoo added a subscriber: aaron.

Dump crons are back (af1f7dabee931dbdb7366b7be1f93698f2c56108), so I expect dumps to arrive in time next week!

Change 394779 merged by jenkins-bot:
[mediawiki/core@master] Try to opportunistically flush statsd data in maintenance scripts

https://gerrit.wikimedia.org/r/394779