
[Tracking] Wikidata entity dumpers need to cope with the immense Wikidata growth recently
Closed, Resolved, Public

Description

Given that Wikidata currently grows at 3-10% a week, we need to make the Wikidata entity dumpers keep up with that.

The changes in batch size (4eedfb48e9fdc93eea13d9fd3bd341e66c1abfbc) and https://github.com/wmde/WikibaseDataModel/pull/762 will already ease some of the pain, but given the immense growth, they will probably barely offset four weeks of Wikidata growth.

Possible things to do:

  • Create a "master dump" (or some such) from which all other dumps can be derived (this will ease the load on the DBs, but will hardly help with CPU time)
  • Increase the number of runners further (from currently 5): https://gerrit.wikimedia.org/r/383414
  • Try to derive old dumps from new ones (not easy to do, and it is unclear how much there is to gain here)
  • Do more profiling and try to find more low-hanging fruit (like the examples above, or T157013)
  • Switch away from PHP 5 to PHP 7 or HHVM (also see the related discussion at T172165)
  • Find the right --batch-size (https://gerrit.wikimedia.org/r/384204); see the sketch after this list for how sharding and batch size interact
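
For illustration, a minimal sketch of how sharding and --batch-size interact (hypothetical code, not the actual dumpJson.php / dumpwikidatajson.sh logic): each runner only dumps the entity IDs that fall into its shard and loads them from the database one batch at a time, so the batch size trades the number of queries against the load and memory per query.

<?php
// Hypothetical sketch only; the real maintenance scripts differ in detail.

/**
 * @param iterable $numericEntityIds all numeric entity IDs to dump
 * @param callable $dumpBatch        writes one batch of IDs to the output
 */
function dumpShard(
	iterable $numericEntityIds,
	callable $dumpBatch,
	int $shardCount,
	int $shardIndex,
	int $batchSize
) {
	$batch = [];
	foreach ( $numericEntityIds as $id ) {
		// Each runner only handles the IDs that fall into its shard …
		if ( $id % $shardCount !== $shardIndex ) {
			continue;
		}
		$batch[] = $id;
		// … and loads them in batches, so --batch-size trades query count
		// against per-query load and memory.
		if ( count( $batch ) >= $batchSize ) {
			$dumpBatch( $batch );
			$batch = [];
		}
	}
	if ( $batch !== [] ) {
		$dumpBatch( $batch ); // flush the final, smaller batch
	}
}

// Example: 6 runners (shards), 1500 entities per batch, this process is shard 0.
dumpShard( range( 1, 10000 ), function ( array $ids ) {
	echo 'would dump ' . count( $ids ) . " entities\n";
}, 6, 0, 1500 );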

Patch-For-Review / TODOs:

I consider this task done when the dumps finish no later than mid-Thursday again and don't run well into the weekend.

Event Timeline

Change 383414 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/puppet@production] Increase the shard count for Wikidata entity dumps from 5 to 6

https://gerrit.wikimedia.org/r/383414

Even though growth has slowed to just 1.3% this week, the dump still seems to be substantially slower… since we deployed https://gerrit.wikimedia.org/r/380628, the dumps have taken at least a few hours longer :/

I have no evidence that https://gerrit.wikimedia.org/r/380628 is harmful here, but we (or I) might need to investigate further.

Change 384204 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/puppet@production] Test different batch sizes in dumpwikidatajson.sh

https://gerrit.wikimedia.org/r/384204

T178247: Use a retrieve only CachingEntityRevisionLookup for dumps should give us another few percent, as it makes loading and unserializing entities up to 15% faster. I haven't specifically tested that in the context of the dumps, but based on old profiling data, this should give us up to about 8-9% speedup for dumps (as getEntity takes 58-60% of the dump time and we made it up to 15% faster).
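
For scale: if getEntity is 58-60% of the dump time and entity loading becomes up to 15% faster, the overall gain is roughly 0.6 × 0.15 ≈ 0.09, i.e. the 8-9% above. A minimal sketch of what a retrieve-only caching lookup could look like (hypothetical interface and class names, not the actual Wikibase API; it assumes a memcached-style cache whose get() returns false on a miss): it reuses entity revisions the app servers already cached, but never writes back.

<?php
// Hypothetical sketch only; not the actual Wikibase classes or signatures.

interface EntityRevisionLookup {
	/** @return mixed|null the entity revision, or null if the entity does not exist */
	public function getEntityRevision( $entityId );
}

class RetrieveOnlyCachingLookup implements EntityRevisionLookup {
	private $lookup;         // uncached lookup that hits the database
	private $cache;          // shared cache client (get() returns false on miss)
	private $cacheKeyPrefix; // e.g. the value of $wgWBSharedCacheKey

	public function __construct( EntityRevisionLookup $lookup, $cache, $cacheKeyPrefix ) {
		$this->lookup = $lookup;
		$this->cache = $cache;
		$this->cacheKeyPrefix = $cacheKeyPrefix;
	}

	public function getEntityRevision( $entityId ) {
		// Reuse a revision the app servers already cached …
		$cached = $this->cache->get( $this->cacheKeyPrefix . ':' . $entityId );
		if ( $cached !== false ) {
			return $cached;
		}

		// … otherwise fall back to the database, but never write back: a dump
		// run touching millions of entities would only churn the shared cache.
		return $this->lookup->getEntityRevision( $entityId );
	}
}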

Change 383414 merged by ArielGlenn:
[operations/puppet@production] Increase the shard count for Wikidata entity dumps from 5 to 6

https://gerrit.wikimedia.org/r/383414

Change 384204 merged by ArielGlenn:
[operations/puppet@production] Test different batch sizes in dumpwikidatajson.sh

https://gerrit.wikimedia.org/r/384204

hoo updated the task description.

FYI: the run completed:

Oct 20 14:34 latest-truthy.nt.bz2 -> 20171018/wikidata-20171018-truthy-BETA.nt.bz2

So not quite the boost you were looking for.

@ArielGlenn, did you forget to post the runtime of the job?

Note that the changes to the Wikibase-DataModel are not released yet. I think we can't easily backport the changes to the 7.x branch because other changes have been made to the same classes, but maybe we should try to do that just to get an early release. @hoo?

> Note that the changes to the Wikibase-DataModel are not released yet. I think we can't easily backport the changes to the 7.x branch because other changes have been made to the same classes, but maybe we should try to do that just to get an early release. @hoo?

I can try to do that on Monday… both changes should actually backport quite well.

Please keep in mind that the only changes we have deployed so far are the increase in the number of shards and the batch-size changes (which might not actually have much of an effect :/).

> @ArielGlenn, did you forget to post the runtime of the job?

No, the job starts early on Monday and the hope was for it to complete sometime Thursday.

thiemowmde renamed this task from "Wikidata entity dumpers need to cope with the immense Wikidata growth recently" to "[Tracking] Wikidata entity dumpers need to cope with the immense Wikidata growth recently". Oct 26 2017, 3:24 PM
thiemowmde removed a project: Patch-For-Review.
thiemowmde updated the task description.

While T178247: Use a retrieve only CachingEntityRevisionLookup for dumps will certainly make the dumps much faster, it will only do so (noticeably) on HHVM. This is because we split the cache between HHVM and Zend (see below), so the (currently) Zend-based dumpers won't profit from the cache, which is probably mostly populated on the HHVM side (as all app servers run HHVM).
There are some other maintenance scripts running on Zend which might also write into this cache… so maybe this will still help somewhat.

if ( defined( 'HHVM_VERSION' ) ) {
        // Split the cache up for hhvm. T73461
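        // Appending a suffix here means entries written under HHVM and entries
        // written by Zend PHP end up under different keys and are never shared.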
        $wgWBSharedCacheKey .= '-hhvm';
}

(Probably) due to the DataModel updates, the current JSON dump was created in just 25 hours, compared to ~34-35h last week. (This is data from one run only, so not overly reliable… but the difference is huge.)

> (Probably) due to the DataModel updates, the current JSON dump was created in just 25 hours, compared to ~34-35h last week. (This is data from one run only, so not overly reliable… but the difference is huge.)

If all future runs turn out that way, this is very good news! Looking forward to the other optimizations too.

Run time for the full TTL dump (the data growth percentages are measured on the gzipped files):

20170918: 25:30h (5 shards)
20170925: 26:30h (5 shards) (~ +3.7% data)
20171009: 30:30h (5 shards) (~ +6.7% data)
20171016: 36:30h (6 shards) (~ +1.9% data) (contains ad332804b1fea069043d14d0195f6fe2ed5a6f4b and 3164215d0d790f37cc1cf386ef22a188e81a10d0)
20171023: 43:10h (6 shards) (~ +0.8% data)
20171030: 26:00h (6 shards) (~ +1.6% data)

This also indicates that our current changes had a huge impact… but also that ad332804b1fea069043d14d0195f6fe2ed5a6f4b and/or 3164215d0d790f37cc1cf386ef22a188e81a10d0 might have a huge negative impact here :/

> This also indicates that our current changes had a huge impact… but also that ad332804b1fea069043d14d0195f6fe2ed5a6f4b and/or 3164215d0d790f37cc1cf386ef22a188e81a10d0 might have a huge negative impact here :/

I just quickly looked at both and found room for some tiny micro-optimizations, but I doubt they're worth it.

56906993f95067ec156cf3412f2dabaefce282ad will probably help here as well (once deployed).

hoo updated the task description.

@hoo we should really look into generating RDF from JSON. Can probably be done in a week.

That would mean moving a lot less data from storage over the network. Should be faster. How much, I can't say...
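
A rough sketch of that idea (hypothetical code; buildTriplesForEntity() is a placeholder, not the real Wikibase RDF mapping, and the file name is just an example): stream the already finished JSON dump line by line and derive the RDF output from it, instead of loading every entity from storage a second time.

<?php
// Hypothetical sketch: derive an RDF dump from an existing JSON dump.

/**
 * Placeholder mapping, NOT the real Wikibase RDF output: emits one triple per
 * English label, just to illustrate the data flow.
 */
function buildTriplesForEntity( array $entity ) {
	$id = isset( $entity['id'] ) ? $entity['id'] : null;
	$label = isset( $entity['labels']['en']['value'] ) ? $entity['labels']['en']['value'] : null;
	if ( $id === null || $label === null ) {
		return [];
	}
	return [
		'<http://www.wikidata.org/entity/' . $id . '> ' .
			'<http://www.w3.org/2000/01/rdf-schema#label> ' .
			'"' . addcslashes( $label, "\\\"" ) . '"@en .'
	];
}

// Example input path; read the gzipped JSON dump as a line-oriented stream.
$in = fopen( 'compress.zlib://wikidata-20171023-all.json.gz', 'rb' );

while ( ( $line = fgets( $in ) ) !== false ) {
	// The JSON dump is one entity per line, wrapped in "[" … "]" and
	// separated by trailing commas.
	$line = rtrim( rtrim( $line ), ',' );
	if ( $line === '[' || $line === ']' || $line === '' ) {
		continue;
	}

	$entity = json_decode( $line, true );
	if ( $entity === null ) {
		continue; // skip a malformed line rather than aborting the whole dump
	}

	foreach ( buildTriplesForEntity( $entity ) as $triple ) {
		echo $triple, "\n";
	}
}

fclose( $in );

The real mapping would of course have to cover statements, qualifiers, sitelinks and so on, but the I/O pattern stays the same: one pass over an already compressed file instead of millions of storage reads per dump.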

Mentioned in SAL (#wikimedia-operations) [2017-11-13T17:28:13Z] <hoo> Ran "scap pull" on mwdebug1001 after tests re T177486

Mentioned in SAL (#wikimedia-operations) [2017-11-13T18:09:05Z] <hoo> Ran "scap pull" on mwdebug1001/snapshot1001 after (further) tests re T177486

Change 392670 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/puppet@production] Set Wikidata entity dump batch size to 1500

https://gerrit.wikimedia.org/r/392670

Change 392670 merged by ArielGlenn:
[operations/puppet@production] Set Wikidata entity dump batch size to 1500

https://gerrit.wikimedia.org/r/392670

Mentioned in SAL (#wikimedia-operations) [2017-12-06T20:15:26Z] <hoo> Ran "scap pull" on snapshot1001 after T177486 related tests

hoo updated the task description.

All steps identified here have been done, and now that the dumps are also on PHP 7, I think we can consider this fixed.

The next step to look into will be T147169… but I think the most pressing issue with regard to run time has been solved (for now).