Use a retrieve only CachingEntityRevisionLookup for dumps
Closed, Resolved (Public)

Description

We aren't currently making use of the normal entity caching mechanism when dumping entities, as we don't want to pollute the cache with millions of otherwise unused entities. I did some testing and got the following numbers:

The following tests were done with a caching entity revision lookup (as obtained from SqlStore) vs. a non-caching one. All tests used entity prefetching.

Random access to 5,000 entities in batches of 500:

Uncached:
1. 58.209151029587
2. 57.370378017426
3. 57.560675144196

Cached:
1. 55.036426067352
2. 54.444396972656
3. 53.03471493721

Q1 - Q5001 in batches of 500:

Uncached:
1. 109.29708003998
2. 83.949445009232

Cached:
1. 36.519347190857
2. 33.664337873459
3. 34.130303859711

Q1000000 - Q1005001 in batches of 500:

Uncached:
1. 47.415897130966
2. 20.073761940002

Cached:
1. 12.881020069122
2. 11.010383844376

Q40000000 - Q40005001 in batches of 500:

Uncached:
1. 51.220274925232

Cached:
1. 59.866926193237

Q41000000 - Q41005001 in batches of 500 (cached):
1. 58.677440881729

Q39000000 - Q39005001 in batches of 500 (uncached):
1. 51.507272005081

While this suggests some speedup (around 6.5% for random access), these timings still include the cost of writing cache misses to memcached. I therefore hacked together a retrieve-only cache and tested it on mwdebug1001.

Access to 15,000 random entities (in batches of 500) with retrieve-only cache:

Uncached:
1. 85.038383960724
2. 89.931836843491

Cached:
1. 75.183009147644
2. 74.581010103226

This suggests a speedup of over 15% when accessing random entities, which is a workload that should be rather similar to what the dumpers do (they need to access all entities in the end).
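For reference, the percentages quoted above can be reproduced from the listed timings. This is a minimal sketch, assuming "speedup" means the ratio of mean uncached time to mean cached time, minus one; the helper names `mean` and `speedupPercent` and the array labels are made up for illustration.

```php
<?php
// Timings in seconds, copied from the random-access runs listed above.
$runs = [
	'regular caching lookup, 5,000 random entities' => [
		'uncached' => [ 58.209151029587, 57.370378017426, 57.560675144196 ],
		'cached' => [ 55.036426067352, 54.444396972656, 53.03471493721 ],
	],
	'retrieve-only cache, 15,000 random entities' => [
		'uncached' => [ 85.038383960724, 89.931836843491 ],
		'cached' => [ 75.183009147644, 74.581010103226 ],
	],
];

// Arithmetic mean of a list of timings.
function mean( array $values ): float {
	return array_sum( $values ) / count( $values );
}

// Speedup as a throughput gain: how much more work fits into the same time.
// This definition is an assumption, not something stated in the task.
function speedupPercent( array $uncached, array $cached ): float {
	return ( mean( $uncached ) / mean( $cached ) - 1 ) * 100;
}

foreach ( $runs as $label => $times ) {
	printf( "%s: %.1f%%\n", $label, speedupPercent( $times['uncached'], $times['cached'] ) );
}
```

Under that definition the two cases come out at roughly 6.5% and 16.8%, in line with the figures given in the description.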

Code used (making use of the 'retrieve-only' hack):

<?php
$wikibaseRepo = Wikibase\Repo\WikibaseRepo::getDefaultInstance();
$entityPrefetcher = $wikibaseRepo->getStore()->getEntityPrefetcher();
$revisionLookup = $wikibaseRepo->getEntityRevisionLookup( 'uncached' );

// Wrap the uncached lookup; the fifth argument is the 'retrieve-only' hack.
$revisionLookup = new Wikibase\Lib\Store\CachingEntityRevisionLookup(
	$revisionLookup,
	wfGetCache( $GLOBALS['wgMainCacheType'] ),
	60 * 60 * 24,
	'wikibase_shared/wikidata_1_31_0_wmf_3-wikidatawiki-hhvm:WikiPageEntityRevisionLookup',
	'retrieve-only'
);

$entityLookup = new Wikibase\Lib\Store\RevisionBasedEntityLookup( $revisionLookup );

$t0 = microtime( true );
// 15 batches of 500 random item IDs each.
for ( $i = 0; $i < 15; $i++ ) {
	$toFetch = [];
	for ( $j = 0; $j < 500; $j++ ) {
		$toFetch[] = new \Wikibase\DataModel\Entity\ItemId( 'Q' . mt_rand( 1, 42025257 ) );
	}
	$entityPrefetcher->prefetch( $toFetch );
	foreach ( $toFetch as $itemId ) {
		try {
			$entityLookup->getEntity( $itemId );
		} catch ( Wikibase\Lib\Store\RevisionedUnresolvedRedirectException $e ) {
			// Random IDs can hit unresolved redirects; ignore them.
		}
	}
}
// Elapsed wall-clock time in seconds.
echo microtime( true ) - $t0;
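To illustrate what "retrieve-only" means here, the following is a self-contained sketch of the idea: a lookup decorator that answers from the cache when it can, but never writes misses back. The interface and class names (`RevisionLookup`, `ArrayBackedLookup`, `RetrieveOnlyCachingLookup`) are hypothetical and are not the Wikibase classes; an `ArrayObject` stands in for memcached, and checks for stale cache entries are left out entirely.

```php
<?php
// Hypothetical, simplified interface; NOT the Wikibase EntityRevisionLookup.
interface RevisionLookup {
	public function getRevision( string $id ): ?array;
}

// Stand-in for the database-backed lookup; counts how often it is hit.
class ArrayBackedLookup implements RevisionLookup {
	public $calls = 0;
	private $rows;

	public function __construct( array $rows ) {
		$this->rows = $rows;
	}

	public function getRevision( string $id ): ?array {
		$this->calls++;
		return $this->rows[$id] ?? null;
	}
}

// Reads from the cache, but never populates it on a miss.
class RetrieveOnlyCachingLookup implements RevisionLookup {
	private $inner;
	private $cache;

	public function __construct( RevisionLookup $inner, ArrayObject $cache ) {
		$this->inner = $inner;
		$this->cache = $cache;
	}

	public function getRevision( string $id ): ?array {
		if ( isset( $this->cache[$id] ) ) {
			// Cache hit: the underlying store is not touched.
			return $this->cache[$id];
		}
		// Cache miss: fall through to the store and deliberately skip the
		// write-back, so bulk reads do not fill the cache.
		return $this->inner->getRevision( $id );
	}
}

$cache = new ArrayObject( [ 'Q1' => [ 'id' => 'Q1', 'revision' => 10 ] ] );
$store = new ArrayBackedLookup( [
	'Q1' => [ 'id' => 'Q1', 'revision' => 10 ],
	'Q2' => [ 'id' => 'Q2', 'revision' => 7 ],
] );
$lookup = new RetrieveOnlyCachingLookup( $store, $cache );

$lookup->getRevision( 'Q1' ); // served from the cache
$lookup->getRevision( 'Q2' ); // served from the store, not cached afterwards

echo $store->calls, ' ', count( $cache ), "\n"; // prints "1 1"
```

The store is consulted only for the miss, and the cache still holds a single entry afterwards, which is the property that keeps a full dump from polluting the shared cache.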

Patch-For-Review:

Event Timeline

Change 384322 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/extensions/Wikibase@master] Introduce retrieve-only CachingEntityRevisionLookups and make use

https://gerrit.wikimedia.org/r/384322

I just ran a patched version of dumpJson along with an unmodified version (on mwdebug1001/mwdebug1002). While both servers have the same specs, the runs might not be totally comparable and I only did one run.

I ran:

sudo -u www-data timeout 1800 php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpJson.php --wiki wikidatawiki --sharding-factor 4 --shard 0 --snippet > /dev/null

Within the 1800-second timeout, the currently deployed version managed to dump 109223 entities, while the modified version managed to dump 194039. That's a speedup of 77.7%!

Note that these results might be inflated, as entities with lower IDs probably have a much better hit rate in the entity cache than those with higher IDs (which are used less often).
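As a quick check of the dump comparison, the entity counts and the 1800-second timeout translate into throughput as follows. This is only a sketch of the arithmetic; the function name `throughputGainPercent` is made up for illustration, and the single-run caveat above still applies.

```php
<?php
// Figures taken from the comment above.
$seconds = 1800;    // from `timeout 1800`
$deployed = 109223; // entities dumped by the deployed version
$patched = 194039;  // entities dumped by the modified version

// Relative gain in entities dumped within the same time window.
function throughputGainPercent( int $baseline, int $improved ): float {
	return ( $improved / $baseline - 1 ) * 100;
}

printf( "deployed: %.1f entities/s\n", $deployed / $seconds );
printf( "patched:  %.1f entities/s\n", $patched / $seconds );
printf( "speedup:  %.1f%%\n", throughputGainPercent( $deployed, $patched ) );
```

This works out to about 60.7 versus 107.8 entities per second, matching the 77.7% quoted above.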

Change 386382 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/extensions/Wikibase@master] Use CacheRetrievingEntityRevisionLookup for dumps etc.

https://gerrit.wikimedia.org/r/386382

Change 384322 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add CacheRetrievingEntityRevisionLookup EntityRevisionCache

https://gerrit.wikimedia.org/r/384322

Change 386382 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Use CacheRetrievingEntityRevisionLookup for dumps etc.

https://gerrit.wikimedia.org/r/386382

thiemowmde removed a project: Patch-For-Review.
thiemowmde updated the task description.
thiemowmde moved this task from Review to Done on the Wikidata-Former-Sprint-Board board.