Micro optimize Wikibase\DataModel\Deserializers\StatementDeserializer::deserialize
Closed, Resolved, Public

Description

This function is called very often, for example when generating dumps (once for each of our 130M+ statements), and it turned out to be rather slow when I profiled the dumpers locally.

By inlining some of the function calls it makes (or even just the ifs in the functions called by getDeserialized), we can probably gain a significant speedup.
While experimenting, I got a speedup of about 6-7% by inlining getDeserialized into deserialize and by pulling the initial guard ifs from the set… functions up into the calling code.
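
To illustrate the idea, here is a minimal sketch of hoisting guard checks out of helper functions and into the caller. It is written in Python for brevity, while the real class is PHP; the `Statement` class, the helper names, and the serialization keys are hypothetical stand-ins and not the actual Wikibase code.

```python
# Illustrative sketch only: all names and keys below are hypothetical.


class Statement:
    def __init__(self, main_snak):
        self.main_snak = main_snak
        self.rank = "normal"
        self.qualifiers = []


def set_rank(serialization, statement):
    # Initial guard: when the key is absent, the call did no useful work.
    if "rank" not in serialization:
        return
    statement.rank = serialization["rank"]


def set_qualifiers(serialization, statement):
    # Same pattern: guard first, then the actual work.
    if "qualifiers" not in serialization:
        return
    statement.qualifiers = list(serialization["qualifiers"])


def deserialize_with_calls(serialization):
    # Before: every helper is called unconditionally.
    statement = Statement(serialization["mainsnak"])
    set_rank(serialization, statement)
    set_qualifiers(serialization, statement)
    return statement


def deserialize_with_hoisted_guards(serialization):
    # After: the guards live in the caller and the helper bodies are inlined,
    # so no function call happens at all for absent keys.
    statement = Statement(serialization["mainsnak"])
    if "rank" in serialization:
        statement.rank = serialization["rank"]
    if "qualifiers" in serialization:
        statement.qualifiers = list(serialization["qualifiers"])
    return statement
```

Both variants return the same result; the second one only avoids call overhead, which is why any gain has to be confirmed by profiling rather than assumed.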

Event Timeline

I profiled this in production (on mwdebug1001, using HHVM and dumping the first 40,000 entities from shard 0/4):

Command line:
sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpJson.php --wiki wikidatawiki --sharding-factor 4 --shard 0 --snippet --limit 40000 --profiler=text

Result extract (share of run time, time in ms, number of calls):
43.96% 713097.698 814618 - Wikibase\DataModel\Deserializers\StatementDeserializer::deserialize

Given that our dumps (all shards combined) have an average run time of about 65 hours, we spend almost 29 hours of each run in this function. Even a small saving of 5% there would cut roughly 85 minutes from each dump run.
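
The estimate can be checked with a few lines of arithmetic. This sketch assumes that the 43.96% share measured on the 40,000-entity sample holds for a full dump run, which the profile alone does not prove; the function names are only illustrative.

```python
# Back-of-the-envelope estimate from the figures quoted above.
AVERAGE_DUMP_HOURS = 65.0   # average run time of all shards combined
DESERIALIZE_SHARE = 0.4396  # share of run time reported by the profiler


def hours_in_function(total_hours, share):
    """Time spent in the function, assuming the share holds for the whole run."""
    return total_hours * share


def minutes_saved(total_hours, share, relative_saving):
    """Minutes saved per run if the function gets faster by relative_saving."""
    return hours_in_function(total_hours, share) * relative_saving * 60.0


print(f"hours in deserialize: {hours_in_function(AVERAGE_DUMP_HOURS, DESERIALIZE_SHARE):.1f}")
for saving in (0.05, 0.06, 0.07):
    minutes = minutes_saved(AVERAGE_DUMP_HOURS, DESERIALIZE_SHARE, saving)
    print(f"saving of {saving:.0%}: {minutes:.0f} minutes per dump run")
```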

I did another test with the unmodified class (on mwdebug1002):
43.58% 727005.407 815558 - Wikibase\DataModel\Deserializers\StatementDeserializer::deserialize (total time 1668083.546ms)

With the optimized script (same command line, on mwdebug1001):
41.61% 692661.326 815574 - Wikibase\DataModel\Deserializers\StatementDeserializer::deserialize (total time 1664704.396ms)

Both scripts ran at about the same time and had very similar total run times (especially if you discount the changes in deserialize).

Considering these numbers, we get a speedup of about 5% with the modified version.

Another run after a few more minor modifications (mwdebug1001 and mwdebug1002 side by side, both using db1082, command line as above):

Optimized (mwdebug1001):
43.46% 651201.869 815697 - Wikibase\DataModel\Deserializers\StatementDeserializer::deserialize (total time 1498537.701ms)
Old (mwdebug1002):
44.27% 727679.741 815698 - Wikibase\DataModel\Deserializers\StatementDeserializer::deserialize (total time 1643905.307ms)

That is a speedup of 11.7% for deserialize. Keep in mind, however, that the job on mwdebug1001 was 4.4% faster overall than the one on mwdebug1002, even with the time saved by this optimization discounted, so not all of the difference can be attributed to the code change.
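
The thread does not spell out how these percentages were derived. The following sketch shows one reading that reproduces them from the profiler figures: the speedup as the ratio of old to new time in deserialize, and the host difference as the gap in total run time after subtracting the time saved in deserialize from the old total. The function names are made up for this example.

```python
def speedup_percent(old_ms, new_ms):
    """Speedup of the new run relative to the old one, as a percentage."""
    return (old_ms / new_ms - 1.0) * 100.0


def faster_percent_excluding_saving(old_total, new_total, old_fn, new_fn):
    """How much faster the new job was overall, with the time saved in the
    optimized function removed from the old total first."""
    saved = old_fn - new_fn
    old_adjusted = old_total - saved
    return (old_adjusted - new_total) / old_adjusted * 100.0


# First comparison: unmodified (mwdebug1002) vs. optimized (mwdebug1001).
first = speedup_percent(727005.407, 692661.326)

# Second comparison, after a few more minor modifications.
second = speedup_percent(727679.741, 651201.869)
host_gap = faster_percent_excluding_saving(
    old_total=1643905.307,
    new_total=1498537.701,
    old_fn=727679.741,
    new_fn=651201.869,
)

print(f"first run speedup:  {first:.1f}%")
print(f"second run speedup: {second:.1f}%")
print(f"host difference:    {host_gap:.1f}%")
```

Since the two hosts differ by several percent even outside deserialize, a single pair of runs cannot fully separate the code change from host noise; repeating the comparison with the hosts swapped would be one way to check.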

Considering T157380#3009442, the impact of this is potentially much bigger than anticipated above (if the database interaction time is reduced by such a factor). I didn't benchmark that, though.

thiemowmde triaged this task as Medium priority.
thiemowmde moved this task from Review to Done on the Wikidata-Former-Sprint-Board board.
thiemowmde removed a project: Patch-For-Review.
thiemowmde moved this task from incoming to in progress on the Wikidata board.