
Benchmark RDF dump with foreign ID mapping
Closed, Resolved, Public

Description

@WMDE-leszek wrote https://gerrit.wikimedia.org/r/#/c/345332 for T161592: Account for foreign repositories in RDF mapping. Since RDF generation is a performance hotspot, we should compare profiling data before and after applying the patch, to see what impact it has on performance.

Profiling should be done against a sample of live data, e.g. the first 50k items and properties. The benchmark should be repeated, to make sure it's not skewed by external factors.

Current state of the benchmark is collected in the table below (measured 2017-11-14).
It compares runs on master, the old patch, and the alternative one.

| Name | First run | Second run | Third run | Average | Total run time ratio: master/patched |
| Master | 19.90 | 21.091 | 21.096 | 20.696 | -- |
| https://gerrit.wikimedia.org/r/#/c/345332/26/ | 23.301 | 23.156 | 22.931 | 23.129 | 89.48% |
| https://gerrit.wikimedia.org/r/#/c/390996/1/ | 20.965 | 20.716 | 21.337 | 21.006 | 98.52% |

Event Timeline

daniel triaged this task as High priority. Apr 10 2017, 4:05 PM

Did this performance test happen?

@Smalyshev No, I don't think this was ever picked up. But perhaps @Ladsgroup would like to have a look?

First run on items-repo (items-repo.wmflabs.org) without this patch:

ladsgroup@federated-wikis:/var/www/html$ time php extensions/Wikibase/repo/maintenance/dumpRdf.php > /dev/null
Dumping shard 0/1
...
Processed 7042 entities.

real	0m30.012s
user	0m23.724s
sys	0m0.788s

Second run:

ladsgroup@federated-wikis:/var/www/html$ time php extensions/Wikibase/repo/maintenance/dumpRdf.php > /dev/null
Dumping shard 0/1
...

real	0m27.564s
user	0m22.016s
sys	0m0.652s

Third run:

ladsgroup@federated-wikis:/var/www/html$ time php extensions/Wikibase/repo/maintenance/dumpRdf.php > /dev/null
Dumping shard 0/1
....
real	0m29.570s
user	0m23.360s
sys	0m0.684s

First run on the items-repo with this patch:

ladsgroup@federated-wikis:/var/www/html$ time php extensions/Wikibase/repo/maintenance/dumpRdf.php > /dev/null
Dumping shard 0/1
...

real	0m32.192s
user	0m26.328s
sys	0m0.584s

Second run:

ladsgroup@federated-wikis:/var/www/html$ time php extensions/Wikibase/repo/maintenance/dumpRdf.php > /dev/null
Dumping shard 0/1
...
Processed 7042 entities.

real	0m33.269s
user	0m27.200s
sys	0m0.668s

Third run:

ladsgroup@federated-wikis:/var/www/html$ time php extensions/Wikibase/repo/maintenance/dumpRdf.php > /dev/null
Dumping shard 0/1
...
real	0m35.599s
user	0m29.300s
sys	0m0.748s

First run on the props-repo.wmflabs.org without the patch:

ladsgroup@federated-wikis2:/var/www/html$ time php extensions/Wikibase/repo/maintenance/dumpRdf.php > /dev/null
Dumping shard 0/1
...
Processed 7034 entities.

real	0m40.751s
user	0m23.796s
sys	0m0.616s

Second run:

ladsgroup@federated-wikis2:/var/www/html$ time php extensions/Wikibase/repo/maintenance/dumpRdf.php > /dev/null
Dumping shard 0/1
...

real	0m39.491s
user	0m23.088s
sys	0m0.564s

Third run:

ladsgroup@federated-wikis2:/var/www/html$ time php extensions/Wikibase/repo/maintenance/dumpRdf.php > /dev/null
Dumping shard 0/1
...

real	0m42.495s
user	0m25.188s
sys	0m0.524s

First run on props-repo with the patch:

ladsgroup@federated-wikis2:/var/www/html$ time php extensions/Wikibase/repo/maintenance/dumpRdf.php > /dev/null
Dumping shard 0/1
...
Processed 7034 entities.

real	0m50.007s
user	0m30.688s
sys	0m0.512s

Second run:

ladsgroup@federated-wikis2:/var/www/html$ time php extensions/Wikibase/repo/maintenance/dumpRdf.php > /dev/null
Dumping shard 0/1
...

real	0m44.122s
user	0m27.664s
sys	0m0.480s

Third run:

ladsgroup@federated-wikis2:/var/www/html$ time php extensions/Wikibase/repo/maintenance/dumpRdf.php > /dev/null
Dumping shard 0/1
...
real	0m41.044s
user	0m25.968s
sys	0m0.448s

Things to remember:

  • In props-repo, properties are local and items come from the other repo, and vice versa.
  • The database of both repos lives on the items-repo instance for simplicity; one of the reasons the dump takes some time is, IMO, that it has to read from a remote instance.

@Ladsgroup thank you! So, this looks like a 10-15% increase. Not too horrible, but significant.

It would be good to learn what exactly is causing it. Can you run this through xhprof and see what the bottlenecks are in the methods that were changed by the patch?

I sent you the raw report; I couldn't get the GUI to work.

To be honest, I have no idea how to make sense of the raw data, and I can't find a visualization tool that would take it as input.
@Smalyshev, do you have an idea?

xhprof without the GUI is kinda hard to comprehend... where is it installed? I could try to make the GUI work.

@daniel: The paste file is 35K lines and it's bringing my laptop to its knees :( Can you edit your comment and make it a link?
@Smalyshev: Yeah, on federated-wikis.eqiad.wmflabs (via ssh), in the file /var/www/html/extensions/Wikibase/repo/maintenance/dumpRdf.php, I added the lines for profiling. I tried "XHProfRuns_Default" but it didn't work out, so I commented those parts out. Sorry, I only just heard about XHProf profiling in this Phab card.

@daniel: The paste file is 35K lines and it's bringing my laptop to its knees :( Can you edit your comment and make it a link?

done. sorry!

@Smalyshev: I installed XhGui (also with MongoDB) but I still get this strange error: PHP Fatal error: Class 'XHProfRuns_Default' not found in /var/www/html/extensions/Wikibase/repo/maintenance/dumpRdf.php on line 139. I couldn't find a way to fix it.
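For illustration, a minimal sketch of the kind of instrumentation being attempted in dumpRdf.php. XHProfRuns_Default is defined in the xhprof utility scripts and is not autoloaded, which would explain the fatal error above; the require paths below are assumptions and depend on where the xhprof package is installed.

<?php
// Hypothetical instrumentation around the body of dumpRdf.php.
// Start collecting CPU and memory data in addition to wall time.
xhprof_enable( XHPROF_FLAGS_CPU | XHPROF_FLAGS_MEMORY );

// ... the existing dump logic runs here ...

$profilerData = xhprof_disable();

// XHProfRuns_Default ships with the xhprof utility scripts and must be
// required explicitly; adjust the paths to the local installation.
require_once '/usr/share/xhprof/xhprof_lib/utils/xhprof_lib.php';
require_once '/usr/share/xhprof/xhprof_lib/utils/xhprof_runs.php';

$runs = new XHProfRuns_Default();
$runId = $runs->save_run( $profilerData, 'dumpRdf' );
fwrite( STDERR, "xhprof run id: $runId\n" );

The run ID printed at the end is what the xhprof_html viewer expects in its run= URL parameter.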

The GUI seems to work now, see https://items-repo.wmflabs.org/xhprof_html/index.php?run=5984fc915da57&source=dumpRdf
New runs should be added to https://items-repo.wmflabs.org/xhprof_html/ as they happen. On the filesystem they reside in /var/tmp/xhprof; feel free to clean up those that aren't needed anymore.

Didn't analyze the profile itself yet, probably can get to it next week.

Some analysis:

Master run:
https://items-repo.wmflabs.org/xhprof_html/?run=598b8f9c66fab&source=dumpRdf&sort=excl_wt
Patched run:
https://items-repo.wmflabs.org/xhprof_html/?run=598b8d44818ce&source=dumpRdf&sort=excl_wt

We can see that one of the most expensive functions there is Wikibase\DataModel\Entity\EntityId::splitSerialization. In fact, it is super-expensive even in the master run, second only to the actual SQL queries, and we get almost 3x as many calls to it in the patched run.
So, the immediate suggested action would be to see whether we can eliminate the additional calls to splitSerialization, or better yet, severely reduce or eliminate all of them.

Most of the calls to splitSerialization and all additional ones come from:

  • EntityId::getLocalPart
  • EntityId::getRepositoryName

In the master run, the top callers are:

  • DispatchingEntityIdParser::parse
  • PrefixMappingEntityIdParser::parse

Together, they are 60% of the calls.
EntityId::getLocalPart mainly comes from RdfVocabulary::getEntityLName, and EntityId::getRepositoryName from various snak builders.

Surprisingly, even though the result of these should be static for any given ID, both call child functions on every invocation. I think it should be an easy fix to convert both of those function calls to either public variables or, if that is too offensive to good design, functions returning such variables directly, without additional calls.
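For illustration, a simplified, self-contained sketch of that idea (the class name and splitting logic below are stand-ins, not the actual Wikibase DataModel code): derive the repository name and local part once, so repeated getRepositoryName()/getLocalPart() calls become plain property reads.

<?php
// Simplified illustration: compute the derived parts of an entity ID
// serialization once, in the constructor, instead of re-splitting it
// on every accessor call.
class CachedEntityIdParts {

	/** @var string */
	private $repositoryName;

	/** @var string */
	private $localPart;

	public function __construct( $serialization ) {
		// "foo:bar:Q42" -> repository "foo", local part "bar:Q42";
		// "Q42" -> repository "", local part "Q42".
		$parts = explode( ':', $serialization, 2 );
		if ( count( $parts ) === 2 ) {
			$this->repositoryName = $parts[0];
			$this->localPart = $parts[1];
		} else {
			$this->repositoryName = '';
			$this->localPart = $serialization;
		}
	}

	public function getRepositoryName() {
		return $this->repositoryName;
	}

	public function getLocalPart() {
		return $this->localPart;
	}
}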

I also note we construct way more ID objects than there presumably are distinct IDs (e.g. the PropertyId ctor is called 34,892 times even though there are about 3K properties in Wikidata altogether), so I wonder if caching id->object may be helpful. Caching all IDs may be too much (even though well within range of any of our 128G servers :) but at least an LRU of reasonable size or something? Not sure how easy it would be to add - this cache is probably useless for most web workloads but would be useful for dumps. Maybe we could inject something into WikibaseLib.entitytypes.php? It won't save too much - the ItemId ctor is 3.2% and the PropertyId ctor is 2.8%, but even a 6% gain is not too bad.
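As a rough sketch of what such a bounded cache could look like (the class and the way it would be hooked into an entity ID parser are hypothetical, not existing Wikibase code):

<?php
// Hypothetical bounded serialization -> EntityId object cache with simple
// LRU eviction, keyed by the ID serialization string.
class LruEntityIdCache {

	/** @var object[] serialization => parsed ID object, in LRU order */
	private $cache = [];

	/** @var int */
	private $maxSize;

	public function __construct( $maxSize = 10000 ) {
		$this->maxSize = $maxSize;
	}

	/**
	 * @param string $serialization
	 * @param callable $parse Invoked only on a cache miss, e.g. [ $idParser, 'parse' ]
	 * @return object
	 */
	public function get( $serialization, callable $parse ) {
		if ( isset( $this->cache[$serialization] ) ) {
			// Re-insert the entry at the end of the array to mark it as most recently used.
			$id = $this->cache[$serialization];
			unset( $this->cache[$serialization] );
			$this->cache[$serialization] = $id;
			return $id;
		}

		if ( count( $this->cache ) >= $this->maxSize ) {
			// Evict the least recently used entry (the first key in the array).
			reset( $this->cache );
			unset( $this->cache[key( $this->cache )] );
		}

		return $this->cache[$serialization] = $parse( $serialization );
	}
}

With ~3K distinct properties in the data set, a default in the low thousands would already keep every property ID resident for the whole dump.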

Looking into splitSerialization, we see:

  • Call to assertValidSerialization - we validate the serialization format of the IDs we take from the database 4,665,466 times for 7K IDs. That means we validate each ID over 600 times. Even if there were a case for validating IDs we take from the DB, which I don't see, there is certainly no case for doing it 600+ times. This should be eliminated. Validation takes 1/4 of the time consumed by the function.
  • 4 array functions - explode, array_pop, array_shift, implode. They don't take much time, together being about 5%, but I wonder if it could be simpler. Maybe not, but worth spending some thought.
  • normalizeIdSerialization - this takes 10% of the time and is completely unnecessary, as far as I can see, for IDs coming from the DB. Maybe we should have separate code paths for "dirty" IDs - which come from the user, etc. and get all the validation, transformation and whatnot - and for "clean" IDs which come from a trusted source, so we don't spend so many cycles validating clean data? (A rough sketch of that split follows below.)
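As a rough sketch of what such a split could look like (the method names below are hypothetical and do not exist in Wikibase DataModel):

<?php
// Hypothetical illustration of separate construction paths for untrusted
// ("dirty") and trusted ("clean") ID serializations.
class ItemIdExample {

	/** @var string */
	private $serialization;

	/**
	 * Untrusted input (user input, API parameters): validate and normalize.
	 */
	public static function newFromUserInput( $serialization ) {
		$serialization = trim( $serialization );
		// Simplified validation; the real rules are more involved.
		if ( !preg_match( '/^(\w+:)*Q[1-9]\d*$/', $serialization ) ) {
			throw new InvalidArgumentException( "Not a valid item ID: $serialization" );
		}
		return new self( $serialization );
	}

	/**
	 * Trusted input (IDs read back from the database): skip validation
	 * and normalization entirely.
	 */
	public static function newFromTrustedSerialization( $serialization ) {
		return new self( $serialization );
	}

	private function __construct( $serialization ) {
		$this->serialization = $serialization;
	}

	public function getSerialization() {
		return $this->serialization;
	}
}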

I am not sure I fully appreciate how the IDs work there yet, so we may find further ways to optimize this; let's discuss it. But the above are, I think, the most immediate things.

Thanks @Smalyshev for looking into this! I've just submitted a new patchset of https://gerrit.wikimedia.org/r/#/c/345332 where I've introduced a cache to reduce the number of calls to EntityId::getLocalPart and EntityId::getRepositoryName.

As I am really not an expert in this matter, any suggestion for a good default cache size (i.e. how many IDs should be cached) would be appreciated. I guess a thing to consider would be a mildly reasonable "general" default (say 1K, 10K, etc.), with a bigger number set in the settings for Wikidata(s) so that e.g. properties would most likely all be cached.
With some sane default we could run the benchmark again and see how this helped. If you have any other suggestions, please say so, and I'll try to address them.

Also, if this makes any sense, the code could possibly also be changed so that properties are cached separately, so we're sure each property ID is always processed only a single time.

I've introduced a cache to reduce the number of calls to EntityId::getLocalPart and EntityId::getRepositoryName.

I think we also need to take a good look at why splitSerialization is getting called so much. We do not need to do any transformations to the IDs we get from the database - we get Q123, we add a prefix to it, and we output wd:Q123 or whatever. That should not require calling tons of functions. Yet we do. And we obsessively validate IDs known to be valid. I think we have a case of over-abstraction there, or maybe of using the same API for clean and dirty input, which is sub-optimal from a performance point of view.

I did another round of benchmarking with @WMDE-leszek's new patch and the results are as follows:

Without it:

root@federated-wikis:/var/www/html# time php extensions/Wikibase/repo/maintenance/dumpRdf.php > /dev/null
Dumping shard 0/1
...
Processed 7042 entities.

real	0m24.914s
user	0m20.672s
sys	0m0.396s


root@federated-wikis:/var/www/html# time php extensions/Wikibase/repo/maintenance/dumpRdf.php > /dev/null
Dumping shard 0/1
...
Processed 7042 entities.

real	0m25.464s
user	0m20.988s
sys	0m0.456s


root@federated-wikis:/var/www/html# time php extensions/Wikibase/repo/maintenance/dumpRdf.php > /dev/null
Dumping shard 0/1
...
Processed 7042 entities.

real	0m25.086s
user	0m20.960s
sys	0m0.392s

With:

root@federated-wikis:/var/www/html# time php extensions/Wikibase/repo/maintenance/dumpRdf.php > /dev/null
Dumping shard 0/1
...
Processed 7042 entities.

real	0m25.333s
user	0m21.568s
sys	0m0.384s

root@federated-wikis:/var/www/html# time php extensions/Wikibase/repo/maintenance/dumpRdf.php > /dev/null
Dumping shard 0/1
...
Processed 7042 entities.

real	0m25.793s
user	0m22.044s
sys	0m0.424s

root@federated-wikis:/var/www/html# time php extensions/Wikibase/repo/maintenance/dumpRdf.php > /dev/null
Dumping shard 0/1
...
Processed 7042 entities.

real	0m25.904s
user	0m21.940s
sys	0m0.412s

I think this is bearable given the advantages it has.

Thanks @Ladsgroup for running another round of benchmark.

To make it clear what has actually been benchmarked: I've updated the same patch that was subject to analysis before, i.e. https://gerrit.wikimedia.org/r/#/c/345332/22. The patch itself has not really changed; I've just rebased it on the current state of master.
Additionally, based on some comments @Smalyshev made, I've made some changes to several methods of EntityId (i.e. I didn't really make any changes to the Wikibase patch).

Namely, I've changed EntityId::getRepositoryName and EntityId::getLocalPart so that they no longer split etc. the ID serialization on every call but generate those parts only once. The actual change (which is the only thing changed for the whole new benchmark run) is https://github.com/wmde/WikibaseDataModel/commit/81f4186ea81f890769931322f4b36a3f6a0ae68d.
The other issues pointed out regarding how some things are done in the EntityId class are still there. These would require a more fundamental change, so I have decided to avoid them if possible.

@Smalyshev @daniel any comments or concerns about the new attempt? If the approach seems fine, and the performance of the new code acceptable, I'd turn https://github.com/wmde/WikibaseDataModel/commit/81f4186ea81f890769931322f4b36a3f6a0ae68d into an actual pull request to the DataModel library, and once it's in, https://gerrit.wikimedia.org/r/#/c/345332 could possibly be merged.

I guess it might still be interesting to look at what the profiler says about the changed DataModel code. I believe @Ladsgroup could provide that if it would be of help.

@WMDE-leszek The approach looks fine, and should help. The question is if it's sufficient.

The benchmark looks pretty good. I'll review the patches a bit later (I have a cold right now and it's hard for me to concentrate) and if the code is OK I see no reason not to merge it. We could also do another xhprof run before that to see if there are any other things we could improve.

Thanks @daniel and @Smalyshev for the feedback so far.
Submitted https://github.com/wmde/WikibaseDataModel/pull/767 to get those changes made. Feel free to review and comment wherever you find applicable.

With the alternative approach:

Without it (on master):

root@federated-wikis:/var/www/html# time php extensions/Wikibase/repo/maintenance/dumpRdf.php > /dev/null
Dumping shard 0/1
...
Processed 7042 entities.

real	0m21.666s
user	0m18.016s
sys	0m0.376s

root@federated-wikis:/var/www/html# time php extensions/Wikibase/repo/maintenance/dumpRdf.php > /dev/null
Dumping shard 0/1
...
Processed 7042 entities.

real	0m21.303s
user	0m17.436s
sys	0m0.504s

root@federated-wikis:/var/www/html# time php extensions/Wikibase/repo/maintenance/dumpRdf.php > /dev/null
Dumping shard 0/1
...
Processed 7042 entities.

real	0m20.083s
user	0m16.468s
sys	0m0.352s

Patches I cherry-picked:

commit 40df4ad514a32c15b8b84e4e204a8f5edf3b7b20
Author: Leszek Manicki <leszek.manicki@wikimedia.de>
Date:   Wed Apr 5 11:45:12 2017 +0200

    [WIP] Consider foreign entities in RDF mapping
    
    Namespaces used in clauses related to foreign entities now
    contain the repository name (prefix) of the repository
    the entity belongs to.
    
    Repository prefix is also dropped from the "local name" of
    the foreign entity.
    
    DO NOT MERGE before T162371 is resolved!
    
    This is I6acb00b23425bdce4dd03227415e593595b3c1cc without
    the cache in RdfVocabulary.
    
    Change-Id: I1fe6c6ba92d5992d1b6d1802a20fe8b47033443a

commit b63f84d779115e1cd849a81c90873eff678ca345
Author: Leszek Manicki <leszek.manicki@wikimedia.de>
Date:   Thu Nov 2 13:39:28 2017 +0100

    Use DataModel with more efficient EntityId::getRepositoryName and ::getLocalPart
    
    Change-Id: I19a9cf6f5c9b1fd5c3e1a9b2bf77a36c3d1d4739

commit e2069fbc8deaf1fad735b217c4d112daca137b54
Author: Thiemo Mättig <thiemo.maettig@wikimedia.de>
Date:   Thu Nov 9 18:17:12 2017 +0100

    Reformatting in preparation for foreign entities in RDF mapping
    
    This is some very basic reformating, only done to make the diffs in
    I6acb00b much easier to review.
    
    Bug: T161592
    Change-Id: I9aff0eb930cd755582505b641385932ef9217c36

commit 22cf861cb9e49928876b007540a3f89650c1d30c
Merge: b824931 b9b1ec8
Author: jenkins-bot <jenkins-bot@gerrit.wikimedia.org>
Date:   Tue Nov 14 08:23:05 2017 +0000

    Merge "Add noop hint for Descriptions too"

With it:

root@federated-wikis:/var/www/html# time php extensions/Wikibase/repo/maintenance/dumpRdf.php > /dev/null
Dumping shard 0/1
...
Processed 7042 entities.

real	0m19.684s
user	0m15.984s
sys	0m0.372s


root@federated-wikis:/var/www/html# time php extensions/Wikibase/repo/maintenance/dumpRdf.php > /dev/null
Dumping shard 0/1
...
Processed 7042 entities.

real	0m20.478s
user	0m16.812s
sys	0m0.380s

root@federated-wikis:/var/www/html# time php extensions/Wikibase/repo/maintenance/dumpRdf.php > /dev/null
Dumping shard 0/1
...
Processed 7042 entities.

real	0m20.139s
user	0m16.560s
sys	0m0.328s

Here is a table of the runs on master, the old patch, and the alternative one.

| Name | First run | Second run | Third run | Average |
| Master | 19.90 | 21.091 | 21.096 | 20.696 |
| Old patch | 23.301 | 23.156 | 22.931 | 23.129 |
| Alternative patch | 20.965 | 20.716 | 21.337 | 21.006 |

Thanks @Ladsgroup!

I've copied the table over to the task description so it is more visible. I've also added links to the actual patches tested, and added a column with the ratio of master's average run time to the patched code's average, as a rough measure of how much slower the patched code is.

@daniel, @Smalyshev: do you think there is enough data to make a decision if we go with any of the patches, or do we try some other approach/further optimizations?
If either of the patches is fine in general, I'd abandon the other one and ask you for the actual code review then!

I believe a 1.5% slowdown is acceptable for the functionality. I will review the code soon and add my comments.

hoo claimed this task.

As far as I can tell this is all done.