
Does WDQS pre-declare <https://fr.wikipedia.org/> for schema:isPartOf ?
Closed, Invalid, Public

Description

I might be mistaken, but it seems WDQS does not pre-declare all the known Wikipedia values that are used for each sitelink. Would it make sense to pre-declare them to improve performance?

Event Timeline

Smalyshev moved this task from Incoming to Need investigation on the Wikidata-Query-Service board.

Yes, the wiki URLs are not part of the initial vocabulary. I'm not sure whether it is worth the trouble. It might indeed improve performance slightly, mostly for huge wikis like enwiki, but it would probably require some configuration and reduce the generality of the solution. It needs investigation to show whether it is beneficial.

There are about 60 million isPartOf statements, and I assume that all of them have the root of a wiki as their object. The space saving would be 9 - 2 = 7 bytes per statement, roughly 0.4 to 0.5 GB in total, so not very significant, but it would also eliminate a lookup for each value. I wonder how we can measure the performance benefit. I couldn't run the count(distinct ?obj) query because it timed out.
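The estimate above can be reproduced as a back-of-the-envelope calculation. This is only a sketch: the statement count and the 9-byte and 2-byte sizes are the figures quoted in the comment, not measured values, and the function name is hypothetical.

```python
def estimated_savings_bytes(statements: int, bytes_before: int, bytes_after: int) -> int:
    """Total bytes saved if each statement's object shrinks from bytes_before to bytes_after."""
    return statements * (bytes_before - bytes_after)


# Approximate number of schema:isPartOf statements quoted in the comment (assumption).
ISPARTOF_STATEMENTS = 60_000_000

saved = estimated_savings_bytes(ISPARTOF_STATEMENTS, bytes_before=9, bytes_after=2)
print(f"{saved / 1e9:.2f} GB")  # 60e6 * 7 bytes -> prints "0.42 GB"
```

This covers storage only; the cost of the per-value lookup that pre-declaring would remove is not captured here and would still need a benchmark.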

On the same topic, the language code for schema:inLanguage (e.g. "en") has the same issue: there are also about 60 million of them, with about 10 million being English, so that's another ~0.5 GB, plus some unknown performance benefit. I wonder if it would make sense to pre-declare only the top 10 languages and top 10 wikis. I suspect it would be almost as beneficial, without having to maintain an ever-changing list.
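One way to check the "top 10" intuition would be to compute what share of all statements the N most frequent values cover. The following is a minimal sketch, assuming per-value counts have already been obtained by some other means (for example from a dump, since the live query timed out); the function name and the toy counts are hypothetical and are not real WDQS figures.

```python
from collections import Counter


def top_n_coverage(counts: Counter, n: int) -> float:
    """Fraction of all statements whose value is among the n most frequent values."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(n))
    return covered / total


# Toy counts for illustration only (hypothetical, not real data).
toy_counts = Counter({"en": 10, "de": 5, "fr": 3, "xx": 2})
print(top_n_coverage(toy_counts, 2))  # (10 + 5) / 20 -> prints 0.75
```

If the coverage at N = 10 turned out to be close to 1.0 on real counts, a short fixed list would capture most of the benefit of pre-declaring every value.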

I am not sure these micro-optimizations are worth the increased complexity... Maybe we need a test to see whether they really make any noticeable difference.

Gehel subscribed.

We might reopen this if we observe performance or space issues related to this. At the moment there isn't a clear benefit.