More efficient SPARQL queries for sitelinks
Closed, Resolved, Public

Description

SPARQL queries involving sitelinks are very slow, to the point that it is often impossible to write one without WDQS timing out.

For example, this query, which tries to count the number of Wikidata category items that have Commons sitelinks, does not complete:

PREFIX wd: <http://www.wikidata.org/entity/> 
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>

SELECT (COUNT(DISTINCT ?sitelink) AS ?count) WHERE {

   ?item wdt:P31 wd:Q4167836 .
 
   ?sitelink schema:about ?item .
   ?sitelink schema:inLanguage "en" .     
   FILTER (STRSTARTS(str(?sitelink), "https://commons.wikimedia.org/")) .

}

In contrast, a similar-sized query that does not involve sitelinks completes without trouble:

PREFIX wd: <http://www.wikidata.org/entity/> 
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>

SELECT (COUNT(DISTINCT ?commonscat) AS ?count) WHERE {

   ?item wdt:P31 wd:Q4167836 .
   ?item wdt:P373 ?commonscat
 
}

It would seem that the issue could be resolved by adding new statements to the triplestore, of the form

?item wikibase:hasSitelinkTo wd:Q565

where, in this case, Q565 is the item for Wikimedia Commons.
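
As a rough sketch only, the first query might then reduce to something like the following. Note that wikibase:hasSitelinkTo is the hypothetical predicate proposed in this task, not something the triplestore provides, and the wikibase: prefix IRI shown is an assumption.

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>

# Hypothetical: wikibase:hasSitelinkTo is the proposed predicate and does not exist.
SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {
   ?item wdt:P31 wd:Q4167836 .             # category items
   ?item wikibase:hasSitelinkTo wd:Q565 .  # proposed: item has a sitelink to Commons
}
```

Both triple patterns would then share ?item, as in the P373 query, with no string filter over the full set of sitelinks.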

Jheald created this task. Dec 3 2015, 11:42 AM

The first query matches about 2M triples, but the second only about 300K, so they are not the same size. Also, the second one joins two relationships, while the first one works on the same ?item. That may explain the difference.

Introducing wikibase:hasSitelinkTo would require figuring out the following:

  • How do we group the sitelinks? Sitelinks are just URLs, so how do we know that the "https://commons.wikimedia.org/" part is special? What if a sitelink points to "https://blah.wikiblah.org"?
  • How do we ensure that each sitelink has a group to attach to, and that the group has its own Wikidata entry?
  • How do we know the Q-id of that group when generating the dump?

Please note that the Wikibase codebase is generic and not specific to Wikidata, so each Wikidata-specific piece of functionality has to be configured (as badges are, for example).

Jheald added a comment. Edited Dec 4 2015, 12:49 AM

I think you underline exactly my point.

The size of the two final solution sets is very similar. Per the query above, there are currently 345,221 category-items with a P373; whereas according to Autolist: CLAIM[31:4167836] AND LINK[commonswiki], there are currently 340,027 category-items with a Commons sitelink.

In fact, the total number of Commons sitelinks (680,000) that would be going into the join is rather less than the total number of P373 statements (1,240,000).

But the first query is *much* slower because those 680,000 Commons sitelinks are not easily accessible. Instead, to get there, the system has to join with the set of all 46 million sitelinks, then join again with the set of sitelinks in English (7.7 million), and then filter the result to keep the sitelinks to Commons rather than to en-wiki or en-wikisource.

So it's not surprising that it is far far slower than it would be if relations were available like

?item wikibase:hasSitelinkTo wd:Q565

Now I know very little about the internals of Wikibase, so I don't know what is specific to Wikidata compared to what is generic to Wikibase (or to Wikibase coupled to a generic MediaWiki installation).

But at least for Wikidata, and the Wikidata UI, it does seem that the set of available destination sitelinks is very narrowly controlled, so it would not be too hard to keep track of a list of corresponding items in the Wikibase.

I think doing that would have significant advantages for us.

But if another installation were not keeping such a list, it would be easy enough to set a flag saying that the list was not available, so that no "wikibase:hasSitelinkTo" statements would be maintained, and that installation would be no worse off than we are already.

However, as I understand it, the list of available projects that can be sitelinked to is pretty fundamental in Wikibase, with special code to handle the addition and removal of such sitelinks and reflect them in external database tables, so it should not be too hard to add a routine to have them reflected in the triplestore too.

Lydia_Pintscher triaged this task as Normal priority. Dec 18 2015, 10:26 AM
Lydia_Pintscher set Security to None.
Deskana moved this task from Needs triage to WDQS on the Discovery board. Feb 4 2016, 6:16 AM
Nikki added a subscriber: Nikki. Apr 10 2016, 4:54 PM

Couldn't it use the keys (or whatever the proper word is) from the database for the supported sites? For example, Commons is "commonswiki", so the query could use something like ?item wikibase:hasSitelinkTo someprefix:"commonswiki". That wouldn't depend on specific Wikidata items, only on the list of supported sites.

Smalyshev closed this task as Resolved. Apr 18 2016, 11:24 PM
Smalyshev claimed this task.

You can do it now:

SELECT (COUNT(DISTINCT ?sitelink) AS ?count) WHERE {

   ?item wdt:P31 wd:Q4167836 .
 
   ?sitelink schema:about ?item .
   ?sitelink schema:inLanguage "en" .     
   ?sitelink schema:isPartOf <https://commons.wikimedia.org/> .

}

Until the next data reload, though, the data may be incomplete, since isPartOf is new.

Nikki added a comment. Apr 19 2016, 7:03 AM

What is ?sitelink schema:inLanguage "en" for in that query, and why do I get different numbers with (7167) and without (8375) it?

Actually, for Commons I have no idea what "in language" means. In general, it gives the getLanguageCode() of the site, but I'm not sure what that means for Commons.
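
Since the language of a Commons sitelink is unclear, one option is to leave schema:inLanguage out and rely on schema:isPartOf alone. The following is a sketch under that assumption; it counts distinct items rather than sitelinks, which is a choice made here and not part of the query above.

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>

# Sketch: select Commons sitelinks by site only, with no language constraint.
SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {
   ?item wdt:P31 wd:Q4167836 .                                   # category items
   ?sitelink schema:about ?item .                                # sitelinks describing the item
   ?sitelink schema:isPartOf <https://commons.wikimedia.org/> .  # restrict to Commons
}
```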

I am wondering if it is possible to ask for any Wikipedia, i.e., "https://*.wikipedia.org/". Is there any verb defined for that?

Smalyshev added a comment. Edited May 19 2016, 7:41 PM

@Fnielsen You can use wikibase:wikiGroup, see https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Sitelinks

So you'd do something like ?link schema:isPartOf/wikibase:wikiGroup "wikipedia" .
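
Put into a complete query, that pattern might look like the sketch below, which counts category items with a sitelink to any Wikipedia. The wikibase: prefix IRI is an assumption and may differ on a given endpoint, and a query this broad may still time out.

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>
PREFIX wikibase: <http://wikiba.se/ontology#>

# Sketch: match sitelinks whose site belongs to the "wikipedia" group.
SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {
   ?item wdt:P31 wd:Q4167836 .                              # category items
   ?link schema:about ?item .                               # sitelinks describing the item
   ?link schema:isPartOf/wikibase:wikiGroup "wikipedia" .   # site is in the Wikipedia group
}
```

The property path steps from the sitelink to its site and then to that site's group name, so no string matching on the sitelink URL is needed.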