Page MenuHomePhabricator

Queries with wikibase:statements or wikibase:sitelinks are slow
Closed, ResolvedPublic

Description

I would expect this query to be very efficient:

SELECT ?item ?sitelinks WHERE {
  ?item wikibase:statements ?sitelinks.
}
ORDER BY DESC(?sitelinks)
LIMIT 10

But it times out. The same query with another numeric predicate, like wdt:P2067 (mass), is fairly efficient. I suspect Blazegraph for some reason doesn’t have an index for sorted wikibase:statements (or whatever the equivalent Blazegraph term is)… perhaps because it’s xsd:integer instead of xsd:decimal?

Same situation for wikibase:sitelinks.

Motivation: a Wikipedia user suggested searching for items wtih most sitelinks but no dewiki article. This query should find such items, but also times out:

SELECT ?item ?sitelinks WHERE {
  ?item wikibase:sitelinks ?sitelinks.
  MINUS {
    ?deArticle schema:about ?item;
               schema:isPartOf <https://de.wikipedia.org/>.
  }
}
ORDER BY DESC(?sitelinks)
LIMIT 10

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Smalyshev triaged this task as Medium priority.Sep 25 2017, 12:10 AM

The first query essentially has to go through all 38 millions of items and sort sitelink values for them. I don't think this can be done efficiently.

I found a workaround by poking around the query hints:

# items with most statements
SELECT ?item ?itemLabel ?statements WHERE {
  ?item wikibase:statements ?statements. hint:Prior hint:rangeSafe true.
  FILTER(?statements >= 1000)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY DESC(?statements)
LIMIT 10

With the rangeSafe hint (see documentation – “push range filter down onto an index”, which we can do because our predicates are, unknown value aside, strongly typed), the FILTER suddenly becomes magically efficient, and the whole query completes in a few hundred milliseconds. Unfortunately, this requires the FILTER, with some value that you assume won’t actually limit the results (i. e., you assume the top ten results have well above 1000 statements) – apparently the ORDER BY + LIMIT combination isn’t something that BlazeGraph “pushes down” similarly.

Smalyshev claimed this task.

Given that we have a workaround and I don't see how we could do anything else for now, I'm closing this unless there are any other ideas of what can be done.

I'm running into this problem too. Queries are slow or even time out for chemicals. The hints to do not seem to improve the query time significantly:

SELECT ?wikis ?compound WHERE {
  ?compound wdt:P31 wd:Q11173 ;
            wikibase:sitelinks ?wikis . hint:Prior hint:rangeSafe true .
  FILTER(NOT EXISTS {?compound wdt:P2119 []})
} ORDER BY DESC(?wikis)
LIMIT 100

@EgonWillighagen the rangeSafe hint only helps when there is a FILTER, not with ORDER BY. This completes within three seconds:

SELECT ?wikis ?compound WHERE {
  ?compound wdt:P31 wd:Q11173 ;
            wikibase:sitelinks ?wikis . hint:Prior hint:rangeSafe true .
  FILTER(?wikis >= 50)
  FILTER(NOT EXISTS {?compound wdt:P2119 []})
} ORDER BY DESC(?wikis)
LIMIT 100

(Though I had to make a few tries before I determined >= 50 as the threshold; > 50 only produced 98 results, so you may want to use a slightly lower limit to ensure you always get 100 results back.)

Yes, in the end we want data for all chemicals, but this is a good tradeoff. I'll implement! Thanks!