Page MenuHomePhabricator

Query optimizer for labels should be using sub queries
Open, MediumPublic

Description

When using the label service a lot of queries time out.
One can avoid that by using a subquery for fetching the data and add the labels in the outer query.
The problem is not everybody is aware of it and it makes the query more complex.
So one solution could be that the optimizer for labels does this automatically.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Here’s an example query:

SELECT ?entity ?entityLabel (COUNT(?statement) AS ?count) WHERE {
  hint:Query hint:optimizer "None".
  ?statement wikibase:rank wikibase:DeprecatedRank.
  ?entity ?p ?statement.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
GROUP BY ?entity ?entityLabel
HAVING(?count > 10)
ORDER BY DESC(?count)
LIMIT 100
OFFSET 0

The HAVING and OFFSET modifiers are somewhat artificial (I added them to have a complete example), but apart from that, this is a real, useful query (entities with most deprecated statements).

Here’s the optimized version:

SELECT ?entity ?entityLabel ?count WITH {
  SELECT ?entity (COUNT(?statement) AS ?count) WHERE {
    hint:Query hint:optimizer "None".
    ?statement wikibase:rank wikibase:DeprecatedRank.
    ?entity ?p ?statement.
  }
  GROUP BY ?entity
  HAVING(?count > 10)
  ORDER BY DESC(?count)
  LIMIT 100
  OFFSET 0
} AS %results WHERE {
  INCLUDE %results.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY DESC(?count)
  • The inner SELECT clause consists of the original one, but with all Label variables removed.
  • The outer SELECT clause consists of all the variables of the original one, but without expressions (e. g. just ?count).
  • Of the solution modifiers,
    • GROUP BY is only present on the inner query, and with all Label variables removed,
    • HAVING, LIMIT and OFFSET are only present on the inner query, and
    • ORDER BY is present on both queries.

Of course, this optimization should work with or without explicit rdfs:label parameters for the label service, so it’s probably best to implement this as a separate optimizer which runs after the one that imports rdfs:label parameters from the SELECT clause.

Note here:

SELECT ?entity (COUNT(?statement) AS ?count) WHERE {
    hint:Query hint:optimizer "None".
    ?statement wikibase:rank wikibase:DeprecatedRank.
    ?entity ?p ?statement.
  }
  GROUP BY ?entity ?entityLabel
  HAVING(?count > 10)
  ORDER BY DESC(?count)
  LIMIT 100

you group by non-existing variable (?entityLabel). In order to group properly, and also for group variables work properly in projection clause, it will have to understand a lot about the query structure.

I wonder also why the latter is so much faster - maybe we could do something in service to speed it up...

Smalyshev triaged this task as Medium priority.Jul 5 2017, 11:11 PM

you group by non-existing variable (?entityLabel).

Thanks, fixed.

In order to group properly, and also for group variables work properly in projection clause, it will have to understand a lot about the query structure.

It doesn’t need to understand that much, right? It only needs to know which variables are label variables, so that they can be removed from the inner SELECT and GROUP BY.

(Hm, what happens if the label variables are used in the HAVING or ORDER BY solution modifiers? I guess for now, the optimization could be disabled in that case.)

I did a quick check and the first query form resolves 44326 labels, 23178 of them unique, even though it only needs 100 in the second form. The problem is, the query semantics is different there - there's no way to know which labels should be in those 100 selected, and also the label may have participate in projections, conditionals etc. which could influence the result.

This form:

SELECT ?entity ?entityLabel ?count WHERE {
{
  SELECT ?entity (COUNT(?statement) AS ?count) WHERE {
  ?statement wikibase:rank wikibase:DeprecatedRank.
  ?entity ?p ?statement.
}
GROUP BY ?entity ?entityLabel
HAVING(?count > 10)
ORDER BY DESC(?count)
LIMIT 100
OFFSET 0
}
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} ORDER BY DESC(?count)

Does not require named subquery and also produces the same result. I'm not sure though how to automatically and cleanly extract label service outside. I'll think about it.

This form… does not require named subquery and also produces the same result.

Sure, it should be equivalent… but the non-named subquery seems to be a lot slower for whatever reason (~700 ms vs. ~2500 ms). Is there any downside to using a named subquery?

I found the opposite - the non-named one was faster. I suspect the variance is due to the cache state and the server the query runs on.
The downside is a) more verbose form b) using non-standard SPARQL facility. Also, you don't need to disable the optimizer. Other than that, both should work.

Strange… I was under the impression that named subqueries are typically faster, but in this case it doesn’t seem to make much of a difference. I ran both versions 100 times (P5706), and while regular subqueries have a few slow outliers, overall they seem to perform equally well:

subqueries.png (480×640 px, 6 KB)

Edit: The picture for 1000 iterations looks pretty similar, except that this time the named subqueries have a few outliers instead of the regular ones:

subqueries.png (480×640 px, 12 KB)

Looks like proposed change doesn't provide performance improvement, so closing.

I think there must be a misunderstanding here… it’s not clear whether a named subquery or a non-named subquery is faster, but there is a very definitive performance improvement in adding the (named or regular) subquery over the unoptimized query. You yourself wrote:

the first query form resolves 44326 labels, 23178 of them unique, even though it only needs 100 in the second form.

The point of this issue is to defer label service execution until the results have already been found without the label service, unless that’s not possible (e. g. because the labels are used in a HAVING clause, or in an ORDER BY clause where there is also a LIMIT).

@Smalyshev could you point me to the code in blazegraph where the SERVICE wikibase:label function is implemented?

Hi thank you for this pointer!