Page MenuHomePhabricator

Retrieving labels via SPARQL tanks query performance
Closed, ResolvedPublic

Description

Since T237276 and 781934d, labels for groupings are retrieved in SPARQL instead of using the MediaWiki API.

There were several good reasons for that move, but one unexpected outcome is that it tanks the performance of the grouping query.

When investigating www.wikidata.org/wiki/Wikidata:WikiProject_sum_of_all_paintings/Property_statistics/Sandbox, I realized the grouping SPARQL query is now timing out.

SELECT ?grouping (SAMPLE(?_higher_grouping) as ?higher_grouping) ?grouping_link_value (COUNT(DISTINCT ?entity) as ?count) WHERE {
  ?entity wdt:P31 wd:Q3305213 .
  ?entity wdt:P195 ?grouping .
  OPTIONAL { ?grouping wdt:P17/wdt:P298 ?_higher_grouping }.
  OPTIONAL {{
    ?grouping rdfs:label ?labelMUL.
    FILTER(lang(?labelMUL)='mul')
  }}.
  OPTIONAL {{
    ?grouping rdfs:label ?labelEN.
    FILTER(lang(?labelEN)='en')
  }}.
  BIND(COALESCE(?labelEN, ?labelMUL) AS ?grouping_link_value).
} GROUP BY ?grouping ?higher_grouping ?grouping_link_value
HAVING (?count >= 100)
ORDER BY DESC(?count)
LIMIT 1000

does not resolve in time, while the previous version without the labels returns correctly.

Can this query be optimized somehow?

Event Timeline

One potential idea: using subqueries:

SELECT ?grouping ?higher_grouping ?grouping_link_value (COUNT(DISTINCT ?entity) as ?count) 
WITH {
  SELECT ?grouping (SAMPLE(?_higher_grouping) as ?higher_grouping) (COUNT(DISTINCT ?entity) as ?count) WHERE {
    ?entity wdt:P31 wd:Q3305213 .
    ?entity wdt:P195 ?grouping .
    OPTIONAL { ?grouping wdt:P17/wdt:P298 ?_higher_grouping }.
  }
  GROUP BY ?grouping ?higher_grouping
  HAVING (?count >= 100)
  LIMIT 1000
} AS %items
WHERE {
  INCLUDE %items.
  OPTIONAL {
    ?grouping rdfs:label ?labelMUL.
    FILTER(lang(?labelMUL)='mul')
  }.
  OPTIONAL {
    ?grouping rdfs:label ?labelEN.
    FILTER(lang(?labelEN)='en')
  }.
  BIND(COALESCE(?labelEN, ?labelMUL) AS ?grouping_link_value).
}
GROUP BY ?grouping ?higher_grouping ?grouping_link_value
ORDER BY DESC(?count)

does not time out.

However:

  • this will not play well with how I currently compose the SPARQL query
  • my understanding is that subqueries are not standard, so it could mess with things like T385749

It's always better to have as few statements as possible withn Group By, yes.
This should be as efficient but dtill be standard sparql (without WITH block) :
SELECT ?grouping ?higher_grouping ?grouping_link_value ?count WHERE {
{
SELECT ?grouping (COUNT(DISTINCT ?entity) as ?count) WHERE {

?entity wdt:P31 wd:Q3305213 .
?entity wdt:P195 ?grouping .

} GROUP BY ?grouping ?higher_grouping ?grouping_link_value
HAVING (?count >= 100)
}

OPTIONAL { ?grouping wdt:P17/wdt:P298 ?higher_grouping }.
OPTIONAL {
  ?grouping rdfs:label ?labelMUL.
  FILTER(lang(?labelMUL)='mul')
}.
OPTIONAL {
  ?grouping rdfs:label ?labelEN.
  FILTER(lang(?labelEN)='en')
}.
BIND(COALESCE(?labelEN, ?labelMUL) AS ?grouping_link_value).

}
ORDER BY DESC(?count)
LIMIT 1000

Mentioned in SAL (#wikimedia-cloud) [2025-10-27T09:07:10Z] <wmbot~jeanfred@tools-bastion-15> Deploy 69669c0 (Optimize 'get_grouping_information_query' query using subquery) for T400480

JeanFred claimed this task.

I think 69669c0 is good enough as a fix.