Page MenuHomePhabricator

SPARQL endpoint should gracefully handle cycles and loops in transitive properties
Closed, ResolvedPublic

Description

Transitive properties should be a-cyclic, but the data on wikidata.org is imperfect, so the query endpoint should be able to cope with the presence of cycles. Currently, such cycles seem to result in a timeout, causing basic queries like "like me all cities in X" to fail. Example query:

prefix wd: <http://www.wikidata.org/entity/>
prefix wdt: <http://www.wikidata.org/prop/direct/>

SELECT DISTINCT ?city WHERE {
  ?city wdt:P31/wdt:P279* wd:Q515 .  # find instances of subclasses of city
  ?city wdt:P131* wd:Q1202 .              # ...located in Saxony.
}

(run query)

Event Timeline

daniel created this task.Oct 22 2015, 5:44 PM
daniel assigned this task to Smalyshev.
daniel raised the priority of this task from to High.
daniel updated the task description. (Show Details)
daniel added a subscriber: daniel.
Restricted Application added a project: Discovery. · View Herald TranscriptOct 22 2015, 5:44 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

We seem to have loops in P131, including entries which are P131 to themselves. This clearly needs fixing in data, but loop detection on the engine wouldn't hurt too.

daniel renamed this task from SPARQL endpoint should gracefully handle loops in transitive properties to SPARQL endpoint should gracefully handle cycles and loops in transitive properties.Oct 25 2015, 5:55 PM
daniel updated the task description. (Show Details)
daniel set Security to None.
daniel updated the task description. (Show Details)Oct 25 2015, 5:58 PM
daniel updated the task description. (Show Details)
Jheald added a subscriber: Jheald.Oct 25 2015, 6:37 PM

Transitive properties aren't necessary a-cyclic -- for example, consider P460 "said to be the same as", which is transitive, currently used to link given names together, if they are considered to be variants of the same name in different languages.

The page at

https://www.wikidata.org/wiki/Wikidata:WikiProject_Names/given-name_variants

includes two columns of queries that find all the variants of each name (ie extract the whole equivalence class), using a P460* path query from a given starting name, and count up the number of occurrences of each one.

P460 is transitive, but it is not directed, and so there are huge numbers of loops in the equivalence classes. But this isn't a problem. The path search just keeps going until no new P460 link leads it to a node it has not already seen. So the engine can handles path loops, without a problem.

Similarly, looking at the query Daniel posted,

SELECT (COUNT(DISTINCT(?city)) AS ?count) WHERE {
  ?city wdt:P31/wdt:P279* wd:Q515 .    # find instances of subclasses of city
  ?city wdt:P131* wd:Q1202 .
}

if one comments out either one or the other of the two requirement lines, the query runs without a hitch. So I don't think there is anything to do with the nature of the data (eg whether it contains cycles or not) that affects whether or not Blazegraph can handle it.

The problem appears to arise with how Blazegraph combines the two requirements.

Related to this may be the fact that Blazegraph also fails with path searches of the form

SELECT (COUNT(DISTINCT ?a) AS ?count) WHERE { 
    BIND (wd:Q3305213 AS ?class) .    # paintings 
    ?a wdt:P31/wdt:P279* ?class .
}

even though the same search works perfectly if the variable ?class is eliminated, and wd:Q3305213 instead specified inline.

I filed the latter with Blazegraph as issue 1543 three weeks ago, but no commentary on it from them as yet

https://jira.blazegraph.com/browse/BLZG-1543
Deskana moved this task from Needs triage to WDQS on the Discovery board.Oct 27 2015, 10:27 AM
ksmith moved this task from WDQS to On Sprint Board on the Discovery board.Oct 27 2015, 3:52 PM

What's the status of this? The query above still errors out...

I believe a query like "instance of x in region y" is definitely something we should be able to answer.

I didn't have time for in-depth look into it as I was busy with the data duplication/update issues, but I did not forget about it. See also comments on https://jira.blazegraph.com/browse/BLZG-1543 - maybe trying to play with optimizer hints would make it better. It doesn't look like loops per se are the problem, as simple cases work, but there's definitely some query optimizer issue that breaks it for bigger loops. As soon as I am done with the duplication issue I'll get back to this.

Addshore moved this task from incoming to monitoring on the Wikidata board.Dec 4 2015, 1:18 PM
Smalyshev lowered the priority of this task from High to Medium.Mar 10 2016, 10:44 PM

BTW, looks like this query:

PREFIX gas: <http://www.bigdata.com/rdf/gas#>
SELECT ?item ?itemLabel  {
  SERVICE gas:service {
    gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.SSSP" ;
                gas:in wd:Q1202 ;
                gas:traversalDirection "Reverse" ;
                gas:out ?item ;
                gas:out1 ?depth ;
                gas:maxIterations 12 ;
                gas:linkType wdt:P131 .
  }
  ?item wdt:P31/wdt:P279* wd:Q515 .
  SERVICE wikibase:label {bd:serviceParam wikibase:language "en" }
}

works much better for the same.

Nikki added a subscriber: Nikki.Oct 11 2016, 7:13 PM

Earlier today abian on IRC mentioned that people are adding redundant P131 statements to get their queries to work.

One of the queries which is failing is this one.

I tried to figure out a way to get it to work and noticed that the selecting by P131 part takes less than a second and the selecting by P31 part takes a few seconds (usually between 2-5), but when I combine them together, it times out. I tried various things like subqueries, filters, turning off optimisation and nothing I tried would make it stop timing out. This makes no sense to me.

After finding this ticket, I tried the approach in the comment above and it can do the original query in a couple of seconds: see here

abian added a subscriber: abian.Oct 11 2016, 7:35 PM

Bump.

Some further interesting examples in this thread on Project Chat of searches of this form that work and others that don't.

A search for all buildings in the county of Bedfordshire with Grade I heritage status works; but a search for all churches in the county of Bedfordshire with Grade I heritage status fails -- despite churches being a subclass of buildings, and so part of what gets returned in the first search.

It's clearly an optimiser issue, but just what is the optimiser's decision process here, that flips between buildings and churches ?

These are the sorts of queries that are going to need to work, if WDQS searches are e.g. going to replace or supplement categories in the Structured Data for Commons project.

Smalyshev closed this task as Resolved.Mar 30 2017, 1:12 AM

This works now:

SELECT DISTINCT ?city WHERE {
  ?city wdt:P31/wdt:P279* wd:Q515 . # find instances of subclasses of city
  ?city wdt:P131* wd:Q1202 .
  hint:Prior hint:gearing "forward" .
}

See also https://wiki.blazegraph.com/wiki/index.php/QueryHints
Basically what is going on is that query optimizer doesn't always knows which way it should traverse the path queries, especially if there are many clauses. Playing with gearing hint allows to tell it the right way.

If the query doesn't work with any direction, please add it here and I'll take a look, but for now I think this resolves the task. The optimizer may be improved, but I imagine there will always be cases where we need to give some hints.