Page MenuHomePhabricator

Bad query performance for FILTER NOT EXISTS
Closed, DeclinedPublic

Description

For updating the data in the graph, we use a query that deletes the old data. The query to delete reference values that are not used by any other statements, and it looks like this, with SELECT replaced with DELETE:

SELECT ?s ?p ?o
WHERE {
  <http://www.wikidata.org/entity/Q30> ?statementPred ?statement .
  FILTER( STRSTARTS(STR(?statement), "http://www.wikidata.org/entity/statement/") ) .
  ?statement <http://www.w3.org/ns/prov#wasDerivedFrom> ?ref .
  # Since references are shared we can only clear the values on them when they are no longer used
  # anywhere else.
  FILTER NOT EXISTS {
    ?otherStatement <http://www.w3.org/ns/prov#wasDerivedFrom> ?ref .
    ?otherEntity ?otherStatementPred ?otherStatement .
    FILTER ( ?otherEntity != <http://www.wikidata.org/entity/Q23>  ) .
  }
  ?ref ?expandedValuePred ?s .
  # Without this filter we'd try to delete stuff from entities.  For example that pattern above matches
  #   ref:_ v:P143 entity:Q328
  # so we'd try to clear everything from Q328 (enwiki).  So we filter where ?s is in the value prefix.
  FILTER( STRSTARTS(STR(?s), "http://www.wikidata.org/entity/value/") ) .
  ?s ?p ?o .
}

This query is very slow (had to kill it after 1+ minute). However, the query without FILTER NOT EXISTS runs under a second:

SELECT ?s ?p ?o
WHERE {
  <http://www.wikidata.org/entity/Q30> ?statementPred ?statement .
  FILTER( STRSTARTS(STR(?statement), "http://www.wikidata.org/entity/statement/") ) .
  ?statement <http://www.w3.org/ns/prov#wasDerivedFrom> ?ref .
  ?ref ?expandedValuePred ?s .
  # Without this filter we'd try to delete stuff from entities.  For example that pattern above matches
  #   ref:_ v:P143 entity:Q328
  # so we'd try to clear everything from Q328 (enwiki).  So we filter where ?s is in the value prefix.
  FILTER( STRSTARTS(STR(?s), "http://www.wikidata.org/entity/value/") ) .
  ?s ?p ?o .
}

This query produces 8 triples, belonging to 2 separate subjects. So the reason of the slowdown is FILTER NOT EXISTS. Interestingly enough, this query:

SELECT ?s ?p ?o
WHERE {
  <http://www.wikidata.org/entity/Q30> ?statementPred ?statement .
  FILTER( STRSTARTS(STR(?statement), "http://www.wikidata.org/entity/statement/") ) .
  ?statement <http://www.w3.org/ns/prov#wasDerivedFrom> ?ref .
  # Since references are shared we can only clear the values on them when they are no longer used
  # anywhere else.
  FILTER NOT EXISTS {
    ?otherStatement <http://www.w3.org/ns/prov#wasDerivedFrom> ?ref .
    ?otherEntity ?otherStatementPred ?otherStatement .
  }
  ?ref ?expandedValuePred ?s .
  # Without this filter we'd try to delete stuff from entities.  For example that pattern above matches
  #   ref:_ v:P143 entity:Q328
  # so we'd try to clear everything from Q328 (enwiki).  So we filter where ?s is in the value prefix.
  FILTER( STRSTARTS(STR(?s), "http://www.wikidata.org/entity/value/") ) .
  ?s ?p ?o .
}

note the internal != filter deleted - also is slow, though in theory it should be failing very fast, since without such filter it is clear FILTER NOT EXISTS contadicts the previous conditions and at least one data set satisfying that condition exists - it's the same data set we have from first three lines of the query.

So looks like these is some issue in processing FILTER NOT EXISTS here, somehow it is not optimal.
The data can be seen in db01 labs machine, in namespace wdq.

Event Timeline

Smalyshev raised the priority of this task from to High.
Smalyshev updated the task description. (Show Details)

Generally speaking, the FILTER NOT EXISTS query is quite challenging, it requires the evaluation of a composed graph pattern for each of the outer solutions. What’s missing here for the query to run efficiently is a distinct projection on the variables that are bound outside and projected in the FILTER NOT EXISTS clause, to eliminate redundant computations. This is inline with other issues we’re currently working on; we will address these issues as part of ticket http://trac.bigdata.com/ticket/1140 in the near future.

For the meantime, here’s a proposal for rewriting the query into (as we understand) an equivalent one that should run in the subseconds range (note that the query slightly differs from your query, e.g. does not contain the filters — we’ve got a different data set running here at the moment). The key idea is to use GROUP BY and HAVING COUNT = 1 to identify statements that are referenced only once, it should be straightforward to apply to the original query:

SELECT ?s ?p ?o
WITH {
  SELECT ?ref
  WHERE {
    <http://www.wikidata.org/entity/Q30> ?statementPred ?statement .
    ?statement <http://www.w3.org/ns/prov#wasDerivedFrom> ?ref . 
  }
  GROUP BY ?ref
  HAVING (COUNT(?statement)=1)
} AS %unreferenced
WHERE {
  INCLUDE %unreferenced
  ?ref ?expandedValuePred ?s .
  ?s ?p ?o .
}

The query uses the (non-standard) SPARQL extensions “WITH”, which allows enforcing the order in which subqueries are evaluated. Once we’ve tackled the optimization, you may switch back to the original query and execute that one more efficiently.

So we actually moved to a different model of updating, now the query we use to filter out the unused entries is:

DELETE { ?s ?p ?o } WHERE {
  VALUES ?s { %values% }
  # Since values are shared we can only clear the values on them when they are no longer used
  # anywhere else.
  FILTER NOT EXISTS { ?someEntity ?someStatementPred ?s .  }
  ?s ?p ?o .
}

(with %values% being replaced by the list of actual IRIs) and it seems to perform OK so far.
We still use this construct though:

DELETE {
  ?s ?p ?o .
}
WHERE {
 <http://www.wikidata.org/entity/Q30> ?statementPred ?s .
  FILTER( STRSTARTS(STR(?s), ""http://www.wikidata.org/entity/statement/") ) .
  ?s ?p ?o .
  FILTER NOT EXISTS {
    VALUES ( ?s ?p ?o ) {
      %valueStatements%
    }
  }
}

Where %valueStatements% is a set of new triples to be introduced later with INSERT, so if old & new data have same triples they won't be deleted and then inserted back.

We'll keep an eye on it and see if we have any more trouble.

This part could be expensive. Instead of having a prefix scan, it is
scanning all statements, materializing the Object position from the
dictionary, and then checking the string representation of that URI for a
match.

http://www.wikidata.org/entity/Q30 ?statementPred ?s .
FILTER( STRSTARTS(STR(?s), ""http://www.wikidata.org/entity/statement/") ) .

We should try to make this a prefix scan on the inline IV range for
http://www.wikidata.org/entity/statement/. This will require being smarter
about pushing down filters.

I have copied Michael who has been looking at how to improve our FILTER
handling.

Thanks,
Bryan


Bryan Thompson
Chief Scientist & Founder
SYSTAP, LLC
4501 Tower Road
Greensboro, NC 27410
bryan@systap.com
http://blazegraph.com
http://blog.bigdata.com http://bigdata.com
http://mapgraph.io

Blazegraph™ http://www.blazegraph.com/ is our ultra high-performance
graph database that supports both RDF/SPARQL and Tinkerpop/Blueprints
APIs. MapGraph™ http://www.systap.com/mapgraph is our disruptive new
technology to use GPUs to accelerate data-parallel graph analytics.

CONFIDENTIALITY NOTICE: This email and its contents and attachments are
for the sole use of the intended recipient(s) and are confidential or
proprietary to SYSTAP. Any unauthorized review, use, disclosure,
dissemination or copying of this email or its contents or attachments is
prohibited. If you have received this communication in error, please notify
the sender by reply email and permanently delete all copies of the email
and its contents and attachments.

Yes, the idea behind this query is to do prefix scan/filter on the IRIs. If there's some syntax we'd be better using to achieve the same we could change this query. In general, the amount of predicates from Q30 shouldn't be _that_ high (no more that hundreds) so it may not be that big of a deal right now.

Ok. At that number this will not matter. But it is not being as efficient
as you would hope. Something to work on.

Bryan

OK, added this as a comment to http://trac.bigdata.com/ticket/1083, which deals with the issue of filter pushing.

Manybubbles set Security to None.