
MWAPI query with LIMIT ignores MINUS
Closed, Resolved · Public

Description

@Daniel_Mietchen wanted to find scientific articles about the Zika virus without a corresponding “main subject” statement. This query (link) does the job:

SELECT DISTINCT ?item WHERE {
  hint:Query hint:optimizer "None".
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:api "Search";
                    wikibase:endpoint "www.wikidata.org";
                    mwapi:srsearch "zika haswbstatement:P31=Q13442814".
    ?title wikibase:apiOutput mwapi:title.
  }
  BIND(IRI(CONCAT(STR(wd:), ?title)) AS ?item)
  MINUS { ?item wdt:P921 wd:Q202864. }
}

(It now only returns very few results, because @Daniel_Mietchen added the missing “main subjects” statements with QuickStatements. Originally it was a bit over 150 items.)

However, when you add some LIMIT (link), the query suddenly returns many more results – it seems like the MINUS is being ignored for some reason.
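To check the discrepancy from a script rather than the web UI, something like the following Python sketch could be used. It is only an illustration: the helper names (`with_limit`, `count_results`), the User-Agent string, and the limit value of 1000 are assumptions not taken from this task, and `count_results` needs network access to the public query service.

```python
import json
import urllib.parse
import urllib.request

# Public WDQS SPARQL endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# The query from the task description, without any LIMIT clause.
BASE_QUERY = """SELECT DISTINCT ?item WHERE {
  hint:Query hint:optimizer "None".
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:api "Search";
                    wikibase:endpoint "www.wikidata.org";
                    mwapi:srsearch "zika haswbstatement:P31=Q13442814".
    ?title wikibase:apiOutput mwapi:title.
  }
  BIND(IRI(CONCAT(STR(wd:), ?title)) AS ?item)
  MINUS { ?item wdt:P921 wd:Q202864. }
}"""


def with_limit(query, limit=None):
    """Return the query unchanged, or with a LIMIT clause appended."""
    if limit is None:
        return query
    return f"{query}\nLIMIT {int(limit)}"


def count_results(query):
    """Run the query against WDQS and return the number of result rows."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        f"{WDQS_ENDPOINT}?{params}",
        # Hypothetical User-Agent; replace with one identifying your tool.
        headers={"User-Agent": "limit-minus-repro/0.1 (example script)"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        data = json.load(response)
    return len(data["results"]["bindings"])


# Usage (requires network access; 1000 is an arbitrary example limit):
#   count_results(with_limit(BASE_QUERY))
#   count_results(with_limit(BASE_QUERY, 1000))
```

If the bug is present, the second count comes out larger than the first, even though adding a LIMIT should never increase the number of rows.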

Event Timeline

CommunityTechBot renamed this task from hqaaaaaaaa to MWAPI query with LIMIT ignores MINUS. Jul 2 2018, 4:39 AM

I think there are two problems with the query:

  1. You probably want FILTER NOT EXISTS rather than MINUS here. See: https://www.w3.org/TR/2013/REC-sparql11-query-20130321/#neg-notexists-minus
  2. Note that wdt: captures only the preferred statements if any statement is preferred. So if you want to match all subjects, not only the preferred ones, use p:P921/ps:P921.

With these corrections, the query seems to work fine.

I can still reproduce the bug with your suggested changes. This query returns 3500 results:

SELECT DISTINCT ?item WHERE {
  hint:Query hint:optimizer "None".
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:api "Search";
                    wikibase:endpoint "www.wikidata.org";
                    mwapi:srsearch "zika haswbstatement:P31=Q13442814".
    ?title wikibase:apiOutput mwapi:title.
  }
  BIND(IRI(CONCAT(STR(wd:), ?title)) AS ?item)
  FILTER NOT EXISTS { ?item p:P921/ps:P921 wd:Q202864. }
}
LIMIT 10000

[Without the LIMIT](https://query.wikidata.org/#SELECT%20DISTINCT%20%3Fitem%20WHERE%20%7B%0A%20%20hint%3AQuery%20hint%3Aoptimizer%20%22None%22.%0A%20%20SERVICE%20wikibase%3Amwapi%20%7B%0A%20%20%20%20bd%3AserviceParam%20wikibase%3Aapi%20%22Search%22%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20wikibase%3Aendpoint%20%22www.wikidata.org%22%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20mwapi%3Asrsearch%20%22zika%20haswbstatement%3AP31%3DQ13442814%22.%0A%20%20%20%20%3Ftitle%20wikibase%3AapiOutput%20mwapi%3Atitle.%0A%20%20%7D%0A%20%20BIND%28IRI%28CONCAT%28STR%28wd%3A%29%2C%20%3Ftitle%29%29%20AS%20%3Fitem%29%0A%20%20FILTER%20NOT%20EXISTS%20%7B%20%3Fitem%20p%3AP921%2Fps%3AP921%20wd%3AQ202864.%20%7D%0A%7D), only 6 results are returned.

(I also don’t understand why the FILTER NOT EXISTS should make a difference – the two parts share the ?item variable, and whether it’s bound or not within the MINUS / FILTER NOT EXISTS block shouldn’t matter as far as I can tell.)

Smalyshev triaged this task as Medium priority. Jul 30 2018, 11:45 PM

> I also don’t understand why the FILTER NOT EXISTS should make a difference

If you do not have the filter, the result should contain all items. If you do, the result should contain only those that do not have p:P921/ps:P921 wd:Q202864. That's the difference.

> Without the LIMIT, only 6 results are returned.

Ah, this does seem to be a bug; the absence of a limit certainly shouldn't make a query return fewer results.

> If you do not have the filter, the result should contain all items. If you do, the result should contain only those that do not have p:P921/ps:P921 wd:Q202864. That's the difference.

What I meant was, why should MINUS be different from FILTER NOT EXISTS here? I know they’re different in some situations, but as far as I understand this should not be one of those situations.
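For reference, the distinction drawn in the SPARQL 1.1 section linked above can be mimicked with a toy model in Python. This is a deliberately simplified sketch, not how Blazegraph evaluates anything: the triple store, the single-triple-pattern matcher, and all function names are made up for illustration. MINUS drops a solution only if it is compatible with a right-hand solution and shares at least one variable with it; FILTER NOT EXISTS substitutes the current bindings into the pattern and tests for a match.

```python
def is_var(term):
    return term.startswith("?")


def match(pattern, triples, binding=None):
    """Yield every extension of `binding` under which `pattern` matches a triple."""
    binding = binding or {}
    for triple in triples:
        candidate = dict(binding)
        for p_term, t_term in zip(pattern, triple):
            if is_var(p_term):
                # Bind the variable, or reject if it is already bound differently.
                if candidate.setdefault(p_term, t_term) != t_term:
                    break
            elif p_term != t_term:
                break
        else:
            yield candidate


def compatible(a, b):
    """Two solutions are compatible if they agree on every shared variable."""
    return all(a[v] == b[v] for v in a.keys() & b.keys())


def minus(solutions, pattern, triples):
    """MINUS: evaluate the pattern on its own, then drop compatible solutions
    that share at least one variable with a right-hand solution."""
    rhs = list(match(pattern, triples))
    return [
        mu for mu in solutions
        if not any((mu.keys() & other.keys()) and compatible(mu, other) for other in rhs)
    ]


def filter_not_exists(solutions, pattern, triples):
    """FILTER NOT EXISTS: substitute each solution into the pattern and keep
    the solution only if nothing matches."""
    return [mu for mu in solutions if next(match(pattern, triples, mu), None) is None]


# Toy data: Q1 has the "main subject" statement, Q2 does not.
TRIPLES = [("Q1", "P921", "Q202864")]
CANDIDATES = [{"?item": "Q1"}, {"?item": "Q2"}]

shared = ("?item", "P921", "Q202864")     # shares ?item with the candidates
disjoint = ("?other", "P921", "Q202864")  # shares no variable with them
```

In this model, `minus(CANDIDATES, shared, TRIPLES)` and `filter_not_exists(CANDIDATES, shared, TRIPLES)` both keep only Q2, which matches the expectation stated above for a shared `?item`. With the `disjoint` pattern they diverge: MINUS removes nothing, while FILTER NOT EXISTS removes everything.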

The difference seems to arise because Blazegraph chooses different query plans with and without the limit. Without limit:

com.bigdata.bop.rdf.join.ChunkedMaterializationOp[17](ProjectionOp[16])[ ChunkedMaterializationOp.vars=[item], IPredicate.relationName=[wdq.lex], IPredicate.timestamp=1542326388339, ChunkedMaterializationOp.materializeAll=true, PipelineOp.sharedState=true, BOp.bopId=17, BOp.timeout=600000, BOp.namespace=wdq, QueryEngine.queryId=8bd358b5-e0f1-4a64-9776-2e1eb4c45479, QueryEngine.chunkHandler=com.bigdata.bop.engine.ManagedHeapStandloneChunkHandler@6e28fe95]
com.bigdata.bop.solutions.ProjectionOp[16](JVMDistinctBindingSetsOp[15])[ BOp.bopId=16, BOp.evaluationContext=CONTROLLER, PipelineOp.sharedState=true, JoinAnnotations.select=[item]]
com.bigdata.bop.solutions.JVMDistinctBindingSetsOp[15](JVMSolutionSetHashJoinOp[14])[ BOp.bopId=15, HashJoinAnnotations.joinVars=[item], BOp.evaluationContext=CONTROLLER, PipelineOp.sharedState=true]
com.bigdata.bop.join.JVMSolutionSetHashJoinOp[14](PipelineJoin[13])[ BOp.bopId=14, BOp.evaluationContext=CONTROLLER, PipelineOp.maxParallel=1, PipelineOp.sharedState=true, JoinAnnotations.constraints=null, SolutionSetHashJoinOp.release=true, PipelineOp.lastPass=true, namedSetRef=NamedSolutionSetRef{localName=--set-8,queryId=8bd358b5-e0f1-4a64-9776-2e1eb4c45479,joinVars=[]}]
com.bigdata.bop.join.PipelineJoin[13](JVMDistinctBindingSetsOp[10])[ BOp.bopId=13, JoinAnnotations.constraints=null, AST2BOpBase.simpleJoin=true, BOp.evaluationContext=ANY, AccessPathJoinAnnotations.predicate=com.bigdata.rdf.spo.SPOPredicate[11](item=null, Vocab(6)[http://www.wikidata.org/prop/direct/P]:XSDUnsignedShort(921), Vocab(2)[http://www.wikidata.org/entity/Q]:XSDUnsignedInt(202864), --anon-12=null)[ IPredicate.relationName=[wdq.spo], IPredicate.timestamp=1542326388339, BOp.bopId=11, AST2BOpBase.estimatedCardinality=4006, AST2BOpBase.originalIndex=POS, IPredicate.flags=[KEYS,VALS,READONLY,PARALLEL]]]
com.bigdata.bop.solutions.JVMDistinctBindingSetsOp[10](HashIndexOp[9])[ BOp.bopId=10, HashJoinAnnotations.joinVars=[item], BOp.evaluationContext=CONTROLLER, PipelineOp.sharedState=true]
com.bigdata.bop.join.HashIndexOp[9](VariableUnificationOp[7])[ BOp.bopId=9, BOp.evaluationContext=CONTROLLER, PipelineOp.maxParallel=1, PipelineOp.lastPass=true, PipelineOp.sharedState=true, JoinAnnotations.joinType=NotExists, HashJoinAnnotations.joinVars=[], bindingSets=null, HashJoinAnnotations.outputDistinctJVs=false, JoinAnnotations.constraints=null, HashJoinAnnotations.askVar=null, HashIndexOpBase.utilFactory=com.bigdata.bop.join.JVMHashJoinUtility$1@3fcfb833, namedSetRef=NamedSolutionSetRef{localName=--set-8,queryId=8bd358b5-e0f1-4a64-9776-2e1eb4c45479,joinVars=[]}, IPredicate.relationName=[wdq.lex]]
com.bigdata.bop.rdf.join.VariableUnificationOp[7](MockTermResolverOp[6])[ VariableUnificationOp.vars=[item, edf7fbc0-5023-49c0-902a-2de5ba7473c4], BOp.bopId=7]
com.bigdata.bop.rdf.join.MockTermResolverOp[6](ConditionalRoutingOp[3])[ MockTermResolverOp.vars=[edf7fbc0-5023-49c0-902a-2de5ba7473c4], IPredicate.relationName=[wdq.lex], IPredicate.timestamp=1542326388339, PipelineOp.sharedState=true, BOp.bopId=6]
com.bigdata.bop.bset.ConditionalRoutingOp[3](ChunkedMaterializationOp[5])[ BOp.bopId=3, ConditionalRoutingOp.condition=com.bigdata.rdf.internal.constraints.ProjectedConstraint(com.bigdata.rdf.internal.constraints.ConditionalBind(edf7fbc0-5023-49c0-902a-2de5ba7473c4,com.bigdata.rdf.internal.constraints.IriBOp(com.bigdata.rdf.internal.constraints.ConcatBOp(TermId(0L)[http://www.wikidata.org/entity/],title)[ IVValueExpression.namespace=wdq.lex, IVValueExpression.timestamp=1542326388339])[ IVValueExpression.namespace=wdq.lex, IVValueExpression.timestamp=1542326388339, IriBOp.baseURI=http://localhost:9999/bigdata/namespace/wdq/sparql])[ ConditionalBind.projection=false])]
com.bigdata.bop.rdf.join.ChunkedMaterializationOp[5](ConditionalRoutingOp[4])[ ChunkedMaterializationOp.vars=[title], IPredicate.relationName=[wdq.lex], IPredicate.timestamp=1542326388339, ChunkedMaterializationOp.materializeAll=false, PipelineOp.sharedState=true, PipelineOp.reorderSolutions=true, PipelineOp.maxParallel=5, BOp.bopId=5]
com.bigdata.bop.bset.ConditionalRoutingOp[4](MockTermResolverOp[2])[ BOp.bopId=4, ConditionalRoutingOp.condition=com.bigdata.rdf.internal.constraints.SPARQLConstraint(com.bigdata.rdf.internal.constraints.NeedsMaterializationBOp(com.bigdata.rdf.internal.constraints.IriBOp(com.bigdata.rdf.internal.constraints.ConcatBOp(TermId(0L)[http://www.wikidata.org/entity/],title)[ IVValueExpression.namespace=wdq.lex, IVValueExpression.timestamp=1542326388339])[ IVValueExpression.namespace=wdq.lex, IVValueExpression.timestamp=1542326388339, IriBOp.baseURI=http://localhost:9999/bigdata/namespace/wdq/sparql])), PipelineOp.altSinkRef=3]
com.bigdata.bop.rdf.join.MockTermResolverOp[2](ServiceCallJoin[1])[ MockTermResolverOp.vars=[title], IPredicate.relationName=[wdq.lex], IPredicate.timestamp=1542326388339, PipelineOp.sharedState=true, BOp.bopId=2]
com.bigdata.bop.controller.ServiceCallJoin[1]()[ BOp.bopId=1, BOp.evaluationContext=CONTROLLER, PipelineOp.pipelined=false, PipelineOp.sharedState=true, ServiceCallJoin.serviceNode=
SERVICE <ConstantNode(TermId(0U)[http://wikiba.se/ontology#mwapi])> {
  JoinGroupNode [optimizer=None] {
    StatementPatternNode(ConstantNode(TermId(0U)[http://www.bigdata.com/rdf#serviceParam]), ConstantNode(TermId(0U)[http://wikiba.se/ontology#api]), ConstantNode(TermId(0L)[Search])) [scope=DEFAULT_CONTEXTS]
    StatementPatternNode(ConstantNode(TermId(0U)[http://www.bigdata.com/rdf#serviceParam]), ConstantNode(TermId(0U)[http://wikiba.se/ontology#endpoint]), ConstantNode(TermId(0L)[www.wikidata.org])) [scope=DEFAULT_CONTEXTS]
    StatementPatternNode(ConstantNode(TermId(0U)[http://www.bigdata.com/rdf#serviceParam]), ConstantNode(TermId(0U)[https://www.mediawiki.org/ontology#API/srsearch]), ConstantNode(TermId(0L)[zika haswbstatement:P31=Q13442814])) [scope=DEFAULT_CONTEXTS]
    StatementPatternNode(VarNode(title), ConstantNode(TermId(0U)[http://wikiba.se/ontology#apiOutput]), ConstantNode(TermId(0U)[https://www.mediawiki.org/ontology#API/title])) [scope=DEFAULT_CONTEXTS]
  }
}, ServiceCallJoin.namespace=wdq, ServiceCallJoin.timestamp=1542326388339, HashJoinAnnotations.joinVars=[], JoinAnnotations.constraints=null]

With limit:

com.bigdata.bop.solutions.ProjectionOp[14](JVMDistinctBindingSetsOp[13])[ BOp.bopId=14, BOp.evaluationContext=CONTROLLER, PipelineOp.sharedState=true, JoinAnnotations.select=[item]]
com.bigdata.bop.solutions.JVMDistinctBindingSetsOp[13](PipelinedHashIndexAndSolutionSetJoinOp[12])[ BOp.bopId=13, HashJoinAnnotations.joinVars=[item], BOp.evaluationContext=CONTROLLER, PipelineOp.sharedState=true]
com.bigdata.bop.join.PipelinedHashIndexAndSolutionSetJoinOp[12](VariableUnificationOp[7])[ BOp.bopId=12, BOp.evaluationContext=CONTROLLER, PipelineOp.maxParallel=1, PipelineOp.lastPass=true, PipelineOp.sharedState=true, JoinAnnotations.joinType=NotExists, HashJoinAnnotations.joinVars=[], bindingSets=null, PipelinedHashIndexAndSolutionSetJoinOp.projectInVars=[item], JoinAnnotations.constraints=null, HashJoinAnnotations.askVar=null, HashIndexOpBase.utilFactory=com.bigdata.bop.join.JVMPipelinedHashJoinUtility$1@4db4b7e5, namedSetRef=NamedSolutionSetRef{localName=--set-8,namespace=null,timestamp=-1,joinVars=[]}, SubqueryAnnotations.subquery=com.bigdata.bop.join.PipelineJoin[11]()[ BOp.bopId=11, JoinAnnotations.constraints=null, AST2BOpBase.simpleJoin=true, BOp.evaluationContext=ANY, AccessPathJoinAnnotations.predicate=com.bigdata.rdf.spo.SPOPredicate[9](item=null, Vocab(6)[http://www.wikidata.org/prop/direct/P]:XSDUnsignedShort(921), Vocab(2)[http://www.wikidata.org/entity/Q]:XSDUnsignedInt(202864), --anon-10=null)[ IPredicate.relationName=[wdq.spo], IPredicate.timestamp=1542324240635, BOp.bopId=9, AST2BOpBase.estimatedCardinality=4006, AST2BOpBase.originalIndex=POS, IPredicate.flags=[KEYS,VALS,READONLY,PARALLEL]], BOp.namespace=wdq], IPredicate.relationName=[wdq.lex]]
@com.bigdata.bop.controller.SubqueryAnnotations.subquery:
com.bigdata.bop.join.PipelineJoin[11]()[ BOp.bopId=11, JoinAnnotations.constraints=null, AST2BOpBase.simpleJoin=true, BOp.evaluationContext=ANY, AccessPathJoinAnnotations.predicate=com.bigdata.rdf.spo.SPOPredicate[9](item=null, Vocab(6)[http://www.wikidata.org/prop/direct/P]:XSDUnsignedShort(921), Vocab(2)[http://www.wikidata.org/entity/Q]:XSDUnsignedInt(202864), --anon-10=null)[ IPredicate.relationName=[wdq.spo], IPredicate.timestamp=1542324240635, BOp.bopId=9, AST2BOpBase.estimatedCardinality=4006, AST2BOpBase.originalIndex=POS, IPredicate.flags=[KEYS,VALS,READONLY,PARALLEL]], BOp.namespace=wdq]
com.bigdata.bop.rdf.join.VariableUnificationOp[7](MockTermResolverOp[6])[ VariableUnificationOp.vars=[item, 466e79b2-977c-47b0-8e2e-e689b236f1d3], BOp.bopId=7]
com.bigdata.bop.rdf.join.MockTermResolverOp[6](ConditionalRoutingOp[3])[ MockTermResolverOp.vars=[466e79b2-977c-47b0-8e2e-e689b236f1d3], IPredicate.relationName=[wdq.lex], IPredicate.timestamp=1542324240635, PipelineOp.sharedState=true, BOp.bopId=6]
com.bigdata.bop.bset.ConditionalRoutingOp[3](ChunkedMaterializationOp[5])[ BOp.bopId=3, ConditionalRoutingOp.condition=com.bigdata.rdf.internal.constraints.ProjectedConstraint(com.bigdata.rdf.internal.constraints.ConditionalBind(466e79b2-977c-47b0-8e2e-e689b236f1d3,com.bigdata.rdf.internal.constraints.IriBOp(com.bigdata.rdf.internal.constraints.ConcatBOp(TermId(0L)[http://www.wikidata.org/entity/],title)[ IVValueExpression.namespace=wdq.lex, IVValueExpression.timestamp=1542324240635])[ IVValueExpression.namespace=wdq.lex, IVValueExpression.timestamp=1542324240635, IriBOp.baseURI=http://localhost:9999/bigdata/namespace/wdq/sparql])[ ConditionalBind.projection=false])]
com.bigdata.bop.rdf.join.ChunkedMaterializationOp[5](ConditionalRoutingOp[4])[ ChunkedMaterializationOp.vars=[title], IPredicate.relationName=[wdq.lex], IPredicate.timestamp=1542324240635, ChunkedMaterializationOp.materializeAll=false, PipelineOp.sharedState=true, PipelineOp.reorderSolutions=true, PipelineOp.maxParallel=5, BOp.bopId=5]
com.bigdata.bop.bset.ConditionalRoutingOp[4](MockTermResolverOp[2])[ BOp.bopId=4, ConditionalRoutingOp.condition=com.bigdata.rdf.internal.constraints.SPARQLConstraint(com.bigdata.rdf.internal.constraints.NeedsMaterializationBOp(com.bigdata.rdf.internal.constraints.IriBOp(com.bigdata.rdf.internal.constraints.ConcatBOp(TermId(0L)[http://www.wikidata.org/entity/],title)[ IVValueExpression.namespace=wdq.lex, IVValueExpression.timestamp=1542324240635])[ IVValueExpression.namespace=wdq.lex, IVValueExpression.timestamp=1542324240635, IriBOp.baseURI=http://localhost:9999/bigdata/namespace/wdq/sparql])), PipelineOp.altSinkRef=3]
com.bigdata.bop.rdf.join.MockTermResolverOp[2](ServiceCallJoin[1])[ MockTermResolverOp.vars=[title], IPredicate.relationName=[wdq.lex], IPredicate.timestamp=1542324240635, PipelineOp.sharedState=true, BOp.bopId=2]
com.bigdata.bop.controller.ServiceCallJoin[1]()[ BOp.bopId=1, BOp.evaluationContext=CONTROLLER, PipelineOp.pipelined=false, PipelineOp.sharedState=true, ServiceCallJoin.serviceNode=
SERVICE <ConstantNode(TermId(0U)[http://wikiba.se/ontology#mwapi])> {
  JoinGroupNode [optimizer=None] {
    StatementPatternNode(ConstantNode(TermId(0U)[http://www.bigdata.com/rdf#serviceParam]), ConstantNode(TermId(0U)[http://wikiba.se/ontology#api]), ConstantNode(TermId(0L)[Search])) [scope=DEFAULT_CONTEXTS]
    StatementPatternNode(ConstantNode(TermId(0U)[http://www.bigdata.com/rdf#serviceParam]), ConstantNode(TermId(0U)[http://wikiba.se/ontology#endpoint]), ConstantNode(TermId(0L)[www.wikidata.org])) [scope=DEFAULT_CONTEXTS]
    StatementPatternNode(ConstantNode(TermId(0U)[http://www.bigdata.com/rdf#serviceParam]), ConstantNode(TermId(0U)[https://www.mediawiki.org/ontology#API/srsearch]), ConstantNode(TermId(0L)[zika haswbstatement:P31=Q13442814])) [scope=DEFAULT_CONTEXTS]
    StatementPatternNode(VarNode(title), ConstantNode(TermId(0U)[http://wikiba.se/ontology#apiOutput]), ConstantNode(TermId(0U)[https://www.mediawiki.org/ontology#API/title])) [scope=DEFAULT_CONTEXTS]
  }
}, ServiceCallJoin.namespace=wdq, ServiceCallJoin.timestamp=1542324240635, HashJoinAnnotations.joinVars=[], JoinAnnotations.constraints=null]

The difference is JVMSolutionSetHashJoinOp vs. PipelinedHashIndexAndSolutionSetJoinOp. The former seems to process MINUS correctly, but the latter for some reason does not.
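Plans this long are hard to compare by eye, so here is a rough Python sketch that pulls the operator class names out of two plan dumps and diffs them. The regular expression and the function name `operator_names` are my own assumptions about the dump format, and the two strings are heavily abbreviated excerpts of the plans above (only the operator headers around the join are kept), not the full output.

```python
import re

# Matches operator headers such as
# "com.bigdata.bop.join.PipelineJoin[13](" and captures the class name.
OPERATOR = re.compile(r"com\.bigdata\.bop(?:\.\w+)*\.(\w+)\[\d+\]\(")


def operator_names(plan_text):
    """Return the set of operator class names that appear in a plan dump."""
    return set(OPERATOR.findall(plan_text))


# Abbreviated excerpts of the two plans (annotations trimmed for brevity).
WITHOUT_LIMIT = """
com.bigdata.bop.solutions.JVMDistinctBindingSetsOp[15](JVMSolutionSetHashJoinOp[14])[ BOp.bopId=15 ]
com.bigdata.bop.join.JVMSolutionSetHashJoinOp[14](PipelineJoin[13])[ BOp.bopId=14 ]
com.bigdata.bop.join.PipelineJoin[13](JVMDistinctBindingSetsOp[10])[ BOp.bopId=13 ]
com.bigdata.bop.join.HashIndexOp[9](VariableUnificationOp[7])[ BOp.bopId=9 ]
com.bigdata.bop.rdf.join.VariableUnificationOp[7](MockTermResolverOp[6])[ BOp.bopId=7 ]
"""

WITH_LIMIT = """
com.bigdata.bop.solutions.JVMDistinctBindingSetsOp[13](PipelinedHashIndexAndSolutionSetJoinOp[12])[ BOp.bopId=13 ]
com.bigdata.bop.join.PipelinedHashIndexAndSolutionSetJoinOp[12](VariableUnificationOp[7])[ BOp.bopId=12 ]
com.bigdata.bop.join.PipelineJoin[11]()[ BOp.bopId=11 ]
com.bigdata.bop.rdf.join.VariableUnificationOp[7](MockTermResolverOp[6])[ BOp.bopId=7 ]
"""

only_without_limit = operator_names(WITHOUT_LIMIT) - operator_names(WITH_LIMIT)
only_with_limit = operator_names(WITH_LIMIT) - operator_names(WITHOUT_LIMIT)

print("only without LIMIT:", sorted(only_without_limit))
print("only with LIMIT:   ", sorted(only_with_limit))
```

On these excerpts the plan without LIMIT uniquely contains HashIndexOp and JVMSolutionSetHashJoinOp, and the plan with LIMIT uniquely contains PipelinedHashIndexAndSolutionSetJoinOp, consistent with the observation above.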