results of query.wikidata are unstable (besides caching issues)
Closed, Resolved | Public | BUG REPORT

Description

(see also https://www.wikidata.org/wiki/Wikidata:Report_a_technical_problem/WDQS_and_Search#query_does_not_reflect_current_state_(not_even_yesterday's_state) )

in short: mountain (Q8502) is a subclass (P279) of landform (Q271669). Querying for mountains thus should result in a subset of the corresponding query for all landforms (as any mountain also is a landform). This is not the case.

caveat: the behaviour is not reproducible over long time spans, as after some days things seem to get synchronized again. But the time span is far beyond anything that could be explained by replication lags, different caching servers with non-sync caches, etc.

Steps to replicate the issue (include links if applicable):

What happens?:

  • the first query lists Östliche Praxmarerkarspitze (Q67083874), while the second doesn't. The second is correct, as Q67083874 has P131 assigned (which is the main selection in the query above).
  • I did not manage to simplify the queries without losing the problematic behaviour or running into timeouts, sorry.

What should have happened instead?:
the second result set should be a subset of the first result set. Or, a warning or an error message should be displayed in case of some internally recognized bad intermediate state in the processing of the query (e.g. timeouts for single steps in the algorithm).
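The subset expectation can also be checked on the client, independently of the query service, by comparing the two result sets once they have been downloaded. A minimal Python sketch, assuming both results are available in the standard SPARQL JSON results format with an ?item variable; the helper names `item_ids` and `missing_from_superset` are illustrative only, and the sample data is hand-made, not real query output:

```python
def item_ids(sparql_json, var="item"):
    """Extract the Q-ids bound to `var` from a SPARQL JSON results document."""
    ids = set()
    for binding in sparql_json["results"]["bindings"]:
        if var in binding:
            # Entity URIs look like http://www.wikidata.org/entity/Q21878328
            ids.add(binding[var]["value"].rsplit("/", 1)[-1])
    return ids


def missing_from_superset(subset_ids, superset_ids):
    """Return the ids that break the expectation subset_ids <= superset_ids."""
    return sorted(set(subset_ids) - set(superset_ids))


# Hand-made sample data for illustration (not real query output).
mountains = {"results": {"bindings": [
    {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q21878328"}},
    {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q21878293"}},
]}}
landforms = {"results": {"bindings": [
    {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q21878328"}},
]}}

violations = missing_from_superset(item_ids(mountains), item_ids(landforms))
print(violations)  # ['Q21878293']
```

An empty list means the mountain results are contained in the landform results, as expected; any id that is printed is a mountain the landform query failed to return.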

as an overall consequence:

  • you cannot trust the results of wikidata queries in general.

Event Timeline

as things are stochastic:
Now the difference between landforms and mountains is in Hanauer Spitze https://www.wikidata.org/wiki/Q21878328 and Brunnkarspitze https://www.wikidata.org/wiki/Q21878293
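Since the affected items change from run to run, one way to document the instability is to save several runs of the same query and list the items that do not show up in every run. A small Python sketch, assuming each run has already been reduced to a set of Q-ids; `unstable_items` is a hypothetical helper and the sample runs are invented for illustration:

```python
def unstable_items(runs):
    """Return ids that appear in at least one run but not in all of them."""
    runs = [set(r) for r in runs]
    if not runs:
        return []
    seen_anywhere = set().union(*runs)
    seen_everywhere = set.intersection(*runs)
    return sorted(seen_anywhere - seen_everywhere)


# Invented run data: three executions of the same query.
runs = [
    {"Q67083874", "Q21878328", "Q21878293"},
    {"Q67083874", "Q21878328"},
    {"Q67083874", "Q21878293"},
]
print(unstable_items(runs))  # ['Q21878293', 'Q21878328']
```

With a stable backend and no edits to the underlying items between runs, the list would be expected to stay empty.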

@Herzi.Pinki sorry to see that this problem is hitting your query again. I still believe that this might be a bug in Blazegraph, possibly related to how it optimizes its query plan.
I think the section that causes Blazegraph the most trouble is the named subquery:

SELECT DISTINCT ?item WHERE {
    ?item wdt:P17 wd:Q40 ;
          wdt:P625 [] ;
          wdt:P31/wdt:P279* wd:Q271669 .
    minus {?item wdt:P31/wdt:P279* wd:Q46831 .}
    minus {?item wdt:P31/wdt:P279* wd:Q39816 .}
    filter not exists { ?item wdt:P131 [] }
}

I seem to obtain better performance by disabling the Blazegraph optimizer (hint:Query hint:optimizer "None". ):

SELECT DISTINCT ?item WHERE {
    hint:Query hint:optimizer "None".  
    ?item wdt:P17 wd:Q40 ;
          wdt:P625 [] ;
          wdt:P31/wdt:P279* wd:Q271669 .
    minus {?item wdt:P31/wdt:P279* wd:Q46831 .}
    minus {?item wdt:P31/wdt:P279* wd:Q39816 .}
    filter not exists { ?item wdt:P131 [] }
}

But telling Blazegraph to disable its optimizer uncovers yet another issue:
BIND(IF(EXISTS { ?item p:P18 [] }, '0000ff', 'ff0000') AS ?rgb) .
no longer appears to work correctly and has to be rewritten as:
BIND(IF(BOUND(?image), '0000ff', 'ff0000') AS ?rgb) .
reusing the ?image variable, which is bound in an OPTIONAL clause a couple of lines earlier.

I took the liberty to attempt a rewrite of your query as:

#defaultView:Map{"hide":"?rgb"}
SELECT ?item ?itemLabel ?itemDescription (GROUP_CONCAT(DISTINCT ?whereLabel; SEPARATOR=', ') AS ?whereLabels) (SAMPLE(?image) AS ?image) ?coord ?rgb ?layer WITH {
  SELECT DISTINCT ?item WHERE {
    hint:Query hint:optimizer "None".  
    ?item wdt:P17 wd:Q40 .
    ?item wdt:P625 [] .
    ?item wdt:P31/wdt:P279* wd:Q271669 . #Q35145263 . # Q271669 . #

    #?item wdt:P31/wdt:P279* wd:Q35509 .
    minus {?item wdt:P31/wdt:P279* wd:Q46831 .}

    filter not exists {
      ?item wdt:P131 ?wo
      }
  #minus {?item wdt:P31/wdt:P279* wd:Q27686 .}
  #minus {?item wdt:P31/wdt:P279* wd:Q1444 .}
  minus {?item wdt:P31/wdt:P279* wd:Q39816 .}
  }
} AS %subquery1 WHERE {
  INCLUDE %subquery1 .
  ?item wdt:P31 [] .
  ?item p:P625 ?coordStatement .
  ?coordStatement ps:P625 ?coord .
  #MINUS { ?coordStatement prov:wasDerivedFrom/pr:P143 wd:Q169514 } # imported from Wikimedia project: Swedish Wikipedia 
  #MINUS { ?coordStatement prov:wasDerivedFrom/pr:P143 wd:Q837615 } # imported from Wikimedia project: Cebuano Wikipedia 
  #MINUS { ?coordStatement prov:wasDerivedFrom/pr:P248 wd:Q1194038 } # stated in: GEOnet Names Server
  OPTIONAL {
    ?item wdt:P131 ?where .
    OPTIONAL {
      ?where rdfs:label ?whereLiteral .
      FILTER(LANG(?whereLiteral) = 'de') .
    }
  }
  BIND(IF(BOUND(?where), IF(BOUND(?whereLiteral), ?whereLiteral, STRAFTER(STR(?where), 'entity/')), 'no P131') AS ?whereLabel) .
  OPTIONAL { ?item wdt:P18 ?image }
  BIND(IF(BOUND(?image), '0000ff', 'ff0000') AS ?rgb) .
  BIND(IF(BOUND(?image), IF(BOUND(?where), 'With Image & P131', 'With Image but without P131'), IF(BOUND(?where), 'Without Image but with P131', 'Without Image and without P131')) AS ?layer) .
  SERVICE wikibase:label { bd:serviceParam wikibase:language '[AUTO_LANGUAGE,de,en]' }
} GROUP BY ?item ?itemLabel ?itemDescription ?whereLabels ?coord ?rgb ?layer

It does seem to be slightly faster. I also added two new layers to help select items with or without a P131 (hoping that this makes it easier to detect when a similar bug happens).

Please let us know if this rewritten query suits your needs and if it helps mitigate the issue you're experiencing.

@dcausse thanks for your investigations. Your query is 8 times faster than mine (optimizing is obviously not always the way to go) and it gives 165 matches instead of my query that still gives 171.

for me as a user of the frontend of wikidata query it is difficult to see what fails in the background, or even what is used in the background. Feel free to forward the issue to blazegraph. My problem seems to be solved by rewriting the query.

Gehel claimed this task.
Gehel subscribed.

I'm marking this as resolved as we have a working query. Blazegraph being unmaintained, reporting the issue upstream is not really helpful.

Houston, we have a maintenance problem!