
Duplicate rows in Wikidata sparql
Closed, Resolved, Public

Description

I am getting duplicate rows for a specific SPARQL query on query.wikidata.org. The duplicates seem to arise when a value in a statement's references has been edited. This may be a Blazegraph issue, or it may be my misunderstanding of SPARQL.

For a query that shows the issue, see https://www.mediawiki.org/wiki/Talk:Wikidata_query_service and the query below:

prefix pr: <http://www.wikidata.org/prop/reference/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
PREFIX q: <http://www.wikidata.org/prop/qualifier/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prov: <http://www.w3.org/ns/prov#> 

SELECT ?work ?workLabel ?author ?authorLabel ?genre ?location ?locationLabel ?geo ?location_statement ?citat  WHERE {
  ?work wdt:P31/wdt:P279* wd:Q386724 . 
  ?work wdt:P50 ?author .
  ?work p:P840 ?location_statement .
  ?location_statement v:P840 ?location .
  ?location wdt:P17 wd:Q35 .
  ?location wdt:P625 ?geo . 
  OPTIONAL {
    ?location_statement prov:wasDerivedFrom ?ref .
    ?ref pr:P1683 ?citat .
  }

  SERVICE wikibase:label { bd:serviceParam wikibase:language "da,en" . }
}

Event Timeline

Fnielsen raised the priority of this task to Needs Triage.
Fnielsen updated the task description.
Fnielsen added a project: Wikidata.
Fnielsen subscribed.

I have a feeling I have just run into something related:

If I start with a reference that has the hash 7b7e7e8cce78cd60ee82a71ffa3dc6b063a9d37c, which is unique on Wikibase, and run:

SELECT ?s WHERE {?s prov:wasDerivedFrom wdref:7b7e7e8cce78cd60ee82a71ffa3dc6b063a9d37c}

I get 1 result.

If I then change this reference, meaning it gets a new hash, for example eed91b2be4de94dc395e4126a2386bd257e1865e, which I have also made unique on Wikibase, and run the query with the new hash:

SELECT ?s WHERE {?s prov:wasDerivedFrom wdref:eed91b2be4de94dc395e4126a2386bd257e1865e}

I get 1 result, as expected.
However, if I go back and run the original query with the old hash, which no longer exists, I also get 1 result.

@Addshore I'm looking at https://www.wikidata.org/wiki/Special:EntityData/Q4115189.ttl?flavor=dump now and I do not see any refs with hash eed91b2be4de94dc395e4126a2386bd257e1865e, but I do see one with 7b7e7e8cce78cd60ee82a71ffa3dc6b063a9d37c. Are you sure your change was saved?

I have made a new query:

prefix pr: <http://www.wikidata.org/prop/reference/>
prefix prov: <http://www.w3.org/ns/prov#>
prefix wdref: <http://www.wikidata.org/reference/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>  
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
PREFIX q: <http://www.wikidata.org/prop/qualifier/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?work ?workLabel ?author ?authorLabel ?location ?locationLabel ?geo ?location_statement ?ref ?citat  WHERE {
?work wdt:P31/wdt:P279* wd:Q386724 . 
?work wdt:P50 ?author .
?work p:P840 ?location_statement .
?location_statement v:P840 ?location .
?location wdt:P17 wd:Q35 .
?location wdt:P625 ?geo . 
OPTIONAL {
  ?location_statement prov:wasDerivedFrom ?ref .
  ?ref pr:P1683 ?citat .
}

SERVICE wikibase:label { bd:serviceParam wikibase:language "da,en" . }
}

This one also returns the wdrefs. Note that there are some 'ordinary' duplicates, but I found one of the problematic duplicates:

wdref:d50e710b77340bcfbb2c3c32ae96faa8f066d625
"Jeg står ude i Tingbjerg på Ruten. Der er en mand, der er blevet skudt og dræbt.

and its duplicate:

wdref:c72f15c267545dcd6fb44bf7ad8c74df1fdb693d
"Jeg står ude i Tingbjerg på Ruten. Der er en mand, der er blevet skudt og dræbt. ..."

The information is in wd:Q21089337

https://www.wikidata.org/wiki/Special:EntityData/Q21089337.ttl?flavor=dump

I find :c72 in there, but :d50 is not there, so Wikidata itself is OK. Perhaps the update sent to the SPARQL server does not erase the old reference?
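The comparison above (hashes returned by the query service versus hashes present in the entity export) can be automated. The following is only a minimal sketch: the function name and the one-line sample Turtle are hypothetical, and it assumes that any 40-character hex token in the export is a reference hash.

```python
import re

# 40 hex characters: the shape of the reference hashes quoted in this thread.
HASH_RE = re.compile(r"\b[0-9a-f]{40}\b")


def stale_reference_hashes(query_hashes, entity_ttl):
    """Return hashes the query service returned but the entity export lacks."""
    present = set(HASH_RE.findall(entity_ttl))
    return [h for h in query_hashes if h not in present]


# Hypothetical one-line stand-in for the Q21089337 Turtle export.
sample_ttl = (
    "wds:example-statement prov:wasDerivedFrom "
    "wdref:c72f15c267545dcd6fb44bf7ad8c74df1fdb693d ."
)
from_query = [
    "d50e710b77340bcfbb2c3c32ae96faa8f066d625",
    "c72f15c267545dcd6fb44bf7ad8c74df1fdb693d",
]
print(stale_reference_hashes(from_query, sample_ttl))
```

With the real export text in place of `sample_ttl`, any hash it prints is one the query service still knows about although the entity no longer contains it.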

Thanks for the tricks.

Maybe I should note that the DISTINCT keyword clears the ordinary duplicates from my query result, but not problematic ones.

@Smalyshev sorry the example wasn't very concrete, I'll reproduce and link the steps here now!

On https://www.wikidata.org/w/index.php?title=Q4115189&oldid=268455300 the ref for the qualifier Q4115189$1154d042-4192-2895-2590-3a94aa96f747 has a hash of da74841d6aa4a1265212fdab587b27aaf35717d8.

When querying SELECT ?s WHERE {?s prov:wasDerivedFrom wdref:da74841d6aa4a1265212fdab587b27aaf35717d8} I get 1 result, as expected.

Then this edit is made https://www.wikidata.org/w/index.php?title=Q4115189&diff=268471599&oldid=268455300 which changes the hash to b75b8d10cacdd56b4e5b9e140f063c3086185bd9

Running the query with the new hash returns the statement GUID:
SELECT ?s WHERE {?s prov:wasDerivedFrom wdref:b75b8d10cacdd56b4e5b9e140f063c3086185bd9}
So does the original query (with the old hash).

Could this be because the query service lags more when removing things than when adding them?
I mean, right after changing the reference I could query the new one immediately, but I could also still find the old one.

Deleting the reference removes both entries from the query service within a few seconds however.

I'm having a hard time reproducing it... Maybe it's caused by the result of the query being cached by Varnish, the browser, or some intermediate network cache, so when you ask for the old hash a second time you get a cached result? Could you please change the query a little (like inserting whitespace or changing variable names, etc.) and see if it still finds the old one?
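One way to make every request differ is to append a unique SPARQL comment to the query before building the URL. This is just a sketch: the helper name and the nonce comment are made up here, and it assumes the cache keys on the full request URL. The endpoint is the one from the URLs quoted in this thread.

```python
import uuid
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint as it appears in the URLs quoted in this thread.
ENDPOINT = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"


def cache_busting_url(query):
    """Append a unique SPARQL comment so each request has a distinct URL."""
    nonce = uuid.uuid4().hex
    return ENDPOINT + "?" + urlencode({"query": query + "\n# nonce " + nonce})


q = (
    "PREFIX prov: <http://www.w3.org/ns/prov#>\n"
    "PREFIX wdref: <http://www.wikidata.org/reference/>\n"
    "SELECT ?s WHERE {?s prov:wasDerivedFrom "
    "wdref:da74841d6aa4a1265212fdab587b27aaf35717d8}"
)
first, second = cache_busting_url(q), cache_busting_url(q)
print(first != second)
```

The comment does not change what the query means, so any difference in results between two such requests cannot come from a cached response for an identical URL.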

So, doing the same steps again, the reference with hash c3b41921174b34522c3bca5fd602c1f4d57ecbcd has been removed / changed.

https://query.wikidata.org/bigdata/namespace/wdq/sparql?query=prefix+prov%3A+%3Chttp%3A%2F%2Fwww.w3.org%2Fns%2Fprov%23%3E%0D%0Aprefix+wdref%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Freference%2F%3E%0D%0APREFIX+wd%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0D%0APREFIX+wdt%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0D%0APREFIX+wikibase%3A+%3Chttp%3A%2F%2Fwikiba.se%2Fontology%23%3E%0D%0APREFIX+p%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2F%3E%0D%0APREFIX+v%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fstatement%2F%3E%0D%0APREFIX+q%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fqualifier%2F%3E%0D%0APREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0D%0A%0D%0ASELECT+%3Fs+WHERE+%7B%3Fs+prov%3AwasDerivedFrom+wdref%3Ac3b41921174b34522c3bca5fd602c1f4d57ecbcd%7D

Still returns it.
I then changed one of the prefixes to try and avoid any cache hits etc.
prov -> provvv

https://query.wikidata.org/bigdata/namespace/wdq/sparql?query=prefix+provvv%3A+%3Chttp%3A%2F%2Fwww.w3.org%2Fns%2Fprov%23%3E%0D%0Aprefix+wdref%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Freference%2F%3E%0D%0APREFIX+wd%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0D%0APREFIX+wdt%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0D%0APREFIX+wikibase%3A+%3Chttp%3A%2F%2Fwikiba.se%2Fontology%23%3E%0D%0APREFIX+p%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2F%3E%0D%0APREFIX+v%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fstatement%2F%3E%0D%0APREFIX+q%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fqualifier%2F%3E%0D%0APREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0D%0A%0D%0ASELECT+%3Fs+WHERE+%7B%3Fs+provvv%3AwasDerivedFrom+wdref%3Ac3b41921174b34522c3bca5fd602c1f4d57ecbcd%7D

Per the response headers of the second request:

x-cache:cp1069 miss (0)
x-varnish:391682154
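To check this kind of header programmatically, a small helper could look like the one below. It is a sketch built on assumptions: the per-entry layout "<host> <hit|miss> (<count>)" is inferred from the single value quoted above, and the comma-separated multi-layer form and the "hit" values in the examples are hypothetical.

```python
def served_from_cache(x_cache):
    """Return True if any cache layer listed in an X-Cache value reports a hit."""
    for entry in x_cache.split(","):
        fields = entry.split()
        # Assumed layout per entry: "<host> <hit|miss> (<count>)".
        if len(fields) >= 2 and fields[1].lower() == "hit":
            return True
    return False


# The value from the response headers above.
print(served_from_cache("cp1069 miss (0)"))
```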

If you ping me on IRC I can try and reproduce it with you?

I note that labels may also have duplicates. If I recall my experiences correctly these duplicates tend to go away over time.

I am presently experiencing it for this query:

select ?feeling ?feelingLabel where {
  ?feeling wdt:P31 wd:Q9415 .
   SERVICE wikibase:label { bd:serviceParam wikibase:language "da,en" . } 
  }

I get "Lykke, lykke" and "Hovmod, hovmod" after my editing of the Danish label.

So I just thought I would give the following query a go, which I saw on some page somewhere:

prefix schema: <http://schema.org/>
SELECT * WHERE {<http://www.wikidata.org> schema:dateModified ?y}

As far as I know this gives me the last modified date for the data.

If I run it once I get 2015-11-21T17:37:49.000Z. If I run it again and again I get exactly the same result.
If I add a little whitespace the result updates, so it looks like the issue that I reported could indeed be a caching issue?

After re-running the unchanged query for 2 minutes, it looks like it was then actually executed again, and the result updated.
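The timestamp from the schema:dateModified query above can be turned into a lag figure, which makes it easier to tell a stale (cached) answer from a fresh one. A minimal sketch, assuming the timestamp always has the millisecond-plus-Z form shown above; the function names and the "current time" in the example are hypothetical.

```python
from datetime import datetime, timezone


def parse_date_modified(value):
    """Parse a timestamp of the form 2015-11-21T17:37:49.000Z as UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def lag_seconds(date_modified, now):
    """Seconds between the reported modification time and `now`."""
    return (now - parse_date_modified(date_modified)).total_seconds()


# Hypothetical "current time", two minutes after the value quoted above.
now = datetime(2015, 11, 21, 17, 39, 49, tzinfo=timezone.utc)
print(lag_seconds("2015-11-21T17:37:49.000Z", now))
```

If repeated runs of the unchanged query report a lag that keeps growing by exactly the wall-clock time elapsed, the timestamp itself is not moving, which is what a cached response would look like.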

The same thing now happens for the reference hash queries, so I'm not sure exactly what happened when I tested exactly this above.

Requests to the regular MW api are not cached for reads so I do not see why they should be here either!
The MW api returns the header

cache-control:private, must-revalidate, max-age=0
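For comparison, here is a rough sketch of how such a header value reads from the point of view of a shared cache. It is a deliberately simplified interpretation of HTTP caching (it only looks at `private` and `no-store`), the function names are made up, and Varnish's actual behaviour also depends on its own configuration.

```python
def parse_cache_control(value):
    """Split a Cache-Control value into a {directive: argument-or-None} dict."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def shared_cache_may_store(value):
    """Simplified check: 'private' or 'no-store' forbids shared-cache storage."""
    directives = parse_cache_control(value)
    return "private" not in directives and "no-store" not in directives


# The header value returned by the MW API, as quoted above.
print(shared_cache_may_store("private, must-revalidate, max-age=0"))
```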

I think disabling caching where appropriate would require the following change (though I'm not sure exactly which repo & branch I should put this patch on!):

https://github.com/wikimedia/wikidata-query-blazegraph/commit/8c3737237774be52275f03df79dab7a2dd5b5529

The only reason stuff is being cached right now is that the service sits behind Varnish.
Blazegraph itself would not expect these requests to be cached.

Update on this ticket: we've identified the cause (a combination of the updater using syntax that is questionable with respect to the standard, and Blazegraph having a bug in that particular form of syntax with the particular options that we use). We have a prospective solution, which I will implement soon, and then we will see whether the problem is gone.

See https://jira.blazegraph.com/browse/BLZG-1643 for details.
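The updater's actual query is not shown in this ticket, so the following is only a conceptual model of what the cleanup step has to achieve after an edit, not a rendering of the updater or of MINUS / FILTER NOT EXISTS semantics. The statement node name is a hypothetical placeholder; the two hashes are the old and new ones from the Q4115189 example above.

```python
PROV = "prov:wasDerivedFrom"


def triples_to_delete(stored, fresh):
    """Stored triples that the fresh entity data no longer contains."""
    return set(stored) - set(fresh)


# Hypothetical placeholder for the statement node.
statement = "wds:example-statement"
old_ref = "wdref:da74841d6aa4a1265212fdab587b27aaf35717d8"
new_ref = "wdref:b75b8d10cacdd56b4e5b9e140f063c3086185bd9"

stored = {(statement, PROV, old_ref)}  # what the query service holds
fresh = {(statement, PROV, new_ref)}   # what the edited entity now says

print(triples_to_delete(stored, fresh))
```

If the delete step misses the old link, both references stay attached to the statement, which would match the duplicate rows reported above.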

Change 255915 had a related patch set uploaded (by Smalyshev):
Fix deleting by using MINUS instead of FILTER NOT EXISTS

https://gerrit.wikimedia.org/r/255915

Change 255915 merged by jenkins-bot:
Fix deleting by using MINUS instead of FILTER NOT EXISTS

https://gerrit.wikimedia.org/r/255915

The fix is deployed. I will reload the database with a fresh dump this week, and then the issue should be resolved.

Closing as the fix is deployed and the DB is reloaded. If you see it again, please reopen and alert me.

I can confirm that I have not seen duplicates for some time now. The fix must have worked. Thanks!