
Wikidata SPARQL endpoint is missing data that was added months ago.
Closed, Declined (Public)

Description

SPARQL queries for items that were added months ago fail to find them. For example, the disease item https://www.wikidata.org/wiki/Q18975220 has the statement P486 (MeSH ID) with the string value "D004927".
Executing the following SPARQL query searching for the item with statement P486 and value "D004927" returns nothing.

https://query.wikidata.org/#PREFIX%20wd%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%20%0APREFIX%20wdt%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20*%20WHERE%20%7B%0A%20%20%3Fdisease%20wdt%3AP486%20%20%22D004927%22%20.%0A%20%20%0A%7D.

If you execute the following query (searching for all items with MeSH IDs), 4652 results come back and the above item is not in that list.

https://query.wikidata.org/#PREFIX%20wd%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%20%0APREFIX%20wdt%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20*%20WHERE%20%7B%0A%20%20%3Fdisease%20wdt%3AP486%20%20%3Fitem%20.%0A%20%20%0A%7D

If you then take a MeSH ID value from that list and query for it, the item is returned as expected:

https://query.wikidata.org/#PREFIX%20wd%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%20%0APREFIX%20wdt%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20*%20WHERE%20%7B%0A%20%20%3Fdisease%20wdt%3AP486%20%20%22D003093%22%20.%0A%20%20%0A%7D

This would indicate that the above item https://www.wikidata.org/wiki/Q18975220 is not in the SPARQL endpoint. However, it is possible to find it when querying on a property with the Wikidata Item datatype, like so:

https://query.wikidata.org/#PREFIX%20wd%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%20%0APREFIX%20wdt%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20*%20WHERE%20%7B%0A%20%20%3Fdisease%20wdt%3AP828%20%20%20wd%3AQ21102933%20%20.%0A%20%20%0A%7D

Summary: The item is only found when querying for claims with the Wikidata Item datatype. Claims on string datatype properties are not queryable for this item, even though these claims have been on the item since before August 2015.

Known scope of the issue:
Using the backlinks API, I retrieved all pages that link to Property:P486; after filtering out user pages, 5006 items with MeSH IDs (P486) remained. As mentioned above, the SPARQL query returns only 4652 results, which indicates that 354 items (5006 - 4652) are missing from the endpoint when searching by string datatype properties.
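
The comparison described above can be sketched as a set difference. This is a minimal illustration, assuming both result lists have already been reduced to item IDs such as "Q18975220"; the function name and the sample IDs in the usage line are hypothetical and not taken from the actual run.

```python
def missing_from_endpoint(backlink_items, sparql_items):
    """Return item IDs found via backlinks but absent from the SPARQL results."""
    missing = set(backlink_items) - set(sparql_items)
    # Sort numerically by the digits after the leading "Q" for stable output.
    return sorted(missing, key=lambda qid: int(qid[1:]))


# Hypothetical sample data, only to show the call shape.
print(missing_from_endpoint(["Q1", "Q18975220", "Q3"], ["Q1"]))
```

Listing the missing IDs, rather than only counting them, makes it possible to inspect individual items for a common cause.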

This is a very high priority for our Project Molecular Biology Bots because they identify items based on external identifiers.


Event Timeline

MicrobeBot raised the priority of this task from to High.
MicrobeBot updated the task description.
MicrobeBot added a subscriber: MicrobeBot.
Restricted Application added projects: Wikidata, Discovery. Feb 2 2016, 7:03 PM
Restricted Application added a subscriber: Aklapper.

For Q18975220, the statement with MeSH ID D004927 is marked as "deprecated". Deprecated statements do not show up in wdt: properties, as they are not considered "the best current truth". However, you can still retrieve them this way:

PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX wd: <http://www.wikidata.org/entity/> 

SELECT * WHERE {
  ?disease p:P486/ps:P486  "D004927" .
  
}

(see http://tinyurl.com/zk8chdt)
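
For bots that look items up by external identifier, the two query forms can be generated side by side. The following is a rough sketch, not an official client: the helper names and the User-Agent string are made up for illustration, and the endpoint URL https://query.wikidata.org/sparql with the `query` and `format=json` parameters is an assumption that does not appear in this task.

```python
import json
import urllib.parse
import urllib.request

# Assumed machine-readable endpoint; this task only links to the query UI.
ENDPOINT = "https://query.wikidata.org/sparql"


def truthy_query(prop, value):
    """wdt: form, which skips deprecated statements."""
    return (
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        f'SELECT ?item WHERE {{ ?item wdt:{prop} "{value}" . }}'
    )


def any_rank_query(prop, value):
    """p:/ps: form, which matches statements of any rank."""
    return (
        "PREFIX p: <http://www.wikidata.org/prop/>\n"
        "PREFIX ps: <http://www.wikidata.org/prop/statement/>\n"
        f'SELECT ?item WHERE {{ ?item p:{prop}/ps:{prop} "{value}" . }}'
    )


def run(query):
    """Send a query to the endpoint and return the bound ?item URIs."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        ENDPOINT + "?" + params,
        headers={"User-Agent": "example-sketch/0.1"},  # hypothetical agent string
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return [row["item"]["value"] for row in data["results"]["bindings"]]
```

Calling `run(any_rank_query("P486", "D004927"))` would issue the p:/ps: lookup shown above; `run(truthy_query("P486", "D004927"))` corresponds to the wdt: query from the task description.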

Smalyshev closed this task as Declined. Feb 2 2016, 7:32 PM
Smalyshev claimed this task.

In summary, if you want the "current truth", use wdt:; if you want "any link", use p:/ps:. I know it's a bit confusing, but given the variety of ways to represent aspects of knowledge, this is how it works now.
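
To check why a particular statement is excluded from wdt:, its rank can be read from the statement node. This is a sketch under assumptions: the function name is hypothetical, and the `wikibase:` prefix URI shown is the one in current use, which may differ from what the endpoint accepted when this task was filed.

```python
def rank_query(prop, value):
    """Build a query returning each matching item with the statement's rank."""
    return (
        "PREFIX p: <http://www.wikidata.org/prop/>\n"
        "PREFIX ps: <http://www.wikidata.org/prop/statement/>\n"
        "PREFIX wikibase: <http://wikiba.se/ontology#>\n"  # assumed current prefix
        "SELECT ?item ?rank WHERE {\n"
        f"  ?item p:{prop} ?statement .\n"
        f'  ?statement ps:{prop} "{value}" ;\n'
        "             wikibase:rank ?rank .\n"
        "}"
    )


print(rank_query("P486", "D004927"))
```

A `?rank` of `wikibase:DeprecatedRank` would explain why the item is absent from wdt: results.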

Please reopen if any data is still missing when using the p:/ps: lookup, and I will look into it.

Thank you for your answer! That makes sense.