
Duplicated data in queries
Closed, Resolved (Public)

Description

I am querying Wikidata for the number of speakers of each language that has a Wikipedia, and my query returns some unexplained duplicate entries.
For example, the following entry is returned twice:
{'language': 'http://www.wikidata.org/entity/Q29921', 'languageLabel': 'Inuktitut', 'nSpeakers': '30000'}
This happens even though I can find no duplication in the entry for that language, and the DISTINCT in my query should have eliminated it anyway.
I suspect this is a bug in the query service rather than a mistake on my part, because the same query returns different results when executed a few seconds apart. Data about Inuktitut and Esperanto seems to be always affected, whereas data about other languages is duplicated only sometimes. For example, running the same query at five-second intervals, I obtained the following numbers of duplicates: 34, 34, 24, 24, 6, 6, 6, 24, 6, 33.

Here is a Python 3 script that reproduces the problem. It runs 10 queries 5 seconds apart, printing the duplicate rows for the first query and only the duplicate count for the others:

#!/usr/bin/env python3
#-*- coding: UTF-8 -*-

import requests, time

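def queryWikidata(query):
    """Run a SPARQL query and return one flat {variable: value} dict per row."""
    WIKIDATAQUERYURL = 'https://query.wikidata.org/sparql'
    response = requests.get(WIKIDATAQUERYURL, params={'format': 'json', 'query': query})
    response.raise_for_status()  # fail on HTTP errors rather than on JSON parsing
    bindings = response.json()['results']['bindings']
    # Keep only the plain value of each bound variable.
    return [{var: row[var]['value'] for var in row} for row in bindings]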
    


def testQuery(echo):
    QUERY = """SELECT DISTINCT ?language ?languageLabel ?nSpeakers ?Lx ?LxLabel ?time ?country
    WHERE
    {
        ?language wdt:P31/wdt:P279* wd:Q34770.
        ?language p:P1098 ?nSpeakersStatement.
        ?nSpeakersStatement ps:P1098 ?nSpeakers
        optional {?nSpeakersStatement pq:P518 ?Lx}.
        optional {?nSpeakersStatement pq:P585 ?time}.
        optional {?nSpeakersStatement pq:P17 ?country}.
        FILTER EXISTS {?wikipedia wdt:P407 ?language}.
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
        
    }"""
    foundDuplicates = []
    data = queryWikidata(QUERY)
    data.sort(key=lambda x: (x['language'], x['nSpeakers']))
    for i in data:
        # Record each row that occurs more than once, but record it only once.
        if data.count(i) > 1 and i not in foundDuplicates:
            if echo:
                print(i)
            foundDuplicates.append(i)
    return len(foundDuplicates)

duplicateCount = []
duplicates = testQuery(True)
print('\nQuery', 1)
print('Duplicates:', duplicates)
duplicateCount.append(duplicates)
for i in range(2, 11):
    time.sleep(5)
    print('Query', i)
    duplicates = testQuery(False)
    print('Duplicates:', duplicates)
    duplicateCount.append(duplicates)
print(duplicateCount)
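
The duplicate check in the script can also be written without the quadratic `data.count` loop. The following is only a sketch of that idea: `find_duplicates` is a hypothetical helper that is not part of the script above, and the second entry in the sample data is a made-up placeholder, not real Wikidata output.

```python
from collections import Counter

def find_duplicates(rows):
    """Return a mapping from each repeated row to how often it occurs."""
    # Dicts are not hashable, so freeze each row into a sorted tuple of items.
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return {row: n for row, n in counts.items() if n > 1}

# The Inuktitut row is the one quoted in the report; the last row is a placeholder.
inuktitut = {'language': 'http://www.wikidata.org/entity/Q29921',
             'languageLabel': 'Inuktitut', 'nSpeakers': '30000'}
placeholder = {'language': 'http://example.org/placeholder',
               'languageLabel': 'Placeholder', 'nSpeakers': '0'}
rows = [dict(inuktitut), dict(inuktitut), placeholder]

duplicates = find_duplicates(rows)
print(len(duplicates))  # 1: only the Inuktitut row is repeated
```

Counting with `Counter` visits every row once, which matters little for this query but keeps the check cheap on larger result sets.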

Event Timeline

The standard says (https://www.w3.org/TR/sparql11-query/#propertypath-syntaxforms):

Evaluation of a property path expression can lead to duplicates because any variables introduced in the equivalent pattern are not part of the results and are not already used elsewhere. They are hidden by implicit projection of the results to just the variables given in the query.

I suspect this is the reason for the duplicate languages: the path expression wdt:P31/wdt:P279* can reach the same result by different paths, which produces duplicates. However, I am not sure why DISTINCT hasn't eliminated them.
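
To make the "different paths" point concrete, here is a toy model of how a sequence path such as P31/P279* can yield the same item more than once. It is a sketch under assumptions: the graph and all item names (`lang_x`, `modern_language`, `language`) are invented for illustration, and it models only the hidden intermediate variable of the `/` step, not how Blazegraph actually evaluates the query.

```python
# Toy graph; every name here is hypothetical.
INSTANCE_OF = {"lang_x": ["language", "modern_language"]}
SUBCLASS_OF = {"modern_language": ["language"]}

def subclass_closure(start):
    """Classes reachable from start via zero or more subclass-of steps."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for parent in SUBCLASS_OF.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def matches(item, target):
    """One solution per instance-of value whose closure contains target."""
    return [cls for cls in INSTANCE_OF.get(item, [])
            if target in subclass_closure(cls)]

solutions = matches("lang_x", "language")
print(len(solutions))  # 2: one solution per intermediate class
print(len({"lang_x" for _ in solutions}))  # 1 once the hidden variable is projected away and rows are deduplicated
```

Each intermediate class is a separate solution, so the item appears twice after projection; a working DISTINCT would collapse those rows back to one, which is why the duplicates in the report are unexpected.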

Smalyshev triaged this task as Medium priority. Dec 13 2016, 9:26 PM

Workaround: add this to the query:

hint:Query hint:analytic false .
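
A possible way to apply the workaround from a script is sketched below. The comment above only says to add the hint to the query; placing it right after the opening brace of the WHERE block is an assumption, and `add_analytic_hint` is a hypothetical helper. The string handling is deliberately naive and assumes the first `{` in the query opens the WHERE block, which holds for the query in the description.

```python
HINT = "hint:Query hint:analytic false ."

def add_analytic_hint(query):
    """Insert the query hint right after the first '{' of the query."""
    head, brace, tail = query.partition("{")
    if not brace:
        raise ValueError("query contains no '{'")
    return head + brace + "\n        " + HINT + tail

# Minimal example query, not the one from the report.
example = "SELECT DISTINCT ?s WHERE { ?s ?p ?o }"
print(add_analytic_hint(example))
```

With the script from the description, this would amount to calling `queryWikidata(add_analytic_hint(QUERY))` instead of `queryWikidata(QUERY)`.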
Smalyshev claimed this task.

Fixed in Blazegraph (BG) 2.1.5.