Investigation: uniqueness of statement IDs within an entity
Open, Needs Triage, Public

Description

According to Wikidata documentation:

[stmt_id is] An arbitrary identifier for the Statement, which is unique across the repository.

But on the Wikidata page for Secondary limb lymphedema (Q85046372), the page source shows that Q85046372$70E829CD-2D80-48D1-BB71-8EE2B5C22051 is used twice, each time with different underlying data:

`<div id="Q85046372$70E829CD-2D80-48D1-BB71-8EE2B5C22051" class="wikibase-statementview wikibase-statement-Q85046372$70E829CD-2D80-48D1-BB71-8EE2B5C22051 wb-normal">...</div>
...
<div id="Q85046372$70E829CD-2D80-48D1-BB71-8EE2B5C22051" class="wikibase-statementview wikibase-statement-Q85046372$70E829CD-2D80-48D1-BB71-8EE2B5C22051 wb-normal">...</div>`

Both IDs appear under cites work (P2860), one for each of these values: Arm morbidity after sector resection and axillary dissection with or without postoperative radiotherapy in breast cancer stage I. Results from a randomised trial. Uppsala-Orebro Breast Cancer Study Group (Q73307092) and Case-control study to evaluate predictors of lymphedema after breast cancer surgery (Q37410695). 195.191.163.76 07:26, 26 July 2024 (UTC)

REST API request that returns the two P2860 statements with identical IDs:
curl -s 'https://www.wikidata.org/w/rest.php/wikibase/v0/entities/items/Q85046372' | jq '.statements[][] | select(.id == "Q85046372$70E829CD-2D80-48D1-BB71-8EE2B5C22051")'
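The same check can be generalised to list every statement ID that occurs more than once within a single item. The following is a minimal Python sketch, not the production check. It assumes the REST API item JSON has a `statements` object mapping each property ID to a list of statement objects, each with an `id` field (the shape the jq filter above relies on). The function name `duplicate_statement_ids` and the trimmed `sample_item` are hypothetical; the third statement and its all-zero ID are made up purely for contrast.

```python
from __future__ import annotations

from collections import Counter


def duplicate_statement_ids(item: dict) -> dict[str, int]:
    """Return {statement_id: count} for IDs used more than once in one item."""
    counts = Counter(
        statement["id"]
        for statements in item.get("statements", {}).values()
        for statement in statements
    )
    return {sid: n for sid, n in counts.items() if n > 1}


# Hypothetical, heavily trimmed item. Only the duplicated ID and the two
# P2860 values come from this task; the P31 statement is invented.
sample_item = {
    "id": "Q85046372",
    "statements": {
        "P2860": [
            {
                "id": "Q85046372$70E829CD-2D80-48D1-BB71-8EE2B5C22051",
                "value": {"type": "value", "content": "Q73307092"},
            },
            {
                "id": "Q85046372$70E829CD-2D80-48D1-BB71-8EE2B5C22051",
                "value": {"type": "value", "content": "Q37410695"},
            },
        ],
        "P31": [
            {
                "id": "Q85046372$00000000-0000-0000-0000-000000000000",
                "value": {"type": "value", "content": "Q0"},
            },
        ],
    },
}

print(duplicate_statement_ids(sample_item))
# {'Q85046372$70E829CD-2D80-48D1-BB71-8EE2B5C22051': 2}
```

To run it against live data, the JSON returned by the curl command (without the jq filter) could be loaded with `json.load` and passed to the function.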

For this investigation we must look into why this is happening, and why we are not enforcing the uniqueness of statement IDs within an entity.

Unlike T356161, the duplicated statement IDs here always occur within the same entity. Searching the RDF dumps in Hadoop for more cases turns up 2 instances of this issue:

Q34433114 with Q34433114-684DE268-387D-4E42-8BD8-394C5C36D10C
Q85046372 with Q85046372-70E829CD-2D80-48D1-BB71-8EE2B5C22051

Note that the list above might be incomplete, because we deduplicate some data while importing the dumps into Hadoop.

This makes the RDF representation of these entities wrong by conflating two statements.
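The two IDs found in the RDF dumps are written with `-` after the entity ID, while the page source and the JSON use `$`. To compare the two lists, the RDF form can be mapped back to the `$` form. This is a small sketch that assumes the first `-` in the RDF form is the separator that replaced `$` (entity IDs such as `Q85046372` contain no hyphen); the helper name `rdf_to_statement_id` is hypothetical.

```python
def rdf_to_statement_id(rdf_id: str) -> str:
    """Turn 'Q123-GUID' (RDF dump form) into 'Q123$GUID' (JSON form)."""
    entity_id, separator, guid = rdf_id.partition("-")
    if not separator or not guid:
        raise ValueError(f"not an RDF-style statement ID: {rdf_id!r}")
    # Only the first hyphen is swapped; the hyphens inside the GUID stay.
    return f"{entity_id}${guid}"


print(rdf_to_statement_id("Q85046372-70E829CD-2D80-48D1-BB71-8EE2B5C22051"))
# Q85046372$70E829CD-2D80-48D1-BB71-8EE2B5C22051
```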

Event Timeline

Now 3!

id      claim_number
Q34433114$684DE268-387D-4E42-8BD8-394C5C36D10C  2
Q82477189$7BF24CAC-6BE9-4792-9FD3-A87AF4F76AB5  2
Q85046372$70E829CD-2D80-48D1-BB71-8EE2B5C22051  2

With query

SELECT
    claim.id,
    COUNT(1) AS claim_number
FROM
    wmf.wikidata_entity
LATERAL VIEW explode(claims) t AS claim
WHERE
    snapshot = '2024-08-26'
GROUP BY
    claim.id
HAVING
    COUNT(1) > 1;