WikibaseMediaInfo seems to reuse statement identifiers from other entities
Open, Needs Triage · Public

Description

Seen on M130887689 and M115086921: the content of the two Wikibase entities is almost identical.
The statement ids are the same, which is highly problematic for the Wikibase RDF representation, since it assumes that a statement id is unique and belongs to a single entity.
E.g. M130887689$83501cde-4a4b-a7d0-9832-5f1982be0c41 is referenced by both M130887689 & M115086921.
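
For illustration, a minimal sketch of how such a mismatch could be detected, assuming statement ids follow the `<entity id>$<uuid>` form seen above. The function names are hypothetical and not part of Wikibase, and the case-insensitive comparison of the prefix is an assumption as well.

```python
def statement_id_owner(statement_id: str) -> str:
    """Return the entity id prefix of a statement id shaped like '<entity id>$<uuid>'."""
    entity_id, sep, _uuid = statement_id.partition("$")
    if not sep or not entity_id:
        raise ValueError(f"not a statement id: {statement_id!r}")
    # Assumption: the prefix is compared case-insensitively.
    return entity_id.upper()


def is_foreign_statement(entity_id: str, statement_id: str) -> bool:
    """True when the statement id's prefix names a different entity than entity_id."""
    return statement_id_owner(statement_id) != entity_id.upper()


if __name__ == "__main__":
    sid = "M130887689$83501cde-4a4b-a7d0-9832-5f1982be0c41"
    print(is_foreign_statement("M130887689", sid))
    print(is_foreign_statement("M115086921", sid))
```

Note that a prefix check alone only flags the entity that did not mint the id; it does not prove the id is unique across entities.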

I'm not sure what actions have led to this situation but this should definitely be fixed to make sure that the statement ids are not shared.

AC:

  • identify what action caused an entity to reuse statement ids
  • determine if this problem affects Wikibase itself and Wikidata
  • fix this behavior
  • clean up existing entities that have non-unique statement ids

Event Timeline

Probably related to all of this in the first file’s log:

File uploaded; cropped version uploaded with CropTool; page temporarily deleted for history splitting of overwritten files (G5); page undeleted (5 revisions); page moved to cropped without redirect; page undeleted again (9 revisions)

I think we’ve had problems with file undeletion and history splitting interacting badly with SDC before.

T338147, T231276, and T231646 are some of the tasks I found now

@Lucas_Werkmeister_WMDE thanks for all the context! I get that it only affects WikibaseMediaInfo. Can we exclude Wikibase as a culprit possibly affecting wikidata or should we run a quick investigation to find possible duplicated statement identifiers in the wikidata RDF dumps?

I doubt the same situation is possible on Wikidata, since we disallow moving items, properties or lexemes, and the moving seems to be a crucial part of how the history was split here.

dcausse renamed this task from "WikibaseMediaInfo (or Wikibase?) seems to reuse statement identifiers from other entities" to "WikibaseMediaInfo seems to reuse statement identifiers from other entities". Jan 30 2024, 10:47 AM
dcausse updated the task description.

Scanning the dumps from 2024/01/21, we can find 1623 duplicated statement ids (full list here: https://people.wikimedia.org/~dcausse/T356161_sdc_duplicated_statement_ids.csv).
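
The exact method used for the scan is not recorded in this task. As a rough sketch of one way to find such ids, assuming the dump has already been reduced to (entity id, statement id) pairs; the function name and the input shape are hypothetical:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_shared_statement_ids(
    pairs: Iterable[Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Map each statement id used by more than one entity to the sorted list of those entities."""
    owners = defaultdict(set)
    for entity_id, statement_id in pairs:
        owners[statement_id].add(entity_id)
    # Keep only the statement ids that appear on at least two distinct entities.
    return {sid: sorted(ents) for sid, ents in owners.items() if len(ents) > 1}


if __name__ == "__main__":
    sample = [
        ("M130887689", "M130887689$83501cde-4a4b-a7d0-9832-5f1982be0c41"),
        ("M115086921", "M130887689$83501cde-4a4b-a7d0-9832-5f1982be0c41"),
        ("M1", "M1$00000000-0000-0000-0000-000000000000"),  # made-up id, unique
    ]
    for sid, entities in find_shared_statement_ids(sample).items():
        print(sid, entities)
```

This keeps every statement id in memory, which may not be practical for a full dump; sorting the pairs by statement id and comparing neighbours would be one alternative.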

The Structured Data team has been tagged and notified; the Search Platform team is not going to follow up further.