MediaInfo seems to allow entities to share the same statement IDs
Closed, Duplicate (Public)

Description

See in:

This is highly problematic because the RDF model does not treat such statements as "shareable": a statement ID is expected to belong to exactly one entity.
This breaks some components in the WCQS updater chain with:

java.lang.IllegalArgumentException: Cannot add/delete the same triple [(https://commons.wikimedia.org/entity/statement/M122879987-0E6FDA63-91C3-4566-84B7-1B0A460DFEE0, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://wikiba.se/ontology#BestRank)] for a different entities: [M69231551] and [M122879987]
        at org.wikidata.query.rdf.updater.consumer.PatchAccumulator.lambda$findInvalidStatements$10(PatchAccumulator.java:117)
        at java.util.HashMap.forEach(HashMap.java:1290)
        at org.wikidata.query.rdf.updater.consumer.PatchAccumulator.findInvalidStatements(PatchAccumulator.java:114)
        at org.wikidata.query.rdf.updater.consumer.PatchAccumulator.accumulate(PatchAccumulator.java:94)
        at org.wikidata.query.rdf.updater.consumer.PatchAccumulator.accumulateDiff(PatchAccumulator.java:237)
        at org.wikidata.query.rdf.updater.consumer.PatchAccumulator.accumulate(PatchAccumulator.java:169)
        at org.wikidata.query.rdf.updater.consumer.KafkaStreamConsumer.poll(KafkaStreamConsumer.java:142)
        at org.wikidata.query.rdf.updater.consumer.StreamingUpdaterConsumer.lambda$run$0(StreamingUpdaterConsumer.java:60)
        at org.wikidata.query.rdf.common.TimerCounter.time(TimerCounter.java:51)
        at org.wikidata.query.rdf.updater.consumer.StreamingUpdaterConsumer.run(StreamingUpdaterConsumer.java:60)
        at org.wikidata.query.rdf.updater.consumer.StreamingUpdate.main(StreamingUpdate.java:51)
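
To illustrate the kind of check that fails here, the following is a minimal sketch, not the actual `PatchAccumulator` code. The class `SharedStatementCheck`, the `EntityTriple` type, and the method `rejectSharedTriples` are hypothetical names introduced for illustration, and triples are simplified to plain strings. It assumes the check amounts to remembering which entity first contributed each triple and rejecting the batch when a different entity contributes the same one.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and types are assumptions, not the real updater code.
final class SharedStatementCheck {

    /** A triple (simplified to a string) paired with the entity whose diff produced it. */
    static final class EntityTriple {
        final String entityId;
        final String triple;

        EntityTriple(String entityId, String triple) {
            this.entityId = entityId;
            this.triple = triple;
        }
    }

    /** Throws if the same triple is contributed by two different entities in one batch. */
    static void rejectSharedTriples(List<EntityTriple> batch) {
        Map<String, String> ownerByTriple = new HashMap<>();
        for (EntityTriple et : batch) {
            // putIfAbsent returns the previous owner, or null if this triple is new.
            String previous = ownerByTriple.putIfAbsent(et.triple, et.entityId);
            if (previous != null && !previous.equals(et.entityId)) {
                throw new IllegalArgumentException(
                    "Cannot add/delete the same triple [" + et.triple
                        + "] for different entities: [" + previous
                        + "] and [" + et.entityId + "]");
            }
        }
    }

    public static void main(String[] args) {
        String triple = "statement/M122879987-0E6FDA63 rdf:type wikibase:BestRank";
        try {
            rejectSharedTriples(Arrays.asList(
                new EntityTriple("M69231551", triple),
                new EntityTriple("M122879987", triple)));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under this assumption, a statement ID prefixed with `M122879987` that also appears in the output for `M69231551` is enough to make the whole batch fail, which matches the exception above.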

Event Timeline

Tentatively setting the priority to High, as this will cause data consistency issues.
From the updater's perspective we have to relax this component to allow such data (it will just report a warning, and we will monitor such issues), but this is far from ideal.
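
A relaxed check of the kind described in this comment could look roughly like the sketch below. This is an illustration under stated assumptions, not the merged patch: the class `RelaxedSharedStatementCheck` and the method `warnOnSharedTriples` are hypothetical, each batch entry is simplified to an `{entityId, triple}` string pair, and `java.util.logging` stands in for whatever logging and monitoring the updater really uses.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.logging.Logger;

// Hypothetical sketch; names and types are assumptions, not the real updater code.
final class RelaxedSharedStatementCheck {
    private static final Logger LOG =
        Logger.getLogger(RelaxedSharedStatementCheck.class.getName());

    /**
     * Each entry is {entityId, triple}. Instead of throwing, logs a warning for every
     * triple contributed by more than one entity and returns those triples so a caller
     * can count or monitor them.
     */
    static Set<String> warnOnSharedTriples(List<String[]> batch) {
        Map<String, String> ownerByTriple = new HashMap<>();
        Set<String> shared = new LinkedHashSet<>();
        for (String[] entry : batch) {
            String entityId = entry[0];
            String triple = entry[1];
            // putIfAbsent returns the first owner seen, or null if the triple is new.
            String previous = ownerByTriple.putIfAbsent(triple, entityId);
            // shared.add returns false on repeats, so each triple is warned about once.
            if (previous != null && !previous.equals(entityId) && shared.add(triple)) {
                LOG.warning("Triple [" + triple + "] is shared by entities ["
                    + previous + "] and [" + entityId + "]");
            }
        }
        return shared;
    }
}
```

The trade-off is the one noted in the comment: the batch is no longer rejected, so the inconsistent data flows through and only the warning (or a counter derived from the returned set) signals the problem.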

Change 831538 had a related patch set uploaded (by DCausse; author: DCausse):

[wikidata/query/rdf@master] Relax the updater-consumer to allow duplicates

https://gerrit.wikimedia.org/r/831538

Change 831538 merged by jenkins-bot:

[wikidata/query/rdf@master] Relax the updater-consumer to allow duplicates

https://gerrit.wikimedia.org/r/831538

The consumer has been updated to work, but the underlying RDF should be fixed. Relaxing the consumer means we have disabled a sanity check, and in the long term the database will accumulate inconsistencies.