
The streaming updater should identify all shared statements properly
Closed, Resolved · Public

Description

As a WDQS user, I want triples shared by multiple entities to be handled as shared statements in the streaming updater, so that they are not deleted when one of the entities stops referencing them.

Some shared statements are still present in the RDF stream; they are currently only identified at consumption time, but they should be detected and categorized when they are produced.
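
For illustration, here is a minimal, hypothetical sketch of why entity-scoped attribution breaks on shared triples (simplified types and names, not the actual PatchAccumulator code):

import java.util.HashMap;
import java.util.Map;

class SharedTripleSketch {
    public static void main(String[] args) {
        // The triple, rendered as a plain string for illustration only.
        String triple = "<https://ce.wikipedia.org/wiki/SomePage> <http://schema.org/inLanguage> \"ce\"";

        // Per-entity attribution: the accumulator remembers which entity a triple belongs to.
        Map<String, String> tripleToEntity = new HashMap<>();
        tripleToEntity.put(triple, "Q2701416");   // first entity produces the triple

        String otherEntity = "Q25505610";         // a second entity produces the very same triple
        String owner = tripleToEntity.get(triple);
        if (owner != null && !owner.equals(otherEntity)) {
            // This is the situation the consumer currently rejects: the triple should have been
            // categorized as a shared statement by the producer instead of being entity-scoped.
            throw new IllegalArgumentException(
                "Cannot add/delete the same triple for a different entity "
                    + "(should probably be considered as a shared statement)");
        }
    }
}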

java.lang.IllegalArgumentException: Cannot add/delete the same triple for a different entity (should probably be considered as a shared statement)
        at org.wikidata.query.rdf.updater.consumer.PatchAccumulator.lambda$findInvalidStatements$6(PatchAccumulator.java:74)
        at java.util.HashMap.forEach(HashMap.java:1289)
        at org.wikidata.query.rdf.updater.consumer.PatchAccumulator.findInvalidStatements(PatchAccumulator.java:71)
        at org.wikidata.query.rdf.updater.consumer.PatchAccumulator.accumulate(PatchAccumulator.java:54)
        at org.wikidata.query.rdf.updater.consumer.PatchAccumulator.accumulate(PatchAccumulator.java:108)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
        at org.wikidata.query.rdf.updater.consumer.KafkaStreamConsumer.poll(KafkaStreamConsumer.java:131)
        at org.wikidata.query.rdf.updater.consumer.StreamingUpdaterConsumer.lambda$run$0(StreamingUpdaterConsumer.java:46)
        at org.wikidata.query.rdf.common.TimerCounter.time(TimerCounter.java:51)
        at org.wikidata.query.rdf.updater.consumer.StreamingUpdaterConsumer.run(StreamingUpdaterConsumer.java:46)
        at org.wikidata.query.rdf.updater.consumer.StreamingUpdate.main(StreamingUpdate.java:49)

AC:

  • the producer should identify all shared triples properly
  • the consumer should continue to fail when such triples are detected, but the log message should be clearer and include the triple and the entities it belongs to (see the sketch after this list)
  • bonus: the consumer should have a way to "fix up" these triples by "re-categorizing" them on the fly so that the RDF stream does not have to be re-generated
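
As a hypothetical sketch of the clearer failure message asked for in the second criterion (the helper name and types are illustrative, not the real code):

import static java.lang.String.format;

class ClearerErrorSketch {
    // Hypothetical helper: builds a message that names the triple and both entities involved.
    static IllegalArgumentException invalidSharedStatement(String triple, String previousEntity, String newEntity) {
        return new IllegalArgumentException(format(
            "Cannot add/delete the same triple [%s] for different entities: [%s] and [%s]",
            triple, previousEntity, newEntity));
    }

    public static void main(String[] args) {
        // Example values taken from the exception reported later in this task.
        throw invalidSharedStatement(
            "(https://ce.wikipedia.org/wiki/%D0%92%D0%B5%D1%80%D0%B8%D0%BD_%D0%A5%D0%BE%D1%82%D0%B0%D0%BD%D0%B0%D0%BD, http://schema.org/inLanguage, \"ce\")",
            "Q2701416", "Q25505610");
    }
}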

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 637544 had a related patch set uploaded (by DCausse; owner: DCausse):
[wikidata/query/rdf@master] Add better logging to detect shared statements

https://gerrit.wikimedia.org/r/637544

The detailed exception is:

java.lang.IllegalArgumentException: Cannot add/delete the same triple [(https://ce.wikipedia.org/wiki/%D0%92%D0%B5%D1%80%D0%B8%D0%BD_%D0%A5%D0%BE%D1%82%D0%B0%D0%BD%D0%B0%D0%BD, http://schema.org/inLanguage, "ce"^^<http://www.w3.org/2001/XMLSchema#string>)] for a different entities: [Q2701416] and [Q25505610]
        at org.wikidata.query.rdf.updater.consumer.PatchAccumulator.lambda$findInvalidStatements$10(PatchAccumulator.java:77)
        at java.util.HashMap.forEach(HashMap.java:1289)
        at org.wikidata.query.rdf.updater.consumer.PatchAccumulator.findInvalidStatements(PatchAccumulator.java:74)
        at org.wikidata.query.rdf.updater.consumer.PatchAccumulator.accumulate(PatchAccumulator.java:54)
        at org.wikidata.query.rdf.updater.consumer.PatchAccumulator.accumulate(PatchAccumulator.java:112)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
        at org.wikidata.query.rdf.updater.consumer.KafkaStreamConsumer.poll(KafkaStreamConsumer.java:134)
        at org.wikidata.query.rdf.updater.consumer.StreamingUpdaterConsumer.lambda$run$0(StreamingUpdaterConsumer.java:46)
        at org.wikidata.query.rdf.common.TimerCounter.time(TimerCounter.java:51)
        at org.wikidata.query.rdf.updater.consumer.StreamingUpdaterConsumer.run(StreamingUpdaterConsumer.java:46)
        at org.wikidata.query.rdf.updater.consumer.StreamingUpdate.main(StreamingUpdate.java:49)

Change 638603 had a related patch set uploaded (by DCausse; owner: DCausse):
[wikidata/query/rdf@master] Reclassify sitelinks after diffing

https://gerrit.wikimedia.org/r/638603
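
The patch title suggests re-categorizing sitelink triples after the diff step. Purely as an illustration of that idea (all names, the prefix check, and the bucketing below are assumptions of this sketch, not the content of the patch):

import java.util.ArrayList;
import java.util.List;

class ReclassifySitelinksSketch {
    static final class SimpleTriple {
        final String subject, predicate, object;
        SimpleTriple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    // Assumption for this sketch: sitelink subjects are client-wiki page URIs,
    // i.e. anything not under the Wikidata entity prefix.
    static boolean isSitelinkSubject(String subject) {
        return !subject.startsWith("http://www.wikidata.org/entity/");
    }

    // Move sitelink triples out of the entity-scoped bucket into the shared bucket,
    // so that two entities referencing the same sitelink page no longer conflict.
    static List<SimpleTriple> reclassify(List<SimpleTriple> entityScoped, List<SimpleTriple> shared) {
        List<SimpleTriple> stillEntityScoped = new ArrayList<>();
        for (SimpleTriple t : entityScoped) {
            if (isSitelinkSubject(t.subject)) {
                shared.add(t);                // re-categorized as a shared statement
            } else {
                stillEntityScoped.add(t);     // stays attributed to its single owning entity
            }
        }
        return stillEntityScoped;
    }
}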

Change 637544 merged by jenkins-bot:
[wikidata/query/rdf@master] Add better logging to detect shared statements

https://gerrit.wikimedia.org/r/637544

Change 638603 merged by jenkins-bot:
[wikidata/query/rdf@master] Reclassify sitelinks after diffing

https://gerrit.wikimedia.org/r/638603