Add a reconciliation strategy to the wdqs streaming updater
Closed, Resolved · Public · 13 Estimated Story Points

Description

As a user, when Wikidata updates fail (after multiple retries, etc.), I want the affected data to be updated eventually.

As a maintainer of WDQS, I want the streaming updater to be able to reconcile a wikibase item so that I can fix some inconsistencies without reloading the full database.

This can be achieved by introducing a new topic that the streaming updater would consume. Each message on this topic would indicate whether an item needs to be reconciled or deleted at a specific revision.

This can be used to reconcile missed events (MW bugs, missing events, late events) or failures when fetching the item data.

When a deletion is required, existing code will be used.
When the item to reconcile exists, the mutation message will contain all the entity data and the consumer will perform a full reconciliation.

Automatic reconciliation (probably via a batch job running from the analytics cluster) should be possible by reading side-outputs.

Ad-hoc reconciliation should be possible via a script (or possibly from wikibase itself if this is deemed necessary).

The schema of this new topic should be as follows:

  • meta: typical event metadata
  • item (string): the wikibase item to update
  • revision (long): the revision to work with
  • type (enum): create or delete
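
A minimal sketch of how such a message could be modelled and validated, for illustration only. The four field names come from the list above; the class names, the `from_dict` helper, and the validation rules are assumptions and do not describe the schema that was eventually merged.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict


class ReconcileType(Enum):
    """The two values the task lists for the `type` field."""
    CREATE = "create"
    DELETE = "delete"


@dataclass(frozen=True)
class ReconcileEvent:
    """Hypothetical in-memory form of one message on the reconciliation topic."""
    meta: Dict[str, Any]   # typical event metadata (contents not specified in the task)
    item: str              # the wikibase item to update, e.g. "Q42"
    revision: int          # the revision to work with ("long" in the task)
    type: ReconcileType    # create or delete

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "ReconcileEvent":
        """Build an event from a decoded JSON object, rejecting malformed input."""
        revision = raw["revision"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(revision, bool) or not isinstance(revision, int) or revision < 0:
            raise ValueError(f"revision must be a non-negative integer, got {revision!r}")
        item = raw["item"]
        if not isinstance(item, str) or not item:
            raise ValueError("item must be a non-empty string")
        # ReconcileType(...) raises ValueError for anything but "create"/"delete".
        return cls(meta=dict(raw["meta"]), item=item, revision=revision,
                   type=ReconcileType(raw["type"]))
```

For example, `ReconcileEvent.from_dict({"meta": {}, "item": "Q42", "revision": 123, "type": "delete"})` yields an event whose `type` is `ReconcileType.DELETE`, while an unknown `type` such as `"update"` raises `ValueError`.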

The flink operator determining the mutation to apply should be changed to support new conditions:

  • if the revision in the message is older than the one seen in the state, then an operation corresponding to the state is emitted:
    • reconcile if the state is CREATED, using the revision seen in the state and fetching the data for that revision
    • delete if the state is DELETED
  • if the revision in the message is newer than the one seen in the state (or the item was never seen), then an operation corresponding to the message is emitted:
    • reconcile if the message has type create, using the revision from the message
    • delete if the message has type delete
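
The conditions above can be sketched as a small pure function. This is not the Flink operator itself: the names `EntityState`, `Operation`, and `decide` are hypothetical, and two points the task leaves open are filled in as labelled assumptions (a message revision equal to the state revision follows the message, and a delete carries the revision of whichever side won).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EntityStatus(Enum):
    CREATED = "CREATED"
    DELETED = "DELETED"


@dataclass(frozen=True)
class EntityState:
    """What the operator last saw for an item (hypothetical shape)."""
    revision: int
    status: EntityStatus


@dataclass(frozen=True)
class Operation:
    """The mutation to emit: kind is "reconcile" or "delete"."""
    kind: str
    item: str
    revision: int


def decide(item: str, msg_revision: int, msg_type: str,
           state: Optional[EntityState]) -> Operation:
    """Choose the operation for a reconcile message given the stored state."""
    if msg_type not in ("create", "delete"):
        raise ValueError(f"unknown message type: {msg_type!r}")

    # Message is older than the state: trust the state.
    if state is not None and msg_revision < state.revision:
        if state.status is EntityStatus.CREATED:
            # Reconcile with the revision seen in the state; the entity data
            # would be fetched for that revision.
            return Operation("reconcile", item, state.revision)
        # Assumption: the delete carries the state's revision.
        return Operation("delete", item, state.revision)

    # Message is newer, or the item was never seen: trust the message.
    # Assumption: an equal revision is also handled here (the task does not say).
    kind = "reconcile" if msg_type == "create" else "delete"
    return Operation(kind, item, msg_revision)
```

For instance, `decide("Q42", 5, "delete", EntityState(10, EntityStatus.CREATED))` returns `Operation("reconcile", "Q42", 10)`: the stale delete is overridden by the newer state. Keeping the decision free of I/O makes each branch easy to unit test before wiring it into a stateful operator.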

AC:

Event Timeline

Restricted Application added a subscriber: Aklapper.
MPhamWMF moved this task from Incoming to Scaling on the Wikidata-Query-Service board.
MPhamWMF set the point value for this task to 13. (Oct 25 2021, 3:45 PM)

Change 737429 had a related patch set uploaded (by DCausse; author: DCausse):

[schemas/event/secondary@master] [WIP] rdf-streaming-updater: add a "reconcile" operation

https://gerrit.wikimedia.org/r/737429

Change 737436 had a related patch set uploaded (by DCausse; author: DCausse):

[wikidata/query/rdf@master] model: add a "reconcile" operation to MutationEventData

https://gerrit.wikimedia.org/r/737436

Change 737436 merged by jenkins-bot:

[wikidata/query/rdf@master] model: add a "reconcile" operation to MutationEventData

https://gerrit.wikimedia.org/r/737436

Change 739105 had a related patch set uploaded (by DCausse; author: DCausse):

[wikidata/query/rdf@master] consumer: add support for reconciliation

https://gerrit.wikimedia.org/r/739105

Change 739915 had a related patch set uploaded (by DCausse; author: DCausse):

[wikidata/query/rdf@master] Add a reconcile event platform event

https://gerrit.wikimedia.org/r/739915

Change 739916 had a related patch set uploaded (by DCausse; author: DCausse):

[wikidata/query/rdf@master] [WIP] producer: add support for reconciliation

https://gerrit.wikimedia.org/r/739916

Change 740109 had a related patch set uploaded (by DCausse; author: DCausse):

[schemas/event/primary@master] rdf_streaming_updater: add a reconcile event schema

https://gerrit.wikimedia.org/r/740109

Change 739105 merged by jenkins-bot:

[wikidata/query/rdf@master] consumer: add support for reconciliation

https://gerrit.wikimedia.org/r/739105

Change 739915 merged by jenkins-bot:

[wikidata/query/rdf@master] Add a reconcile event platform event

https://gerrit.wikimedia.org/r/739915

Change 752184 had a related patch set uploaded (by DCausse; author: DCausse):

[wikidata/query/rdf@master] Add a spark job to emit reconciliation events

https://gerrit.wikimedia.org/r/752184

Change 753788 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/mediawiki-config@master] rdf-streaming-updater: add the reconciliation stream

https://gerrit.wikimedia.org/r/753788

Change 753791 had a related patch set uploaded (by DCausse; author: DCausse):

[wikimedia/discovery/analytics@master] [WIP] Schedule rdf-streaming-updater reconciliation job

https://gerrit.wikimedia.org/r/753791

Change 756536 had a related patch set uploaded (by DCausse; author: DCausse):

[schemas/event/secondary@master] rdf_streaming_updater: add a reconcile event schema

https://gerrit.wikimedia.org/r/756536

Change 740109 abandoned by DCausse:

[schemas/event/primary@master] rdf_streaming_updater: add a reconcile event schema

Reason:

will continue using the secondary repo

https://gerrit.wikimedia.org/r/740109

Change 737429 merged by jenkins-bot:

[schemas/event/secondary@master] rdf-streaming-updater: add a "reconcile" operation

https://gerrit.wikimedia.org/r/737429

Change 756536 merged by jenkins-bot:

[schemas/event/secondary@master] rdf_streaming_updater: add a reconcile event schema

https://gerrit.wikimedia.org/r/756536

Change 757665 had a related patch set uploaded (by DCausse; author: DCausse):

[eventgate-wikimedia@master] Bump secondary schema repo to 117c3fa

https://gerrit.wikimedia.org/r/757665

Change 757665 merged by Ottomata:

[eventgate-wikimedia@master] Bump secondary schema repo to 117c3fa

https://gerrit.wikimedia.org/r/757665

Change 757667 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/deployment-charts@master] eventgate-main: update image to 2022-01-27-143826-production

https://gerrit.wikimedia.org/r/757667

Change 757667 merged by Ottomata:

[operations/deployment-charts@master] eventgate-main: update image to 2022-01-27-143826-production

https://gerrit.wikimedia.org/r/757667

Change 739916 merged by jenkins-bot:

[wikidata/query/rdf@master] producer: add support for reconciliation

https://gerrit.wikimedia.org/r/739916

Change 752184 merged by jenkins-bot:

[wikidata/query/rdf@master] Add a spark job to emit reconciliation events

https://gerrit.wikimedia.org/r/752184

Change 758827 had a related patch set uploaded (by DCausse; author: DCausse):

[eventgate-wikimedia@master] Bump secondary schema repo to 52e2206

https://gerrit.wikimedia.org/r/758827

Change 758827 merged by Ottomata:

[eventgate-wikimedia@master] Bump secondary schema repo to 52e2206

https://gerrit.wikimedia.org/r/758827

Change 758926 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/deployment-charts@master] eventgate-main: update image to 2022-02-01-141357-production

https://gerrit.wikimedia.org/r/758926

Change 758926 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-main: update image to 2022-02-01-141357-production

https://gerrit.wikimedia.org/r/758926

Change 753788 merged by jenkins-bot:

[operations/mediawiki-config@master] rdf-streaming-updater: add the reconciliation stream

https://gerrit.wikimedia.org/r/753788

Mentioned in SAL (#wikimedia-operations) [2022-02-02T01:03:13Z] <ebernhardson@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:753788|rdf-streaming-updater: add the reconciliation stream (T279541)]] (duration: 00m 49s)

Change 753791 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Schedule rdf-streaming-updater reconciliation job

https://gerrit.wikimedia.org/r/753791

EBernhardson subscribed.

The Airflow DAG has been deployed. I have left it turned off for now; when ready, someone will need to enable it (and potentially update the start_date).

Change 761647 had a related patch set uploaded (by DCausse; author: DCausse):

[wikidata/query/deploy@master] Add configuration for the reconciliation topic

https://gerrit.wikimedia.org/r/761647

Change 761647 merged by DCausse:

[wikidata/query/deploy@master] Add configuration for the reconciliation topic

https://gerrit.wikimedia.org/r/761647

Change 763514 had a related patch set uploaded (by DCausse; author: DCausse):

[wikimedia/discovery/analytics@master] Update rdf-spark-tools artifact and set proper start dates...

https://gerrit.wikimedia.org/r/763514

Change 763514 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Update rdf-spark-tools artifact and set proper start dates...

https://gerrit.wikimedia.org/r/763514

Deployment of this feature has been stopped due to T302340.

Change 772906 had a related patch set uploaded (by DCausse; author: DCausse):

[wikimedia/discovery/analytics@master] rdf_streaming_updater_reconcile: set proper start dates

https://gerrit.wikimedia.org/r/772906

Change 772906 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] rdf_streaming_updater_reconcile: set proper start dates

https://gerrit.wikimedia.org/r/772906