
Update WME with the missed events from Oct 20
Closed, Resolved | Public

Description

We have identified 1521 events that were missed from Oct 20.
Refer to this ticket and the wikimedia-enterprise/experiments/data-consistency-check repo for more details and the dataset.

We have 2 good options for ingesting those 1521 events (wikimedia-enterprise/experiments/data-consistency-check/-/blob/main/complete_dataset/missed.csv).

  1. Using bulk-ingestion workflow:

Bulk ingestion is largely a two-step process. First, all relevant article names are gathered into the article-names topic. Then the article-bulk service processes these events from the article-names topic and updates the WME dataset. Typically the articles are gathered using the "allpages" API (full ingestion of a project/namespace). Here we only want to update those 1521 events, so create a small script that publishes the article names from complete_dataset/missed.csv to the article-names topic, then start the article-bulk service to process them.
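A minimal sketch of such a script, under stated assumptions: the column names `project` and `title` in missed.csv, the JSON message shape, and the message key format are all hypothetical and should be matched to whatever article-bulk actually expects. The transport is injected as a `send(topic, key, value)` callable so the logic can be tried without a broker; in practice it would wrap the producer client used by the other WME services.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterable, Iterator

TOPIC = "article-names"  # topic name taken from the ticket


def read_missed(rows: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one de-duplicated record per missed article.

    Assumption: the CSV has 'project' and 'title' columns.
    """
    seen = set()
    for row in csv.DictReader(rows):
        key = (row["project"].strip(), row["title"].strip())
        # Skip incomplete rows and articles we have already emitted.
        if not all(key) or key in seen:
            continue
        seen.add(key)
        yield {"project": key[0], "name": key[1]}


def publish_missed(rows: Iterable[str],
                   send: Callable[[str, bytes, bytes], None]) -> int:
    """Send every missed article to the topic and return the message count."""
    count = 0
    for record in read_missed(rows):
        # Hypothetical key/value encoding; adjust to the real topic schema.
        key = f"{record['project']}/{record['name']}".encode("utf-8")
        value = json.dumps(record, ensure_ascii=False).encode("utf-8")
        send(TOPIC, key, value)
        count += 1
    return count


if __name__ == "__main__":
    # Dry run against an in-memory sample instead of a real broker.
    sample = io.StringIO("project,title\nenwiki,Earth\nenwiki,Earth\ndewiki,Erde\n")
    sent = []
    n = publish_missed(sample, lambda topic, key, value: sent.append((topic, key, value)))
    print(n, "messages queued for", TOPIC)
```

For the real run, `rows` would be the opened complete_dataset/missed.csv file, and `send` would call the producer and flush it before exiting.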

  2. Using "internal events"

We have an admin [[ wikimedia-enterprise/services/dags/-/blob/main/pipelines/admin/admin.py | dag ]] (which calls the admin gRPC service) to either delete or update articles in the WME system. This is almost like receiving events from eventstream, except that the events are generated internally and are not propagated to realtime.
The refresh endpoint that updates the WME dataset is not implemented yet. This option requires implementing refresh and updating the dag so that it can trigger updates as well. You can then use wikimedia-enterprise/experiments/data-consistency-check/-/blob/main/complete_dataset/missed.csv to generate internal events.
Refer to admin.proto here.
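
As a rough illustration of generating internal events from missed.csv, the rows could be grouped by project and split into batches. Everything about the payload is an assumption: the refresh endpoint does not exist yet, the field names (`action`, `project`, `names`) are hypothetical and not taken from admin.proto, and the CSV columns `project` and `title` are assumed.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def group_by_project(rows: Iterable[str]) -> Dict[str, List[str]]:
    """Collect unique article titles per project, preserving CSV order."""
    grouped = defaultdict(list)
    for row in csv.DictReader(rows):
        project, title = row["project"].strip(), row["title"].strip()
        if project and title and title not in grouped[project]:
            grouped[project].append(title)
    return dict(grouped)


def build_requests(grouped: Dict[str, List[str]],
                   batch_size: int = 100) -> Iterator[dict]:
    """Yield hypothetical refresh payloads, at most batch_size names each."""
    for project in sorted(grouped):
        names = grouped[project]
        for start in range(0, len(names), batch_size):
            yield {
                "action": "refresh",  # hypothetical action name
                "project": project,
                "names": names[start:start + batch_size],
            }


if __name__ == "__main__":
    sample = io.StringIO("project,title\nenwiki,Earth\nenwiki,Moon\ndewiki,Erde\n")
    for request in build_requests(group_by_project(sample), batch_size=2):
        print(request)
```

Batching keeps each dag run small and makes a failed batch easy to retry; the batch size of 100 is arbitrary.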

To do

  • Update WME using one of the options above for the missed events.
  • Re-run the wikimedia-enterprise/experiments/data-consistency-check script with the missed events (from missed.csv) as input. The missed events should now be present in WME, so the new missed count should be 0.

Acceptance criteria

  • The number of new missed events is 0
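
The acceptance check can be sketched as follows. This is not the data-consistency-check script itself: `is_current` is a hypothetical callback standing in for whatever lookup that script performs against WME, and articles are represented as assumed (project, title) pairs.

```python
from typing import Callable, Iterable, List, Tuple

Article = Tuple[str, str]  # (project, title), an assumed representation


def still_missing(missed: Iterable[Article],
                  is_current: Callable[[Article], bool]) -> List[Article]:
    """Return the previously missed articles that are still not up to date."""
    return [article for article in missed if not is_current(article)]


def summarize(total: int, remaining: int) -> str:
    """Format the remaining count as a fraction and a percentage of total."""
    pct = 100.0 * remaining / total if total else 0.0
    return f"{remaining}/{total} still missing ({pct:.1f}%)"


if __name__ == "__main__":
    missed = [("enwiki", "Earth"), ("enwiki", "Moon"), ("dewiki", "Erde")]
    # Stand-in for a real WME lookup: pretend only one article was re-ingested.
    reingested = {("enwiki", "Earth")}
    remaining = still_missing(missed, lambda a: a in reingested)
    print(summarize(len(missed), len(remaining)))
    # The acceptance criterion is met when nothing remains.
    print("accepted:", not remaining)
```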

Event Timeline

We should probably do option 2 in the long run.
But to recover from our current state, I recommend going forward with option 1.

There's no need to update with events: only about 6% (94/1522) of the originally missing articles have not been re-ingested.