Implement a new MERGE INTO job that consumes the new reconciliation stream into wmf_dumps.wikitext_raw
Closed, Resolved (Public)

Description

Blocked by T368745: MediaWiki reconciliation API and event enrichment pipeline, and T368753: Implement production mechanism that emits (wiki_db, revision_id) pairs for missing or inaccurate rows.

Consuming these new events will require a new MERGE INTO job, very similar to the existing events_merge_into.py. Ideally, changing only the source table in that pipeline should suffice, since the schema should be the same.

In this task we should:

  • Implement the PySpark MERGE INTO job
  • Run this job as part of the Airflow DAG created in T368753.
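
A minimal sketch of the kind of statement such a job might issue, assuming the reconciliation table shares the target's schema and that rows are matched on the (wiki_db, revision_id) pair named in T368753. The helper name, the source table name, and the `UPDATE SET *` / `INSERT *` clauses are illustrative assumptions, not the contents of events_merge_into.py.

```python
def build_merge_into_sql(target_table: str, source_table: str,
                         keys=("wiki_db", "revision_id")) -> str:
    """Build a MERGE INTO statement that upserts source rows into the target.

    Hypothetical helper: the real job may filter, deduplicate, or order
    events before merging.
    """
    # Match rows on every key column.
    on_clause = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    return (
        f"MERGE INTO {target_table} t\n"
        f"USING {source_table} s\n"
        f"ON {on_clause}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


sql = build_merge_into_sql(
    target_table="wmf_dumps.wikitext_raw",
    source_table="event.reconciliation_source",  # assumed name, for illustration only
)
print(sql)
```

In a PySpark job the resulting string would typically be passed to `spark.sql(sql)`. Keeping the source table as a parameter reflects the hope above that only the source needs to change.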

Event Timeline

Actually, now that I think more about it, because we were able to reuse the page_content_change schema, this task may just be an Airflow task reusing the existing process_events.py (formerly events_merge_into.py), but pointing it to the new table from the reconciliation stream.

To test, we can override the DagProperty hive_wikitext_raw_table with xcollazo.wikitext_raw_rc2_wiki_partitioned_plus_bf_on_rev_id.
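
The override idea can be sketched in plain Python as defaults merged with per-run overrides. This is only an illustration of the pattern: `resolve_properties` is a hypothetical helper, not the airflow-dags API, and the default value shown is assumed from the task title.

```python
# Assumed default: the production target table from the task title.
DEFAULT_PROPERTIES = {
    "hive_wikitext_raw_table": "wmf_dumps.wikitext_raw",
}


def resolve_properties(defaults: dict, overrides: dict) -> dict:
    """Return defaults with overrides applied, rejecting unknown keys."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"Unknown properties: {sorted(unknown)}")
    return {**defaults, **overrides}


# Point the test run at the scratch table instead of production.
test_properties = resolve_properties(
    DEFAULT_PROPERTIES,
    {"hive_wikitext_raw_table":
        "xcollazo.wikitext_raw_rc2_wiki_partitioned_plus_bf_on_rev_id"},
)
print(test_properties["hive_wikitext_raw_table"])
```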

Change #1088275 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/deployment-charts@master] dse-k8s-services: mw-dump: version bump image

https://gerrit.wikimedia.org/r/1088275

Follow-up from a longer thread on Slack.

I've been testing this DAG on a development Airflow instance and have validated that the following integrations work:

  • Events produced in the reconciliation topic are enriched.
  • Enriched reconciliation data is loaded into HDFS/Hive via the regular Gobblin event ingestion pipeline.
  • Page content, visibility, and enriched data are merged into the content history table by the DAG from https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/900.
  • Upon reconciliation MERGE INTO, the content history table is updated and aligns with the reconciliation enrichment topics.
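
One way to spot-check the last point is to compare (wiki_db, revision_id) pairs between the reconciliation events and the content history table. The sketch below uses made-up sample pairs and a hypothetical helper; it is not the validation that was actually run.

```python
def missing_pairs(reconciled, target):
    """Return (wiki_db, revision_id) pairs in `reconciled` but absent from `target`."""
    return sorted(set(reconciled) - set(target))


# Made-up sample data, for illustration only.
reconciled = [("enwiki", 101), ("enwiki", 102), ("dewiki", 7)]
target = [("enwiki", 101), ("dewiki", 7), ("dewiki", 8)]

print(missing_pairs(reconciled, target))  # → [('enwiki', 102)]
```

An empty result would indicate that every reconciled pair is present in the target table.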

@xcollazo is there any other test you'd like to do at this stage?

LGTM!

Change #1088275 merged by jenkins-bot:

[operations/deployment-charts@master] dse-k8s-services: mw-dump: version bump image

https://gerrit.wikimedia.org/r/1088275

@gmodena this is WAD, correct? Can we close?