Objectives
- Airflow job is implemented that transforms data from the mediawiki.page-content-change stream to Apache Iceberg tables
-
Write the required XML files from Iceberg to dumps.wikimedia.org with at most a 2 day delay.Will be done in a separate epic.
Proposed High Level Architecture
Dependencies:
- mediawiki.page_content_change (expected go live end of March 2023)
Expected Sub Tasks:
- Document approach (why iceberg, why Airflow, etc) -> https://phabricator.wikimedia.org/T335859
- Build hourly Airflow job to transform data from mediawiki.page-content-change to Iceberg tables -> https://phabricator.wikimedia.org/T335860
- Spike on understanding how the XML transformation happens -> https://phabricator.wikimedia.org/T335861
- Build daily Airflow job to generate XML files from Iceberg tables -> https://phabricator.wikimedia.org/T335862
- Tests/QA checks that content is the same as existing dumps (part of https://phabricator.wikimedia.org/T335862)
Out of scope:
- Any jobs to reconcile missed events in the mediawiki.page-content-change stream. As part of QA/testing we will analyze the results to see if drift is an issue