
Proof of concept for MediaWiki XML content dump via Event Platform, Iceberg and Spark
Closed, Resolved · Public

Description

Objectives

  • Implement an Airflow job that transforms data from the mediawiki.page_content_change stream into Apache Iceberg tables.
  • Write the required XML files from Iceberg to dumps.wikimedia.org with at most a 2-day delay. This will be done in a separate epic.
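
To make the second objective more concrete, here is a minimal sketch of rendering one row of page content as a `<page>` element. It is not the job built for this task. The field names (`page_title`, `namespace_id`, `page_id`, `rev_id`, `rev_dt`, `content_body`) are assumptions chosen for illustration, not the confirmed stream or table schema, and the XML layout is a simplified approximation of the dump format.

```python
import xml.etree.ElementTree as ET


def event_to_page_xml(event: dict) -> str:
    """Render one hypothetical page content row as a <page> XML string."""
    page = ET.Element("page")
    ET.SubElement(page, "title").text = event["page_title"]
    ET.SubElement(page, "ns").text = str(event["namespace_id"])
    ET.SubElement(page, "id").text = str(event["page_id"])

    revision = ET.SubElement(page, "revision")
    ET.SubElement(revision, "id").text = str(event["rev_id"])
    ET.SubElement(revision, "timestamp").text = event["rev_dt"]
    # ElementTree escapes &, < and > in text nodes when serializing.
    ET.SubElement(revision, "text").text = event["content_body"]

    return ET.tostring(page, encoding="unicode")


# Hypothetical sample row; all values are made up for illustration.
sample_event = {
    "page_title": "Example page",
    "namespace_id": 0,
    "page_id": 123,
    "rev_id": 456,
    "rev_dt": "2023-05-31T19:11:00Z",
    "content_body": "Hello & <world>",
}
page_xml = event_to_page_xml(sample_event)
```

In a real pipeline the rows would presumably be read from the Iceberg table with Spark and written out in bulk. The sketch only shows the per-row shape of the transformation.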

Proposed High Level Architecture

Dumps 2.0 Block Diagram.jpg (3×4 px, 335 KB)

Dependencies:

  • mediawiki.page_content_change (expected to go live at the end of March 2023)

Expected Sub Tasks:

Out of scope:

  • Any jobs to reconcile missed events in the mediawiki.page_content_change stream. As part of QA/testing, we will analyze the results to see whether drift is an issue.
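
One way the drift analysis mentioned above might be approached is to compare revision IDs from a trusted reference against those that actually landed in the table. The sketch below assumes both sides can be reduced to collections of revision IDs. The function name `find_drift` and its inputs are hypothetical and do not describe the QA process used for this task.

```python
def find_drift(reference_rev_ids, ingested_rev_ids):
    """Compare two collections of revision IDs and report the differences.

    "missing" holds IDs present in the reference but absent from the
    ingested data (for example, events the stream never delivered).
    "unexpected" holds IDs that were ingested but are not in the reference.
    """
    reference = set(reference_rev_ids)
    ingested = set(ingested_rev_ids)
    return {
        "missing": sorted(reference - ingested),
        "unexpected": sorted(ingested - reference),
    }


# Made-up IDs for illustration only.
drift = find_drift([101, 102, 103, 104], [102, 103, 104, 105])
```

If `missing` stays empty or very small over time, drift is probably not a practical concern. A growing list would suggest that a reconciliation job is worth revisiting.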

Related Objects

Status     Subtype  Assigned            Task
Open                VirginiaPoundstone
Resolved            Milimetric
Resolved            Milimetric
Resolved   Spike    Milimetric
Resolved            Milimetric
Resolved            VirginiaPoundstone
Resolved            Milimetric
Resolved            Milimetric
Resolved            Milimetric
Resolved            xcollazo
Resolved            xcollazo
Duplicate           None
Resolved            xcollazo
Duplicate           None
Resolved            xcollazo
Resolved            xcollazo
Resolved            xcollazo
Resolved            xcollazo
Resolved            JEbe-WMF
Resolved            xcollazo
Resolved            xcollazo
Duplicate           None
Resolved            xcollazo
Resolved            xcollazo
Resolved            JEbe-WMF
Resolved            xcollazo
Resolved            xcollazo
Resolved            JEbe-WMF
Resolved            xcollazo
Resolved            xcollazo
Resolved            xcollazo
Resolved            xcollazo

Event Timeline

Restricted Application added a subscriber: Aklapper.
Milimetric renamed this task from "Make Realtime MediaWiki XML content dump available for external consumption" to "Make MediaWiki XML content dump available for external consumption". · May 31 2023, 7:11 PM

Removing "Realtime" from the task description to reflect the decision to delay focusing on that for now, as dumps itself doesn't strictly need it, and we're still getting settled with the technology we'd need to implement it.

Also maybe remove 'XML' from the task description? We aren't sure yet?

(updated description with newer block diagram F38950232)

xcollazo renamed this task from "Make MediaWiki XML content dump available for external consumption" to "Proof of concept for MediaWiki XML content dump via Event Platform, Iceberg and Spark". · Feb 29 2024, 6:44 PM
xcollazo updated the task description.