
Dumps 2.0 Phase II: Production intermediate table milestone
Closed, Resolved, Public

Description

In T330296: Dumps 2.0 Phase I: Proof of concept for MediaWiki XML content dump via Event Platform, Iceberg and Spark, we worked out solutions for many of the technical risks associated with producing dumps via a set of data pipelines on top of our Hadoop infrastructure. One of the outputs of that epic was the creation of the intermediate table wmf_content.mediawiki_content_history_v1. This is an Iceberg table containing all of the revisions of all of the wikis over all of wikitime, updated on an hourly basis.

This intermediate table has intrinsic value beyond being a stepping stone for Dumps 2.0. It is, effectively, a more up-to-date version of the existing wmf.mediawiki_wikitext_history, which is only updated once per month. The intermediate table thus has the potential to cut the wait time of existing data pipelines from the typical ~19 days to 1 day, rather than the originally planned 1 hour (see T357859). (Note from the future: due to technical limitations we are doing daily updates instead of hourly ones; details at T377999.)
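As a rough illustration of how a downstream pipeline might read recent revisions from the intermediate table instead of waiting for the monthly snapshot, here is a minimal sketch that only builds the Spark SQL text. The table name comes from this task; the column names (wiki_id, page_id, revision_id, revision_dt) and the helper name build_recent_revisions_query are assumptions made for the example, so check them against the table documentation before use.

```python
import re
from datetime import date

# Table name as given in the task description.
TABLE = "wmf_content.mediawiki_content_history_v1"


def build_recent_revisions_query(wiki_id: str, since: date, limit: int = 100) -> str:
    """Return Spark SQL selecting one wiki's revisions made on or after `since`.

    Hypothetical helper: the column names used here are assumptions,
    not taken from the task.
    """
    # Only accept simple database-style wiki names such as "enwiki",
    # so the value can be embedded in the SQL text safely.
    if not re.fullmatch(r"[a-z0-9_]+", wiki_id):
        raise ValueError(f"unexpected wiki_id: {wiki_id!r}")
    return (
        "SELECT wiki_id, page_id, revision_id, revision_dt\n"
        f"FROM {TABLE}\n"
        f"WHERE wiki_id = '{wiki_id}'\n"
        f"  AND revision_dt >= TIMESTAMP '{since.isoformat()}'\n"
        "ORDER BY revision_dt DESC\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    print(build_recent_revisions_query("enwiki", date(2024, 6, 1), limit=10))
```

With an active SparkSession the resulting string could be passed to spark.sql(...); that step is left out here because it depends on the cluster setup.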

In this epic, we include tasks to get this intermediate table to production grade.

Related document: Dumps 2.0 System Overview and Task Breakdown

The final deliverable here is a table documented at: https://wikitech.wikimedia.org/wiki/Data_Platform/Data_Lake/Content/Mediawiki_content_history_v1

(After we finish here, we move on to T366752: Dumps 2.0 Phase III: Production level dumps (SDS 1.2))

Related Objects

Status     Subtype     Assigned     Task
Resolved               xcollazo
Resolved               xcollazo
Declined               None
Resolved               gmodena
Declined               gmodena
Declined               None
Resolved               xcollazo
Duplicate              xcollazo
Resolved               xcollazo
Resolved               xcollazo
Resolved               Milimetric
Resolved               Milimetric
Declined               None
Declined               gmodena
Resolved               xcollazo
Resolved               xcollazo
Resolved               xcollazo
Resolved               xcollazo
Duplicate              None
Resolved               JMeybohm
Resolved               xcollazo
Resolved               Milimetric
Duplicate              None
Resolved               xcollazo
Duplicate              None
Resolved               xcollazo
Resolved   BUG REPORT  xcollazo
Resolved               gmodena
Resolved               xcollazo
Resolved               xcollazo
Resolved               xcollazo
Resolved               gmodena
Resolved               tchin
Resolved               xcollazo
Resolved               xcollazo
Resolved               xcollazo
Resolved               gmodena
Resolved               gmodena
Resolved               Ahoelzl
Resolved               xcollazo
Resolved               xcollazo
Resolved               xcollazo
Resolved               xcollazo
Resolved               xcollazo
Resolved               tchin
Resolved               xcollazo
Resolved               xcollazo
Resolved               xcollazo
Invalid                None
Resolved               xcollazo
Resolved               xcollazo
Declined               BTullis
Resolved               pfischer
Resolved               xcollazo
Resolved               gmodena
Resolved               amastilovic
Resolved               xcollazo
Resolved               xcollazo
Resolved               xcollazo
Resolved               JAllemandou

Event Timeline

xcollazo renamed this task from "Dumps 2.0 - Production intermediate table milestone" to "Dumps 2.0 Phase II: Production intermediate table milestone". Jun 5 2024, 8:15 PM
xcollazo updated the task description.