
Implement job to generate Dump XML files
Closed, Resolved · Public · 8 Estimated Story Points

Description

User Story
As a data engineer, I need to build an Airflow job that generates XML files from the data produced in this ticket, so that I can check whether the output of the new process matches the output of the existing dump process.
Done is:
  • Job is running on a daily schedule on Airflow
  • We can limit the scope of this to 1 smaller wiki to make testing easier
  • Output of the process matches the output of the existing process (1 wiki)
Out of scope:
  • Publishing to dumps.wikimedia.org (once we have tested further and can increase the number of wikis this runs for, we can publish)

Event Timeline

lbowmaker renamed this task from "Implement job to generate XML files" to "Implement job to generate Dump XML files". May 3 2023, 1:36 PM
lbowmaker created this task.

Change 938941 had a related patch set uploaded (by Aqu; author: Aqu):

[analytics/refinery/source@master] WIP: Create a job to dump XML/SQL MW history files to HDFS

https://gerrit.wikimedia.org/r/938941

I have the first draft version in Gerrit.

  • About the partitioner: I have to manually import serializer utils from Spark (to mimic the RangePartitioner).
  • About the XML file creation: creating 1 file per partition with a custom name is not working yet.

Also, the output is currently limited to revision content.
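
For reference, here is a minimal sketch of what a page-id range partitioner along these lines could look like. The class name, the bounds array, and the linear lookup are illustrative assumptions, not the actual refinery code.

```
import org.apache.spark.Partitioner

// Hypothetical partitioner that mimics Spark's RangePartitioner over a
// pre-computed, sorted array of page-id upper bounds, so that all revisions
// of a given page land in the same partition.
class PageIdRangePartitioner(upperBounds: Array[Long]) extends Partitioner {

  // One partition per bound, plus a last partition for page ids above the final bound.
  override def numPartitions: Int = upperBounds.length + 1

  override def getPartition(key: Any): Int = {
    val pageId = key.asInstanceOf[Long]
    // A linear scan keeps the sketch simple; a binary search would be used in practice.
    val idx = upperBounds.indexWhere(pageId <= _)
    if (idx == -1) upperBounds.length else idx
  }
}
```

A pair RDD keyed by page id could then be partitioned with partitionBy(new PageIdRangePartitioner(bounds)) so that every revision of a page ends up in the same output file.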

Given this is new code for a new project, do you think we could move it to GitLab? We created this new GitLab subgroup to try to keep dumps code in one place: https://gitlab.wikimedia.org/repos/data-engineering/dumps.

OK to move to GitLab. 👍 I'm making it work first.

WDoranWMF changed the point value for this task from 5 to 8. Aug 24 2023, 2:22 PM

What has been done as a first step:

  • Custom partitioner POC
  • First implementation
  • Clarifying source & result expectations

Then, following comments from @Milimetric, @xcollazo & @joal, what has already been done:

  • Change: switch from 1 custom writer per partition with a per-page accumulator (which could grow too large) to the standard Spark writer (see the sketch after this list)
  • Debugging on cluster
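
For illustration, the standard-writer approach could look roughly like the sketch below, assuming the XML fragments have already been rendered into a single string column. The object, method, and column names are assumptions, not the actual job code.

```
import org.apache.spark.sql.DataFrame

object XmlFragmentWriter {
  // Assumes `fragments` holds one row per page, with a single string column "xml"
  // that already contains the rendered <page>...</page> fragment.
  def writeXmlFragments(fragments: DataFrame, outputPath: String): Unit =
    fragments
      .select("xml")
      .write
      .option("compression", "bzip2")
      .text(outputPath) // standard Spark text writer: one part-* file per partition
}
```

The trade-off, as noted earlier, is that the built-in writer names its outputs part-*, so any custom file naming has to happen in a post-processing step.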

And what remains to be done:

  • Add unit tests (aggregateBySize really needs them; pageIdBounds too)
  • Only pass wikiDB to XML fragment (no params)
  • Add new trait pageBoundariesAware
  • Add new class: PageBoundariesDefiner
  • Add new case class: PageBoundary
  • Switch to computing the list of page bounds only once and use an identity partitioner (sketched below)
  • Add more logging & comments
  • Mutualize the archiveData method from the MediaWiki history dumper (via a trait?)
  • Tests on the cluster: compile + launch with the jar (small wiki + wikidata/enwiki)
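
To illustrate the PageBoundary / identity-partitioner items above, here is a hedged sketch; the field names and the trivial partitioner are assumptions based on the bullet points, not the merged code.

```
import org.apache.spark.Partitioner

// Hypothetical boundary descriptor: a contiguous range of page ids assigned
// to one output partition (and therefore one output file).
case class PageBoundary(partitionIndex: Int, startPageId: Long, endPageId: Long)

// Once each row has been tagged with its partition index (looked up once from
// the pre-computed PageBoundary list), the partitioner itself becomes trivial.
class IdentityPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key.asInstanceOf[Int]
}
```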

Thanks @Antoine_Quhen! I'll check in with @Milimetric and we can pull the remaining work forward into our next sprint.

I have done part of the refactor in this change: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/938941/2..3
including:

  • Adding unit tests for the important parts of the code (see the test sketch after this list)
  • Adding some new traits and classes, and renaming some classes for better comprehension
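
As an example of the kind of unit test meant above, here is a small ScalaTest sketch against a simplified stand-in for the size-based grouping; aggregateBySize below is an illustration, not the refinery implementation.

```
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

// Simplified stand-in: group (pageId, sizeInBytes) pairs into buckets whose
// total size stays under maxBytes, preserving page-id order.
object SizeAggregation {
  def aggregateBySize(pages: Seq[(Long, Long)], maxBytes: Long): Seq[Seq[Long]] =
    pages.foldLeft((Vector.empty[Vector[Long]], 0L)) { case ((buckets, acc), (pageId, size)) =>
      if (buckets.isEmpty || acc + size > maxBytes)
        (buckets :+ Vector(pageId), size)                       // start a new bucket
      else
        (buckets.init :+ (buckets.last :+ pageId), acc + size)  // extend the current bucket
    }._1
}

class SizeAggregationSpec extends AnyFlatSpec with Matchers {
  "aggregateBySize" should "keep each bucket under the size limit" in {
    val pages = Seq((1L, 40L), (2L, 40L), (3L, 40L), (4L, 10L))
    SizeAggregation.aggregateBySize(pages, maxBytes = 100L) shouldBe
      Seq(Seq(1L, 2L), Seq(3L, 4L))
  }
}
```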

The next big steps are:

  • Switch to computing the list of page bounds only once (done) and use an identity partitioner
  • Mutualize the archiveData method from the MediaWiki history dumper (via a trait? see the sketch after this list)
  • Tests on the cluster: compile + launch with the jar (small wiki + wikidata/enwiki)
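
On the "mutualize archiveData via a trait" item, the shape could be something like the following sketch; the trait name, signature, and placeholder body are guesses, not the actual MediaWiki history dumper method.

```
import org.apache.spark.sql.DataFrame

// Hypothetical shared trait carrying a default archiveData implementation,
// so the XML/SQL dumper and the MediaWiki history dumper can mix it in
// instead of duplicating the method.
trait Archiver {
  def archiveData(data: DataFrame, outputPath: String): Unit =
    data.write.mode("overwrite").text(outputPath) // placeholder body for the sketch
}

// Both jobs would then simply extend the trait, e.g.:
//   object MediawikiHistoryDumper extends Archiver { ... }
//   object MediawikiXmlDumper extends Archiver { ... }
```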

Merging this now, with the understanding that there's work to do to make the XML match perfectly in all cases. That's the job of other tasks.

Change 938941 merged by jenkins-bot:

[analytics/refinery/source@master] Create a job to dump XML/SQL MW history files to HDFS

https://gerrit.wikimedia.org/r/938941