
Implement job to generate Dump XML files
Closed, Resolved · Public · 8 Estimated Story Points

Description

User Story
As a data engineer, I need to build an Airflow job that generates XML files from the data produced in this ticket, so that I can check whether the output of the new process matches the output of the existing dump process.
Done is:
  • Job is running on a daily schedule on Airflow
  • We can limit the scope of this to 1 smaller wiki to make testing easier
  • Output of the process matches the output of the existing process (1 wiki)
Out of scope:
  • Publishing to dumps.wikimedia.org (once we have tested further and can increase the number of wikis this runs for, we can publish)

Event Timeline

lbowmaker renamed this task from "Implement job to generate XML files" to "Implement job to generate Dump XML files". May 3 2023, 1:36 PM
lbowmaker created this task.

Change 938941 had a related patch set uploaded (by Aqu; author: Aqu):

[analytics/refinery/source@master] WIP: Create a job to dump XML/SQL MW history files to HDFS

https://gerrit.wikimedia.org/r/938941

I have the first draft version in Gerrit.

  • About the partitioner: I have to manually import serializer utils from Spark (to mimic the RangePartitioner).
  • About the XML file creation: creating 1 file per partition with a custom name is not working yet.

Also, the output is currently limited to revision content.
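
For reference, here is a minimal sketch of what a page-id range partitioner along these lines could look like. The class name, the bounds array, and the linear lookup are illustrative assumptions, not the actual refinery code.

```
import org.apache.spark.Partitioner

// Hypothetical partitioner that mimics Spark's RangePartitioner over a
// pre-computed, sorted array of page-id upper bounds, so that all revisions
// of a given page land in the same partition.
class PageIdRangePartitioner(upperBounds: Array[Long]) extends Partitioner {

  // One partition per bound, plus a last partition for page ids above the final bound.
  override def numPartitions: Int = upperBounds.length + 1

  override def getPartition(key: Any): Int = {
    val pageId = key.asInstanceOf[Long]
    // A linear scan keeps the sketch simple; a binary search would be used in practice.
    val idx = upperBounds.indexWhere(pageId <= _)
    if (idx == -1) upperBounds.length else idx
  }
}
```

A pair RDD keyed by page id could then be partitioned with partitionBy(new PageIdRangePartitioner(bounds)) so that every revision of a page ends up in the same output file.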

Given this is new code for a new project, do you think we could move it to GitLab? We created this new GitLab subgroup to try to keep dumps code in one place: https://gitlab.wikimedia.org/repos/data-engineering/dumps.

OK to move to GitLab. 👍 I'm making it work first.

WDoranWMF changed the point value for this task from 5 to 8. Aug 24 2023, 2:22 PM

What has been done as a first step:

  • Custom partitioner POC
  • First implementation
  • Clarifying source & result expectations

Then, following comments from @Milimetric, @xcollazo & @joal, what has already been done:

  • Change: switch from 1 custom writer per partition with a per-page accumulator (which could grow too large) to the standard Spark writer (see the sketch after this list)
  • Debugging on cluster
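
For illustration, the standard-writer approach could look roughly like the sketch below, assuming the XML fragments have already been rendered into a single string column. The object, method, and column names are assumptions, not the actual job code.

```
import org.apache.spark.sql.DataFrame

object XmlFragmentWriter {
  // Assumes `fragments` holds one row per page, with a single string column "xml"
  // that already contains the rendered <page>...</page> fragment.
  def writeXmlFragments(fragments: DataFrame, outputPath: String): Unit =
    fragments
      .select("xml")
      .write
      .option("compression", "bzip2")
      .text(outputPath) // standard Spark text writer: one part-* file per partition
}
```

The trade-off, as noted earlier, is that the built-in writer names its outputs part-*, so any custom file naming has to happen in a post-processing step.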

And what remains to be done:

  • Add unit tests (aggregateBySize really needs them; pageIdBounds too)
  • Only pass wikiDB to XML fragment (no params)
  • Add new trait pageBoundariesAware
  • Add new class: PageBoundariesDefiner
  • Add new case class: PageBoundary
  • Switch to computing the list of page bounds only once and use an identity partitioner (sketched below)
  • Add more logging & comments
  • Mutualize the archiveData method from the MediaWiki history dumper (via a trait?)
  • Tests on the cluster: compile + launch with the jar (small wiki + wikidata/enwiki)
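
To illustrate the PageBoundary / identity-partitioner items above, here is a hedged sketch; the field names and the trivial partitioner are assumptions based on the bullet points, not the merged code.

```
import org.apache.spark.Partitioner

// Hypothetical boundary descriptor: a contiguous range of page ids assigned
// to one output partition (and therefore one output file).
case class PageBoundary(partitionIndex: Int, startPageId: Long, endPageId: Long)

// Once each row has been tagged with its partition index (looked up once from
// the pre-computed PageBoundary list), the partitioner itself becomes trivial.
class IdentityPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key.asInstanceOf[Int]
}
```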

Thanks @Antoine_Quhen! I'll check in with @Milimetric and we can pull the remaining work forward into our next sprint.

I have done part of the refactor in this change: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/938941/2..3
including:

  • Adding unit tests for the important parts of the code (see the test sketch after this list)
  • Adding some new traits and classes, and renaming some classes for better comprehension
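
As an example of the kind of unit test meant above, here is a small ScalaTest sketch against a simplified stand-in for the size-based grouping; aggregateBySize below is an illustration, not the refinery implementation.

```
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

// Simplified stand-in: group (pageId, sizeInBytes) pairs into buckets whose
// total size stays under maxBytes, preserving page-id order.
object SizeAggregation {
  def aggregateBySize(pages: Seq[(Long, Long)], maxBytes: Long): Seq[Seq[Long]] =
    pages.foldLeft((Vector.empty[Vector[Long]], 0L)) { case ((buckets, acc), (pageId, size)) =>
      if (buckets.isEmpty || acc + size > maxBytes)
        (buckets :+ Vector(pageId), size)                       // start a new bucket
      else
        (buckets.init :+ (buckets.last :+ pageId), acc + size)  // extend the current bucket
    }._1
}

class SizeAggregationSpec extends AnyFlatSpec with Matchers {
  "aggregateBySize" should "keep each bucket under the size limit" in {
    val pages = Seq((1L, 40L), (2L, 40L), (3L, 40L), (4L, 10L))
    SizeAggregation.aggregateBySize(pages, maxBytes = 100L) shouldBe
      Seq(Seq(1L, 2L), Seq(3L, 4L))
  }
}
```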

The next big steps are:

  • Switch to computing the list of page bounds only once (done) and use an identity partitioner
  • Mutualize the archiveData method from the MediaWiki history dumper (via a trait? see the sketch after this list)
  • Tests on the cluster: compile + launch with the jar (small wiki + wikidata/enwiki)
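
On the "mutualize archiveData via a trait" item, the shape could be something like the following sketch; the trait name, signature, and placeholder body are guesses, not the actual MediaWiki history dumper method.

```
import org.apache.spark.sql.DataFrame

// Hypothetical shared trait carrying a default archiveData implementation,
// so the XML/SQL dumper and the MediaWiki history dumper can mix it in
// instead of duplicating the method.
trait Archiver {
  def archiveData(data: DataFrame, outputPath: String): Unit =
    data.write.mode("overwrite").text(outputPath) // placeholder body for the sketch
}

// Both jobs would then simply extend the trait, e.g.:
//   object MediawikiHistoryDumper extends Archiver { ... }
//   object MediawikiXmlDumper extends Archiver { ... }
```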

Merging this now, with the understanding that there's work to do to make the XML match perfectly in all cases. That's the job of other tasks.

Change 938941 merged by jenkins-bot:

[analytics/refinery/source@master] Create a job to dump XML/SQL MW history files to HDFS

https://gerrit.wikimedia.org/r/938941