
Daily updated wmf_content.mediawiki_content_current_v1
Closed, Resolved · Public

Description

In T366544, we spiked how best to build a pipeline for creating wmf_content.mediawiki_content_current_v1.

In this epic, we want to attach all tasks for bringing this PoC to a production-level table.

This is done to support SDS 1.4.1:

Hypothesis: “If we provide a daily updated table wmf_content.mediawiki_content_current_v1 in the datalake that includes the content of the current revision for all pages for all wikis, we will then simplify the integration work and reduce compute resources necessary for downstream consumers that only care about the latest state.”
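
To make the intended consumption pattern concrete, here is a minimal PySpark sketch of how a downstream job could read this table instead of reconstructing the latest state from the full revision history. The column names used here (wiki_db, page_title, revision_content) are assumptions for illustration; the actual schema is not spelled out in this task.

```python
# Minimal sketch of a downstream consumer reading the daily-updated table.
# Column names below are assumptions; check the published schema before use.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("latest-content-consumer").getOrCreate()

latest = spark.sql("""
    SELECT wiki_db, page_title, revision_content
    FROM wmf_content.mediawiki_content_current_v1
    WHERE wiki_db = 'enwiki'  -- restrict to one wiki for this example
""")

# Downstream jobs can work with the current revision directly, without
# replaying the full revision history themselves.
latest.show(5, truncate=80)
```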

Event Timeline

Ahoelzl changed the task status from Open to In Progress. (Apr 25 2025, 4:35 PM)
Ahoelzl triaged this task as High priority.
Ahoelzl moved this task from Incoming (new tickets) to Tag with Roadmap on the Data-Engineering board.
Ahoelzl edited projects, added Data-Engineering-Roadmap; removed Data-Engineering.
Ahoelzl moved this task from Backlog to Q4 FY24-25 on the Data-Engineering-Roadmap board.

Copy-pasting the final Asana report:

Hypothesis: “If we provide a daily updated table wmf_content.mediawiki_content_current_v1 in the datalake that includes the content of the current revision for all pages for all wikis, we will then simplify the integration work and reduce compute resources necessary for downstream consumers that only care about the latest state.”

Confirm whether the hypothesis was supported or contradicted
The hypothesis is supported.

Briefly describe what was accomplished over the course of the hypothesis work (list of deliverables, links to documents, etc.)
Over the course of this hypothesis work we delivered the following:

  • A datalake table named wmf_content.mediawiki_content_current_v1, which contains the current revision for all pages for all wikis, to be used for internal consumption and updated daily. Typical use cases include: research pipelines, structured content pipelines, ad hoc querying, and also serving as an intermediate table for doing File Export (aka Dumps 2).
  • Various experiments to make this pipeline fast and stable.
  • Automated data quality metrics to make sure the data that we put in this table is good for downstream consumers (see the sketch after this list).
  • Consultation work with downstream consumers.
  • Further info can be obtained from the phabricator epic: https://phabricator.wikimedia.org/T391279.
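
To illustrate the kind of automated data quality check referenced in the list above, here is a hedged PySpark sketch. The production pipeline uses WMF's own data-quality tooling; the column names and pass/fail conditions below are illustrative assumptions only.

```python
# Illustrative data-quality check on the daily snapshot (not the production DQ job).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("content-current-dq-check").getOrCreate()

snapshot = spark.table("wmf_content.mediawiki_content_current_v1")

metrics = snapshot.agg(
    F.count("*").alias("row_count"),
    F.countDistinct("wiki_db").alias("wiki_count"),          # assumed column
    F.sum(F.when(F.col("revision_content").isNull(), 1).otherwise(0))
     .alias("null_content_rows"),                            # assumed column
).collect()[0]

# Fail the run if the snapshot looks implausible for "all pages for all wikis".
assert metrics.row_count > 0, "table is empty"
assert metrics.wiki_count > 1, "expected content from many wikis"
assert metrics.null_content_rows == 0, "some pages are missing content"
```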

Major lessons
The work and the learnings from SDS 1.3.2.B (Daily updated MW history wikitext content data lake table) clearly made this effort easier and smoother than it would otherwise have been. Even so, we still struggled to make certain performance improvements at the Apache Iceberg (a data lake table format) level, and we continue to learn how to scale this technology given our need to process ~25 TB of data for each daily update of this table. Having said that, Apache Iceberg continues to allow us to tackle use cases that we simply could not handle two years ago.
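
As an illustration of what working at the Apache Iceberg level involves at this scale, below is a hedged sketch of routine table maintenance (compacting small files and expiring old snapshots) using Iceberg's standard Spark procedures. Whether the production pipeline calls these exact procedures, and the catalog name spark_catalog, are assumptions.

```python
# Hedged sketch of Iceberg table maintenance after large daily rewrites.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("content-current-iceberg-maintenance")
    # Iceberg's SQL extensions are needed for CALL procedures.
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    .getOrCreate()
)

# Compact the many small files a daily update can produce into larger ones.
spark.sql("""
    CALL spark_catalog.system.rewrite_data_files(
        table => 'wmf_content.mediawiki_content_current_v1'
    )
""")

# Expire old snapshots so storage for a ~25 TB table does not grow unbounded.
spark.sql("""
    CALL spark_catalog.system.expire_snapshots(
        table => 'wmf_content.mediawiki_content_current_v1',
        older_than => TIMESTAMP '2025-04-01 00:00:00'
    )
""")
```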