On T330296: Dumps 2.0 Phase I: Proof of concept for MediaWiki XML content dump via Event Platform, Iceberg and Spark, we worked out solutions for many of the technical risks associated with producing dumps via a set of data pipelines on top of our Hadoop infrastructure. One of the outputs of that epic was the creation of the intermediate table wmf_content.mediawiki_content_history_v1. This is an Iceberg table containing all revisions of all wikis over all of wikitime, updated on an hourly basis.
This intermediate table has value beyond serving as a stepping stone for Dumps 2.0. It is, effectively, a more up-to-date version of the existing wmf.mediawiki_wikitext_history, which is only updated once per month. This intermediate table thus has the potential to accelerate existing data pipelines from their typical ~19-day wait time to ~~1 hour~~ 1 day (see T357859). (Note from the future: instead of every hour, due to technical limitations we are doing daily updates; details at T377999.)
In this epic, we include the tasks needed to bring this intermediate table to production grade.
Related document: Dumps 2.0 System Overview and Task Breakdown
The final deliverable is a table documented at: https://wikitech.wikimedia.org/wiki/Data_Platform/Data_Lake/Content/Mediawiki_content_history_v1
(After we finish here, we move on to T366752: Dumps 2.0 Phase III: Production level dumps (SDS 1.2))