For Dumps 2.0, we are generating an intermediate table, wmf_dumps.wikitext_raw, that has intrinsic value other than as a stepping stone.
This table will effectively be a more up-to-date version of the existing wmf.mediawiki_wikitext_history table: we intend to update the intermediate table every hour, while the existing wmf.mediawiki_wikitext_history table is refreshed only once per month. This intermediate table thus has the potential to accelerate existing data pipelines, such as the one discussed in T357859.
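As a rough illustration of the acceleration opportunity: a downstream job could switch from the monthly table to the hourly one by changing only the source table name. This is a hypothetical sketch, not an actual pipeline; the column names (`revision_id`, `revision_text`, `wiki_db`) are assumptions about the shared schema.

```python
# Both table names are real; everything else here is illustrative.
MONTHLY_TABLE = "wmf.mediawiki_wikitext_history"
HOURLY_TABLE = "wmf_dumps.wikitext_raw"

def revisions_query(table: str = HOURLY_TABLE, wiki_db: str = "enwiki") -> str:
    """Build the query a downstream job would run; only the source table
    changes when migrating from the monthly to the hourly dataset.
    Column names are assumed, not confirmed against the actual schema."""
    return (
        f"SELECT revision_id, revision_text FROM {table} "
        f"WHERE wiki_db = '{wiki_db}'"
    )

print(revisions_query())
```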
So far, though, the schema, data quality, and availability of this table have only been discussed within the Data Platform Team.
In this task we should discuss the following with other internal teams:
- Double check that the schema is sufficient to replace mediawiki_wikitext_history. Resolved: the schema is sufficient.
- What kind of data retention are they looking for? Is the ability to time travel over 90 days enough? Resolved: 90 days is good.
- Although the target is to ingest data into this table hourly, for a variety of reasons it can drift several hours behind MediaWiki production. What kind of data availability and data visibility do they need? Are T354761 and T357684 enough? Resolved: the proposed DQ checks look good.
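To make the drift question above concrete, here is a minimal sketch of the kind of freshness check a DQ job might run. The table name is real, but the function names, the 6-hour threshold, and the timestamps are assumptions for illustration; they are not the actual checks defined in T354761 or T357684.

```python
from datetime import datetime, timedelta, timezone

def freshness_lag(latest_ingested: datetime, now: datetime) -> timedelta:
    """How far the table (e.g. wmf_dumps.wikitext_raw) lags behind
    MediaWiki production, based on the newest ingested event time."""
    return now - latest_ingested

def is_within_sla(latest_ingested: datetime, now: datetime,
                  max_lag: timedelta = timedelta(hours=6)) -> bool:
    """True if hourly ingestion has not drifted beyond max_lag.
    The 6-hour default is an illustrative placeholder, not an agreed SLA."""
    return freshness_lag(latest_ingested, now) <= max_lag

# Example: a table last updated 3 hours ago passes a 6-hour threshold.
now = datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc)
print(is_within_sla(now - timedelta(hours=3), now))  # True
print(is_within_sla(now - timedelta(hours=9), now))  # False
```

A visibility dashboard could plot `freshness_lag` per wiki over time, which would answer the availability question for consumers at a glance.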