In T335860, we implemented a PySpark job that runs a MERGE INTO to transform event data into a table that will eventually hold all of the MediaWiki revision history.
This process currently ingests only recent events, and so we need a mechanism for backfilling it.
@Ottomata points to an existing spike effort to do this in Flink:
But now that we have a bit more experience with Iceberg and Spark MERGE INTO, I speculate that a simple Spark job reading from wmf.mediawiki_history could do this without much effort. The table docs suggest it has everything we need, so we could just reuse the same MERGE INTO pattern.
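As a quick sanity check of that claim, a first look in a notebook could be as small as the snippet below. This assumes a plain Spark session on an analytics client; the column names are from memory and should be double checked against the mediawiki_history table docs.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# What columns does mediawiki_history actually give us?
spark.sql("DESCRIBE wmf.mediawiki_history").show(100, truncate=False)

# Peek at revision-create events for a single wiki and snapshot.
# 'snapshot', 'event_entity', 'event_type', etc. are believed to be the
# real column names, but verify against the table docs.
spark.sql("""
    SELECT wiki_db, page_id, revision_id, event_timestamp, event_user_text
    FROM wmf.mediawiki_history
    WHERE snapshot = '2023-04'
      AND wiki_db = 'simplewiki'
      AND event_entity = 'revision'
      AND event_type = 'create'
    LIMIT 10
""").show(truncate=False)
```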
So in this spike we should:
- Play with wmf.mediawiki_history in a notebook and see whether it indeed has everything we need (the snippet above is a starting point).
- Prototype a MERGE INTO that ingests wmf.mediawiki_history into our hourly table (see the sketch after this list). Also: figure out a better name for the target table than 'hourly table'!
- Do a run and see how long it takes. (Backfill should only be run sporadically, but we should tune it anyway for when it is needed.)
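A minimal sketch of what the prototype MERGE INTO (plus a crude timing check) could look like, assuming the target keys on (wiki_db, revision_id); the target table name, the snapshot value, and the projected columns are placeholders and would need to match whatever the event-ingestion job actually writes:

```python
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mediawiki_history_backfill_spike").getOrCreate()

# Placeholders: the real target name is TBD (see the naming point above), and
# the projected columns must be aligned with the target schema so that
# UPDATE SET * / INSERT * resolve by name.
TARGET = "our_db.our_hourly_table"
SNAPSHOT = "2023-04"

merge_sql = f"""
    MERGE INTO {TARGET} AS target
    USING (
        SELECT
            wiki_db,
            page_id,
            revision_id,
            event_timestamp,
            event_user_id,
            event_user_text
        FROM wmf.mediawiki_history
        WHERE snapshot = '{SNAPSHOT}'
          AND event_entity = 'revision'
          AND event_type = 'create'
    ) AS source
    ON  target.wiki_db = source.wiki_db
    AND target.revision_id = source.revision_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
"""

# Crude timing for the "how long does a backfill run take" question.
start = time.time()
spark.sql(merge_sql)
print(f"Backfill MERGE took {time.time() - start:.0f}s")
```

Whether the MATCHED branch should overwrite rows already ingested from events, or the backfill should only insert missing ones, is one of the things to settle during the spike.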