The Enterprise HTML dumps are a very valuable resource for many research purposes (see T182351 for a more detailed explanation). While they are available locally as JSON files on the stat machines, parsing the whole dump is computationally expensive and takes a lot of time. Could we add the dumps to Hadoop to make bulk processing feasible? I am thinking of something similar to the wikitext_current dumps (T238858).
Implementation Steps:
- Write up an SLO (on Wikitech) for the ingestion job
- Set up a Data Engineering login for Enterprise access (ask Enterprise to turn the rate limits off)
- Design the Iceberg schema and deploy it (the schema for this version of the dumps can be simpler than the raw file structure, so we could possibly use a reduced one; see the DDL sketch after this list)
- Estimate how much space this will take up and review with the team (if it is too much, we can look at prioritizing specific wikis)
- Files can be downloaded using the Enterprise Snapshot API (we need a pre-step to get the available snapshots/projects; see the download sketch after this list)
- Build an Airflow job to read the files and load them into the Iceberg table (see the DAG skeleton after this list)
- Ensure that the job retains only the last 2 snapshots (see the pruning sketch after this list)
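As a starting point for the schema discussion, here is a minimal PySpark DDL sketch. The table name, columns, and partitioning are assumptions for illustration only, not a finalized design:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table name and columns; the real schema should be derived
# from the fields we actually want to keep out of the Enterprise JSON.
spark.sql("""
    CREATE TABLE IF NOT EXISTS wmf_dumps.enterprise_html (
        wiki_db        STRING    COMMENT 'project database name, e.g. enwiki',
        page_id        BIGINT,
        page_title     STRING,
        revision_id    BIGINT,
        date_modified  TIMESTAMP,
        html           STRING    COMMENT 'article body as Parsoid HTML',
        snapshot       STRING    COMMENT 'dump snapshot this row came from'
    )
    USING iceberg
    PARTITIONED BY (snapshot, wiki_db)
""")
```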
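For the download pre-step, a sketch using `requests`. The endpoint paths and response fields follow my reading of the Enterprise docs linked below and should be verified against the current API version; the snapshot identifier is hypothetical:

```python
import os
import requests

AUTH_URL = "https://auth.enterprise.wikimedia.com/v1/login"
API_BASE = "https://api.enterprise.wikimedia.com/v2"

# Credentials from the Data Engineering Enterprise account (see the login step above).
resp = requests.post(AUTH_URL, json={
    "username": os.environ["WME_USERNAME"],
    "password": os.environ["WME_PASSWORD"],
})
resp.raise_for_status()
headers = {"Authorization": f"Bearer {resp.json()['access_token']}"}

# Pre-step: list the available snapshots/projects.
listing = requests.get(f"{API_BASE}/snapshots", headers=headers)
listing.raise_for_status()
for s in listing.json():
    print(s["identifier"])

# Download one snapshot (a tar.gz of NDJSON files), streaming to disk.
ident = "enwiki_namespace_0"  # hypothetical identifier
with requests.get(f"{API_BASE}/snapshots/{ident}/download",
                  headers=headers, stream=True) as r:
    r.raise_for_status()
    with open(f"{ident}.tar.gz", "wb") as f:
        for chunk in r.iter_content(chunk_size=1 << 20):
            f.write(chunk)
```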
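A skeleton of the Airflow job, using plain `PythonOperator`s just to illustrate the task ordering; the real DAG would follow Data Engineering's Airflow conventions, and the DAG id and schedule here are assumptions:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def list_snapshots():
    """Pre-step: query the Snapshot API for available snapshots (see the sketch above)."""
    ...


def download_snapshots():
    """Stream the snapshot tarballs to a staging location, e.g. HDFS."""
    ...


def load_to_iceberg():
    """Parse the NDJSON files and write them into the Iceberg table, e.g.
    spark.read.json(staging_path).writeTo('wmf_dumps.enterprise_html')."""
    ...


with DAG(
    dag_id="enterprise_html_dumps_ingest",  # hypothetical name
    schedule_interval="@monthly",           # assumption: roughly the dump cadence
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    (
        PythonOperator(task_id="list_snapshots", python_callable=list_snapshots)
        >> PythonOperator(task_id="download_snapshots", python_callable=download_snapshots)
        >> PythonOperator(task_id="load_to_iceberg", python_callable=load_to_iceberg)
    )
```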
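For retention, one option is to delete every partition older than the two most recent dump snapshots. Note that "snapshot" here means a dump run (our partition column), not an Iceberg table snapshot; to actually reclaim disk space we would also need to expire Iceberg's own table snapshots afterwards:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
TABLE = "wmf_dumps.enterprise_html"  # hypothetical, matching the DDL sketch above

# Find the two most recent dump snapshots, then drop everything older.
keep = [
    r.snapshot
    for r in spark.sql(
        f"SELECT DISTINCT snapshot FROM {TABLE} ORDER BY snapshot DESC LIMIT 2"
    ).collect()
]
if len(keep) == 2:
    spark.sql(f"DELETE FROM {TABLE} WHERE snapshot < '{min(keep)}'")

# To reclaim the deleted data files, expire old Iceberg table snapshots,
# e.g. via the expire_snapshots procedure:
# spark.sql(f"CALL spark_catalog.system.expire_snapshots(table => '{TABLE}')")
```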
Useful links:
- Docs to get Enterprise keys -> https://enterprise.wikimedia.com/docs/#getting-api-keys
- Docs to download Snapshot files -> https://enterprise.wikimedia.com/docs/snapshot/