The [[ https://dumps.wikimedia.org/other/enterprise_html/ | Enterprise HTML dumps ]] are a very valuable resource for many research purposes (see T182351 for a more detailed explanation). While they are available locally as JSON files on the stat machines, parsing the whole dump is computationally expensive and takes a long time. Could we add the dumps to Hadoop to make bulk processing feasible? I am thinking of something similar to the [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Content/Mediawiki_wikitext_current | wikitext_current ]] table (T238858).
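For context, a minimal sketch (PySpark) of the kind of bulk query this would enable once the dump is queryable as a table, analogous to how `wikitext_current` is used today. The database/table name (`wmf_dumps.enterprise_html_current`) and the column names are placeholders, not a decided schema:

```lang=python
# Minimal sketch of the kind of bulk processing this would enable once the
# dump is queryable as a table; table and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("enterprise_html_example").getOrCreate()

# e.g. count how many enwiki articles mention "infobox" in their HTML
df = (spark.table("wmf_dumps.enterprise_html_current")
      .filter(F.col("wiki_db") == "enwiki")
      .filter(F.col("html").contains("infobox")))
print(df.count())
```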
**Implementation Steps:**
[] Write up an SLO (on Wikitech) for the ingestion job
[] Enterprise plans to chunk the files in Q4 (check whether this changes our ingestion process)
[] Design the Iceberg schema and deploy it (see the schema sketch after this list)
[] Build an Airflow job to read the dump files and load them into the Iceberg table (see the loader sketch below)
[] Ensure that the job retains only the last 2 snapshots (see the snapshot-expiration sketch below)
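For the schema-design step, a rough sketch of what the Iceberg table DDL could look like, expressed through Spark SQL. The catalog/table name, columns, and partitioning are assumptions to be settled during design, not a final proposal:

```lang=python
# Rough sketch of a possible Iceberg schema; names, columns and partitioning
# are placeholders to be decided during schema design.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("enterprise_html_ddl").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS wmf_dumps.enterprise_html_current (
        wiki_db      STRING COMMENT 'Wiki the page belongs to, e.g. enwiki',
        page_id      BIGINT COMMENT 'MediaWiki page id',
        page_title   STRING COMMENT 'Page title',
        revision_id  BIGINT COMMENT 'Revision the HTML was rendered from',
        html         STRING COMMENT 'Parsoid HTML of the article body',
        snapshot     STRING COMMENT 'Dump snapshot the row was loaded from, e.g. 2023-06'
    )
    USING iceberg
    PARTITIONED BY (snapshot, wiki_db)
""")
```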
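For the loading step, a minimal sketch of the Spark load an Airflow DAG could submit (e.g. via a spark-submit task), assuming the dump files are newline-delimited JSON. The input path, the field names pulled out of the dump records, and the target table are assumptions for illustration and would need to be verified against the actual dump schema:

```lang=python
# Minimal sketch of the load step, assuming newline-delimited JSON input.
# Paths, field names and the target table are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("enterprise_html_load").getOrCreate()

snapshot = "2023-06"  # in the real job this would come from the Airflow run date
raw = spark.read.json(f"/wmf/data/raw/enterprise_html/{snapshot}/*.ndjson")

(raw
    # Field names below follow my reading of the Enterprise dump records and
    # should be checked against the actual files.
    .select(
        F.col("is_part_of.identifier").alias("wiki_db"),
        F.col("identifier").alias("page_id"),
        F.col("name").alias("page_title"),
        F.col("version.identifier").alias("revision_id"),
        F.col("article_body.html").alias("html"),
    )
    .withColumn("snapshot", F.lit(snapshot))
    .writeTo("wmf_dumps.enterprise_html_current")
    .append())
```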
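For the retention step, assuming "snapshots" here means Iceberg table snapshots (rather than monthly dump partitions), a minimal sketch using Iceberg's `expire_snapshots` stored procedure from Spark; catalog and table names are placeholders:

```lang=python
# Minimal sketch: expire everything older than "now", but always retain the
# two most recent Iceberg snapshots. Catalog/table names are placeholders.
from datetime import datetime, timezone

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("enterprise_html_expire").getOrCreate()

cutoff = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
spark.sql(f"""
    CALL spark_catalog.system.expire_snapshots(
        table => 'wmf_dumps.enterprise_html_current',
        older_than => TIMESTAMP '{cutoff}',
        retain_last => 2
    )
""")
```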