The [[ https://dumps.wikimedia.org/other/enterprise_html/ | Enterprise HTML dumps ]] are a very valuable resource for many research purposes (see T182351 for a more detailed explanation). While they are available locally as JSON files on the stat machines, parsing the whole dump is computationally expensive and takes a long time. Could we add the dumps to Hadoop to make bulk processing feasible? I am thinking of something similar to the [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Content/Mediawiki_wikitext_current | wikitext_current ]] table (T238858).
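For context, a minimal sketch (PySpark) of the kind of bulk query this would enable once the dump is queryable as a table, analogous to how `wikitext_current` is used today. The database/table name (`wmf_dumps.enterprise_html_current`) and the column names are placeholders, not a decided schema:

```lang=python
# Minimal sketch of the kind of bulk processing this would enable once the
# dump is queryable as a table; table and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("enterprise_html_example").getOrCreate()

# e.g. count how many enwiki articles mention "infobox" in their HTML
df = (spark.table("wmf_dumps.enterprise_html_current")
      .filter(F.col("wiki_db") == "enwiki")
      .filter(F.col("html").contains("infobox")))
print(df.count())
```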
**Implementation Steps:**
[] Write up an SLO (on Wikitech) for the ingestion job
[] Enterprise plans to chunk the files in Q4 (check whether this changes our ingestion process)
[] Design the Iceberg schema and deploy it (see the schema sketch after this list)
[] Build an Airflow job to read the dump files and load them into the Iceberg table (see the loader sketch below)
[] Ensure that the job retains only the last 2 snapshots (see the snapshot-expiration sketch below)
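For the schema-design step, a rough sketch of what the Iceberg table DDL could look like, expressed through Spark SQL. The catalog/table name, columns, and partitioning are assumptions to be settled during design, not a final proposal:

```lang=python
# Rough sketch of a possible Iceberg schema; names, columns and partitioning
# are placeholders to be decided during schema design.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("enterprise_html_ddl").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS wmf_dumps.enterprise_html_current (
        wiki_db      STRING COMMENT 'Wiki the page belongs to, e.g. enwiki',
        page_id      BIGINT COMMENT 'MediaWiki page id',
        page_title   STRING COMMENT 'Page title',
        revision_id  BIGINT COMMENT 'Revision the HTML was rendered from',
        html         STRING COMMENT 'Parsoid HTML of the article body',
        snapshot     STRING COMMENT 'Dump snapshot the row was loaded from, e.g. 2023-06'
    )
    USING iceberg
    PARTITIONED BY (snapshot, wiki_db)
""")
```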
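For the loading step, a minimal sketch of the Spark load an Airflow DAG could submit (e.g. via a spark-submit task), assuming the dump files are newline-delimited JSON. The input path, the field names pulled out of the dump records, and the target table are assumptions for illustration and would need to be verified against the actual dump schema:

```lang=python
# Minimal sketch of the load step, assuming newline-delimited JSON input.
# Paths, field names and the target table are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("enterprise_html_load").getOrCreate()

snapshot = "2023-06"  # in the real job this would come from the Airflow run date
raw = spark.read.json(f"/wmf/data/raw/enterprise_html/{snapshot}/*.ndjson")

(raw
    # Field names below follow my reading of the Enterprise dump records and
    # should be checked against the actual files.
    .select(
        F.col("is_part_of.identifier").alias("wiki_db"),
        F.col("identifier").alias("page_id"),
        F.col("name").alias("page_title"),
        F.col("version.identifier").alias("revision_id"),
        F.col("article_body.html").alias("html"),
    )
    .withColumn("snapshot", F.lit(snapshot))
    .writeTo("wmf_dumps.enterprise_html_current")
    .append())
```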
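For the retention step, assuming "snapshots" here means Iceberg table snapshots (rather than monthly dump partitions), a minimal sketch using Iceberg's `expire_snapshots` stored procedure from Spark; catalog and table names are placeholders:

```lang=python
# Minimal sketch: expire everything older than "now", but always retain the
# two most recent Iceberg snapshots. Catalog/table names are placeholders.
from datetime import datetime, timezone

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("enterprise_html_expire").getOrCreate()

cutoff = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
spark.sql(f"""
    CALL spark_catalog.system.expire_snapshots(
        table => 'wmf_dumps.enterprise_html_current',
        older_than => TIMESTAMP '{cutoff}',
        retain_last => 2
    )
""")
```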