The mediawiki_wikitext_history job (Wikitech page, Airflow DAG, Scala code) frequently has a long gap between the end of the wait_for_pages_meta_history_xml_dump stage and the beginning of the convert_history_xml_to_parquet stage. This gap accounts for virtually all of the volatility in total job duration.
Gap size by run:
- 2024-01-01: none
- 2023-12-01: 2 d
- 2023-11-01: 4 d
- 2023-10-01: none
- 2023-09-01: 1 d
- 2023-08-01: none
- 2023-07-01: none
- 2023-06-01: 4 d
- 2023-05-01: none
- 2023-04-01: 1 d
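
For a quick sense of scale, the figures above can be tallied with a short script. This is only a sketch: the dictionary name and the `summarize` helper are made up for illustration, and "none" is assumed to mean a gap of 0 days.

```python
# Gap in days between the end of wait_for_pages_meta_history_xml_dump and
# the start of convert_history_xml_to_parquet, copied from the list above.
# Assumption: "none" is recorded as 0 days.
gap_days_by_run = {
    "2024-01-01": 0,
    "2023-12-01": 2,
    "2023-11-01": 4,
    "2023-10-01": 0,
    "2023-09-01": 1,
    "2023-08-01": 0,
    "2023-07-01": 0,
    "2023-06-01": 4,
    "2023-05-01": 0,
    "2023-04-01": 1,
}


def summarize(gaps):
    """Return simple summary statistics for a {run_date: gap_days} mapping."""
    total = sum(gaps.values())
    return {
        "runs": len(gaps),
        "delayed_runs": sum(1 for days in gaps.values() if days > 0),
        "total_gap_days": total,
        "mean_gap_days": total / len(gaps),
        "max_gap_days": max(gaps.values()),
    }


summary = summarize(gap_days_by_run)
print(summary)
```

By this tally, 5 of the 10 listed runs had a gap, adding up to 12 days of delay in total.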
As far as I know, this shouldn't happen. Eliminating the gap would significantly speed up delivery of the movement metrics, which is the focus of the annual plan hypothesis SDS 2.6.2.