The mediawiki_wikitext_history job ([Wikitech page](https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Content/Mediawiki_wikitext_history), [Airflow DAG](https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/analytics/dags/mediawiki/wikitext/mediawiki_wikitext_history_dag.py), [Scala code](https://github.com/wikimedia/analytics-refinery-source/blob/ac1e23931bb467cea184c3ec1901db64a798d56f/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/MediawikiXMLDumpsConverter.scala)) frequently has a long gap between the //end// of the `wait_for_pages_meta_history_xml_dump` stage and the //beginning// of the `convert_history_xml_to_parquet` stage. This gap accounts for virtually all of the volatility in total job duration.
Gap size by run:
- 2024-01-01: none
- 2023-12-01: 2 d
- 2023-11-01: 4 d
- 2023-10-01: none
- 2023-09-01: 1 d
- 2023-08-01: none
- 2023-07-01: none
- 2023-06-01: 4 d
- 2023-05-01: none
- 2023-04-01: 1 d
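
To put rough numbers on the list above, here is a minimal Python sketch that summarizes the gaps. The values are copied from the list; treating "none" as 0 days is an assumption, and the day counts are only as precise as the rounded figures given there.

```python
from statistics import mean

# Gap in days between the end of wait_for_pages_meta_history_xml_dump and the
# start of convert_history_xml_to_parquet, copied from the list above.
# Assumption: "none" is recorded as 0 days.
gap_days = {
    "2024-01-01": 0,
    "2023-12-01": 2,
    "2023-11-01": 4,
    "2023-10-01": 0,
    "2023-09-01": 1,
    "2023-08-01": 0,
    "2023-07-01": 0,
    "2023-06-01": 4,
    "2023-05-01": 0,
    "2023-04-01": 1,
}

# Runs where the second stage did not start right after the first one ended.
runs_with_gap = [run for run, days in gap_days.items() if days > 0]

print(f"runs with a gap: {len(runs_with_gap)} of {len(gap_days)}")
print(f"mean gap per run: {mean(gap_days.values()):.1f} d")
print(f"worst gap: {max(gap_days.values())} d")
```

For the ten runs listed, this reports a gap in 5 of 10 runs, with a mean of 1.2 days per run and a worst case of 4 days.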
As far as I know, this gap shouldn't happen. Eliminating it would significantly speed up delivery of the [movement metrics](https://meta.wikimedia.org/wiki/Movement_Insights/Movement_metrics), which is the focus of the annual plan hypothesis [SDS 2.6.2](https://docs.google.com/document/d/1iTgL8V7FNb1VG_mWp6U_2SmWN4F2BEm789BSvvYq3PY/edit).