As part of SDS 2.6.2, I've been investigating the data dependencies of the movement metrics. Our critical path takes around 25 days and goes:
- XML dumps generation
- loading XML dumps to HDFS (Python script, template for running script, Puppet management of SystemD timers running script)
- mediawiki_wikitext_history
- research_article_quality (Airflow DAG, code)
- knowledge_gaps (Airflow DAG, code)
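For concreteness, the downstream DAGs in that chain can't start until the mediawiki_wikitext_history snapshot (built from the XML dumps) lands in Hive. Below is a minimal sketch of what that gating looks like, assuming a stock Airflow NamedHivePartitionSensor; the real DAGs may use different sensor wrappers, and the DAG id, partition name, and schedule here are illustrative only.

```python
# Sketch only: how a downstream DAG might block on the monthly
# mediawiki_wikitext_history snapshot built from the XML dumps.
# DAG id, partition name, schedule, and task layout are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.providers.apache.hive.sensors.named_hive_partition import (
    NamedHivePartitionSensor,
)

with DAG(
    dag_id="research_article_quality_sketch",
    start_date=datetime(2023, 1, 1),
    schedule="@monthly",
    catchup=False,
):
    # Wait (possibly for weeks) until the snapshot partition exists in Hive.
    wait_for_wikitext_history = NamedHivePartitionSensor(
        task_id="wait_for_mediawiki_wikitext_history",
        partition_names=[
            "wmf.mediawiki_wikitext_history/"
            "snapshot={{ data_interval_start.strftime('%Y-%m') }}"
        ],
        poke_interval=60 * 60,      # re-check hourly
        mode="reschedule",          # free the worker slot between checks
        timeout=60 * 60 * 24 * 30,  # give up after roughly a month
    )

    compute_article_quality = EmptyOperator(task_id="compute_article_quality")

    wait_for_wikitext_history >> compute_article_quality
```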
By far the longest portion (~19 days) is waiting for the XML dumps to be generated, but after the first ~7 days (when the English Wikipedia dump arrives), the remaining ~12 days are spent waiting only on the Wikidata dump. I doubt anyone is regularly using the Wikidata XML dump, since wmf.wikidata_entity (which comes from the JSON dump) is much better and faster to work with. The XML dump is apparently the only one that contains non-current revisions, but needing those is probably a very rare case.
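To illustrate the usual alternative: anyone who needs current entity data can read it straight from wmf.wikidata_entity in Spark. A quick sketch; the snapshot value and the labels/claims column names are from memory and worth double-checking.

```python
# Sketch: read current Wikidata entities from the Hive table built from the
# JSON dump rather than parsing the Wikidata XML dump. The snapshot value
# and the labels/claims column names are assumptions to verify.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("wikidata-entity-example").getOrCreate()

entities = (
    spark.table("wmf.wikidata_entity")
    .where(F.col("snapshot") == "2024-05-06")  # weekly JSON-dump snapshot
    .select("id", "labels", "claims")
)

# For example: count items that have an English label.
english_labelled = entities.where(F.col("labels").getItem("en").isNotNull())
print(english_labelled.count())
```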
Can we skip loading the Wikidata XML dump altogether? Other strategies, like splitting it out into a separate job, would also work, but simply skipping it would be much easier and, since no one appears to be using the data, likely safe.
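If we do go the skipping route, the change could be as small as excluding wikidatawiki from whatever wiki list the loader script iterates over. A hypothetical sketch; the function and variable names below are invented, not the real script's.

```python
# Hypothetical sketch of the "just skip it" option. The real loader script
# is structured differently; names here are made up for illustration.
SKIP_WIKIS = {"wikidatawiki"}

def wikis_to_load(all_wikis):
    """Return the dump wikis whose XML should still be copied to HDFS."""
    return [wiki for wiki in all_wikis if wiki not in SKIP_WIKIS]

# e.g. wikis_to_load(["enwiki", "wikidatawiki", "dewiki"]) -> ["enwiki", "dewiki"]
```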