As discussed during the 2018 offsite, it would be very useful to have all the XML dumps systematically uploaded to HDFS.
There are a variety of use-cases for the result of this task. We list a few of them below:
* Measurement of AfC improvements (T192515)
* Short regular-expression scans that take hours when streaming the XML dumps could be done in minutes. This benefits a variety of research projects, including [[https://meta.wikimedia.org/wiki/Research:Identification_of_Unsourced_Statements | identifying unsourced statements ]] (T186279); see the sketch after this list.
* Section recommendation (T171224): here we need to parse section titles across different languages; this takes days with the current tools and might be done in a few hours using Spark.
* Create a Historical Link Graph for Wikipedia (T186558): we want to keep an updated version of the historical link graph across different languages as a complement to the Clickstream dataset. Again, doing this in Spark might reduce the parsing time from days to hours.
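
As a rough illustration of the regex use-case above, here is a minimal PySpark sketch. It assumes the dumps have already been imported into HDFS as a Parquet table with `page_title` and `revision_text` columns; the path and schema are hypothetical placeholders, not an existing dataset.

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dump-regex-scan").getOrCreate()

# Hypothetical path and schema: assumes the XML dumps have been converted
# into a Parquet table with `page_title` and `revision_text` columns.
dumps = spark.read.parquet("hdfs:///path/to/xmldumps/enwiki")

# Example pass: find pages whose wikitext contains a {{citation needed}}
# template -- the kind of regex scan that takes hours when reading the
# compressed XML serially, but parallelizes trivially on the cluster.
flagged = dumps.filter(
    F.col("revision_text").rlike(r"(?i)\{\{\s*citation needed")
).select("page_title")

print(flagged.count())
```

The same pattern (filter or extract with a regex, then aggregate) would cover the section-title parsing and link-extraction cases as well, with only the regular expression and grouping columns changing.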