As discussed during the 2018 offsite, it would be very useful to have the [data dumps](https://meta.wikimedia.org/wiki/Data_dumps), which contain all past and present versions of Wikimedia project content, systematically uploaded to the [Analytics Data Lake](https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake) so they can be queried using Hive, Spark, and other Hadoop ecosystem tools.
There are a variety of use-cases for the result of this task. We list a few of them below:
* Measurement of AfC improvements (T192515)
* Regular-expression searches that take hours when parsing the dumps locally could run in minutes (see the first sketch after this list). This has use-cases for a variety of research projects, including [identifying unsourced statements](https://meta.wikimedia.org/wiki/Research:Identification_of_Unsourced_Statements) (T186279).
* Section recommendation (T171224): here we need to parse section titles across different languages, which takes days with the current tools and might be done in a few hours using Spark.
* Create a Historical Link Graph for Wikipedia (T186558): we want to maintain an updated version of the historical link graph across different languages as a complement to the Clickstream dataset. Again, doing this in Spark might reduce the parsing time from days to hours.
* Parse the XML dumps with the [mwparserfromhell library](https://mwparserfromhell.readthedocs.io/en/latest/) to determine infobox and language usage of files on Commons (part 1 of T177358; [results and code](https://github.com/wikimedia-research/SDoC-Initial-Metrics/tree/master/T177358-1)). See the second sketch after this list.
* Calculating historical article counts (part of T194562): to use the [standard definition of an article](https://www.mediawiki.org/wiki/Manual:Article_count), you have to know which pages contained at least one wiki link at any given time in the past.
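To make the regex use-case above concrete, here is a minimal sketch of what such a job could look like once the dumps are queryable from the Data Lake. The table and column names below are assumptions for illustration only; defining the actual schema is part of this task.

```python
# Hypothetical sketch: assumes the dumps are exposed as a Hive table with one
# row per revision and a column holding the raw wikitext. Table and column
# names are placeholders, not an existing schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dump-regex-scan").getOrCreate()

revisions = spark.table("wmf.mediawiki_wikitext_history")  # assumed table name

# Example: find all English Wikipedia revisions containing a
# {{citation needed}} template -- the kind of scan that takes hours locally.
matches = (
    revisions
    .filter(F.col("wiki_db") == "enwiki")                                 # assumed column
    .filter(F.col("revision_text").rlike(r"\{\{\s*[Cc]itation needed"))   # assumed column
    .select("page_id", "revision_id")
)

matches.write.mode("overwrite").parquet("/tmp/citation_needed_revisions")
```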
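For the per-revision parsing steps (section titles, infobox templates, and the wiki-link check behind the article-count definition), mwparserfromhell can do the heavy lifting. The sketch below shows only that step; the helper function is illustrative, and the wikitext would come from the dumps in the Data Lake (e.g. via a Spark UDF).

```python
# Minimal sketch of the per-revision parsing step with mwparserfromhell.
import mwparserfromhell

def parse_revision(wikitext):
    """Extract section titles, template names, and the wiki-link count
    from the wikitext of a single revision (illustrative helper)."""
    wikicode = mwparserfromhell.parse(wikitext)
    sections = [str(h.title).strip() for h in wikicode.filter_headings()]
    templates = [str(t.name).strip() for t in wikicode.filter_templates()]
    # The standard article-count definition requires at least one wiki link.
    link_count = len(wikicode.filter_wikilinks())
    return sections, templates, link_count

sample = "== History ==\nSee [[Main Page]].\n{{Information|description=A photo}}"
print(parse_revision(sample))
# (['History'], ['Information'], 1)
```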