Change Details

As discussed during the 2018 offsite, it would be very useful to have all the XML dumps systematically uploaded to HDFS.

**Use-cases**

There are a variety of use-cases for the result of this task. We list a few of them below:

* Measurement of AfC improvements (T192515)
* Short regular expressions that can take hours to run over the XML dumps could be run in minutes once the dumps are on HDFS. This has use-cases for a variety of research projects, including [[https://meta.wikimedia.org/wiki/Research:Identification_of_Unsourced_Statements | identifying unsourced statements ]] (T186279).
* Section recommendation (T171224): Here we need to parse section titles across different languages. This takes days with the current tools and might be done in a few hours using Spark.
* Create a Historical Link Graph for Wikipedia (T186558): We want to keep an updated version of the historical link graph across different languages, as a complement to the Clickstream dataset. Again, doing this with Spark might reduce the parsing time from days to hours.
* Parse the XML dumps with the [mwparserfromhell library](https://mwparserfromhell.readthedocs.io/en/latest/) to figure out information box and language usage of files on Commons (Part 1 of T177358; [Results and code](https://github.com/wikimedia-research/SDoC-Initial-Metrics/tree/master/T177358-1)).
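
To make the regex and section-title use-cases above more concrete, here is a minimal sketch of the per-page work involved, using only the Python standard library. It is an illustration under stated assumptions, not the tooling used in any of the tasks listed: the sample dump is a tiny made-up stand-in (real dumps declare an XML namespace and are far larger), and `section_titles`, `iter_pages`, and `count_section_titles` are hypothetical helper names. In a Spark job, a function like `section_titles` would presumably be mapped over pages in parallel rather than run in a single loop.

```python
import io
import re
import xml.etree.ElementTree as ET

# Made-up stand-in for a MediaWiki XML dump (assumption: no XML namespace).
SAMPLE_DUMP = """<mediawiki>
  <page>
    <title>Example A</title>
    <revision><text>Intro.

== History ==
Some text.

== References ==
</text></revision>
  </page>
  <page>
    <title>Example B</title>
    <revision><text>== History ==
More text.</text></revision>
  </page>
</mediawiki>"""

# Matches level-2 headings such as "== History ==", but not "=== Sub ===".
HEADING_RE = re.compile(r"^==[ \t]*([^=\n].*?)[ \t]*==[ \t]*$", re.MULTILINE)


def section_titles(wikitext):
    """Return the level-2 section titles found in one page's wikitext."""
    return HEADING_RE.findall(wikitext or "")


def iter_pages(stream):
    """Stream (title, wikitext) pairs without loading the whole dump."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "page":
            yield elem.findtext("title"), elem.findtext("revision/text")
            elem.clear()  # free the finished page's subtree


def count_section_titles(stream):
    """Count how often each section title occurs across all pages."""
    counts = {}
    for _title, text in iter_pages(stream):
        for heading in section_titles(text):
            counts[heading] = counts.get(heading, 0) + 1
    return counts


counts = count_section_titles(io.BytesIO(SAMPLE_DUMP.encode("utf-8")))
print(counts)
```

The per-page function is kept separate from the dump iteration so that the same logic could be reused unchanged if the pages were instead read from HDFS by a distributed job.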