As discussed during the 2018 offsite, it would be very useful to have the data dumps, which contain all past and present versions of Wikimedia project content, systematically uploaded to the Analytics Data Lake so they can be queried using Hive, Spark, and other Hadoop ecosystem tools.
Use-cases
There are a variety of use-cases for the result of this task. We list a few of them below:
- Measurement of AfC improvements (T192515)
- Short regular-expression searches that take hours when parsing the dumps locally could be done in minutes. This has applications for a variety of research projects, including identifying unsourced statements (T186279).
- Section recommendation (T171224): This requires parsing section titles across different languages, which takes days with the current tools and might be done in a few hours using Spark.
- Create a historical link graph for Wikipedia (T186558): We want to maintain an updated version of the historical link graph across different languages, as a complement to the Clickstream dataset. Again, doing this in Spark might reduce the parsing time from days to hours.
- Parse the XML dumps with the mwparserfromhell library to figure out infobox and language usage of files on Commons (part 1 of T177358; Results and code).
- Calculating historical article counts (part of T194562): to apply the standard definition of an article, you have to know which pages contained at least one wikilink at any given time in the past.
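Once the dumps are in the Data Lake, most of the use-cases above reduce to per-revision text parsing that Spark can distribute across the cluster. As a minimal sketch of the wikilink-related cases (T186558, T194562), the snippet below extracts link targets from raw wikitext with a deliberately simplified regex (it ignores edge cases such as nested links and File:/Category: prefixes); the regex and function names are illustrative, not taken from any of the linked tasks:

```python
import re

# Simplified wikilink pattern: [[Target]], [[Target|label]], [[Target#Section|label]].
# Captures only the page title, dropping section anchors and display labels.
WIKILINK_RE = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")

def extract_wikilinks(wikitext):
    """Return the list of link targets found in a revision's wikitext."""
    return [target.strip() for target in WIKILINK_RE.findall(wikitext)]

def is_article(wikitext):
    """Standard article definition (T194562): contains at least one wikilink."""
    return bool(WIKILINK_RE.search(wikitext))

text = "'''Foo''' is a [[bar (science)|bar]] related to [[Baz#History|baz]]."
print(extract_wikilinks(text))  # ['bar (science)', 'Baz']
print(is_article("plain text, no links"))  # False
```

On the cluster, a function like `extract_wikilinks` could be registered as a Spark UDF and applied to the revision-text column of the parsed dumps, turning a days-long local scan into a distributed job.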