Brief summary
The Abstract Wikipedia initiative will make it possible to generate Wikipedia articles with a combination of community authored programming functions on a "wiki of functions" and the data and lexicographic (dictionary, grammar, etc.) knowledge on Wikidata.
Today the way community authored programming functions are used on different language editions of Wikipedia involves a lot of copying and pasting. If someone wants to calculate the age of someone for a biography in their native language, they may need to first go to English Wikipedia for example, find the community authored programming function that calculates ages in English, then copy and paste it to their non-English Wikipedia. This process is error prone, can lead to code duplication, and worse, improvements to functions on one language edition may not ever make their way to other language editions.
Wouldn't it be easier if all of these functions were instead available centrally and people didn't have to go through this manual process?
This Outreachy task is about an important first step: finding the different community authored functions that are out there and helping to prioritize which ones would be good candidates for centralizing for Abstract Wikipedia and its centralized wiki of functions.
Skills required
We think the successful Outreachy intern will know Python or potentially Scala or R.
The project
You will write code to:
- Fetch the different community authored functions (also known as "modules" and "templates") on the wikis, and determine how many articles use each community authored function and how many pageviews those articles receive.
- Analyze the similarity between community authored functions hosted across different projects (i.e., looking for redundant or very similar code). This likely requires wiring up some open source packages, meaning the Python (or Scala or R) code will probably need to import libraries that are good at measuring code similarity. There are several potential approaches, and this is part of the fun and the challenge.
- If there's enough time, determine whether there are segments of code that can be turned into pure functions in the wiki of functions, which is an interesting problem domain in computer science. This would likely require wiring up a Lua programming language interpreter (probably also using MediaWiki, the wiki software Wikimedia maintains) and some degree of manual spot checking to verify that candidates for pure functions are correctly identified.
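To make the similarity step concrete, here is a minimal sketch of one possible approach, using only Python's standard library `difflib`. It is not the project's chosen method. The helper names (`normalize_lua`, `similarity`, `similar_pairs`), the 0.8 threshold, and the `wiki:Title` key format are all assumptions made for illustration. The comment stripping is deliberately naive and would mishandle `--` inside string literals or Lua long-bracket comments such as `--[==[ ... ]==]`.

```python
import difflib
import re
from itertools import combinations
from typing import Dict, List, Tuple


def normalize_lua(source: str) -> str:
    """Drop Lua comments and collapse whitespace (naive, regex-based)."""
    source = re.sub(r"--\[\[.*?\]\]", "", source, flags=re.DOTALL)  # block comments
    source = re.sub(r"--[^\n]*", "", source)  # line comments
    return " ".join(source.split())


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means identical after normalization."""
    matcher = difflib.SequenceMatcher(
        None, normalize_lua(a), normalize_lua(b), autojunk=False
    )
    return matcher.ratio()


def similar_pairs(
    modules: Dict[str, str], threshold: float = 0.8
) -> List[Tuple[str, str, float]]:
    """Compare every pair of modules and keep those at or above the threshold."""
    pairs = []
    for (name_a, src_a), (name_b, src_b) in combinations(sorted(modules.items()), 2):
        score = similarity(src_a, src_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)
```

Comparing every pair this way grows quadratically with the number of modules, so it is only practical for small samples. A survey across all wikis would probably need a cheaper first pass (for example hashing or token shingling) before any detailed comparison.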
During this project, you get to write open source code, and you'll publish your methodology and a report that will become a subpage of the Abstract Wikipedia project page, which will be shared with community volunteers to aid in prioritization of things to turn into functions. The report can also potentially form the basis for a publishable research paper.
Possible mentor(s)
@DVrandecic, founder of Wikidata and project lead of Abstract Wikipedia at the Wikimedia Foundation
@dr0ptp4kt, Engineering Director at the Wikimedia Foundation
Microtask
To show your interest, we encourage you to try to solve this problem:
- Write a script that fetches all of the source code on English Wikipedia in the "Module:" namespace. Hint: we have APIs that will make your life easier and you can find good examples of how to call the APIs by using your favorite search engine or looking at some of the tools on Toolforge. Please try to limit the number of page or API fetches to one per second. We also have dumps with the same content that can be analyzed offline.
- Generate a summary report that includes interesting statistics like number of modules, a histogram of file sizes, and so on.
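As a starting point for the microtask, the following sketch shows one way the two steps could fit together. It assumes the MediaWiki Action API with the `allpages` generator and namespace ID 828 for "Module:". The `USER_AGENT` string, the function names, the `max_requests` cap, and the 1 KiB histogram buckets are placeholders invented for this example, not part of the task description. The network call is passed in as a parameter so the paging logic can be tried offline.

```python
import json
import time
import urllib.parse
import urllib.request
from collections import Counter
from typing import Callable, Dict, Optional

API_URL = "https://en.wikipedia.org/w/api.php"
# Hypothetical User-Agent; replace with your own tool name and contact details.
USER_AGENT = "module-survey-sketch/0.1 (replace-with-your-contact)"
MODULE_NAMESPACE = 828  # numeric ID of the "Module:" namespace


def get_json(params: Dict[str, str]) -> dict:
    """Perform one GET request against the Action API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def fetch_modules(
    fetch: Callable[[Dict[str, str]], dict] = get_json,
    delay: float = 1.0,
    max_requests: Optional[int] = None,
) -> Dict[str, str]:
    """Return {title: source} for Module: pages, following API continuation."""
    params = {
        "action": "query", "format": "json", "formatversion": "2",
        "generator": "allpages", "gapnamespace": str(MODULE_NAMESPACE),
        "gaplimit": "50", "prop": "revisions", "rvprop": "content",
        "rvslots": "main",
    }
    modules: Dict[str, str] = {}
    continuation: Dict[str, str] = {}
    requests_made = 0
    while max_requests is None or requests_made < max_requests:
        data = fetch({**params, **continuation})
        requests_made += 1
        for page in data.get("query", {}).get("pages", []):
            # A page can arrive without revisions when a batch is split.
            for revision in page.get("revisions", []):
                content = revision.get("slots", {}).get("main", {}).get("content")
                if content is not None:
                    modules[page["title"]] = content
        if "continue" not in data:
            break
        continuation = data["continue"]
        time.sleep(delay)  # pause between requests (default: one second)
    return modules


def summarize(modules: Dict[str, str], bucket_bytes: int = 1024) -> dict:
    """Count modules and bucket their UTF-8 sizes into a simple histogram."""
    sizes = [len(source.encode("utf-8")) for source in modules.values()]
    histogram = Counter((size // bucket_bytes) * bucket_bytes for size in sizes)
    return {
        "module_count": len(sizes),
        "total_bytes": sum(sizes),
        "histogram": dict(sorted(histogram.items())),
    }
```

Calling `summarize(fetch_modules(max_requests=3))` would fetch only a few batches, which is a gentle way to try the script before a full run. For a complete survey, the dumps may be the kinder option for the servers.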