The Wikimedia Foundation runs a service called ORES that hosts machine learning models that make predictions about various forms of content on Wikipedia -- e.g., the likelihood that a given edit is vandalism or the quality of a Wikipedia article. One of the newer models labels Wikipedia articles with a set of pre-defined topics -- e.g., the English Wikipedia article for sci-fi author N. K. Jemisin is predicted to be part of the following topics:
- Culture.Biography.Biography* (the article is a biography)
- Culture.Biography.Women (the biography is about a woman)
- Culture.Literature (she's an author).
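As a rough illustration of how topic predictions like these might be read from ORES, here is a minimal sketch. The URL pattern and the nested response layout are assumptions about the ORES v3 scoring API and should be checked against its documentation; the helper names `build_score_url` and `topics_above_threshold` are hypothetical.

```python
# Sketch only: the URL pattern and response layout below are assumptions
# about the ORES v3 scoring API, not a confirmed contract.
ORES_URL = "https://ores.wikimedia.org/v3/scores/{wiki}/{rev_id}/{model}"


def build_score_url(wiki, rev_id, model="articletopic"):
    """Build the URL to request one model's score for one revision."""
    return ORES_URL.format(wiki=wiki, rev_id=rev_id, model=model)


def topics_above_threshold(response, wiki, rev_id,
                           model="articletopic", threshold=0.5):
    """Return (topic, probability) pairs at or above the threshold,
    sorted from most to least likely.

    `response` is the parsed JSON body, assumed to be nested as
    wiki -> "scores" -> revision ID -> model -> "score" -> "probability".
    """
    score = response[wiki]["scores"][str(rev_id)][model]["score"]
    picked = [(topic, prob)
              for topic, prob in score["probability"].items()
              if prob >= threshold]
    return sorted(picked, key=lambda pair: pair[1], reverse=True)
```

The parsed body could come from any HTTP client; keeping the parsing separate from the request makes it easy to test against a saved response.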
The challenge is that this model only works for English Wikipedia, and while efforts are being made to expand it to more languages, this is difficult. To overcome this challenge, a separate model was developed to make predictions based not on articles but on Wikidata (loosely a database of facts about concepts that have Wikipedia articles -- e.g., the Wikidata item for N. K. Jemisin). This model can be used to generate topic predictions for Wikipedia articles in any language based on their associated Wikidata items (yay!). We have developed an experimental API; this project will rewrite the code for this model so that it works in the production-level ORES environment. While this is primarily an engineering task, there will also be opportunities for machine learning and data science work as desired.
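To make the idea of language-independent features concrete, here is a minimal sketch of turning a Wikidata item into a flat list of tokens. It assumes the standard Wikidata entity JSON layout (`claims` keyed by property ID, each statement carrying a `mainsnak`); representing an item as a bag of property and value IDs is an assumption made for illustration, not a description of the actual model's features, and `claims_to_tokens` is a hypothetical helper.

```python
def claims_to_tokens(entity):
    """Flatten a Wikidata entity's claims into property/value ID tokens.

    Statements whose value is another item contribute both the property
    ID and the item ID (e.g. "P31", "Q5"); all other statements
    contribute only the property ID.
    """
    tokens = []
    for prop, statements in entity.get("claims", {}).items():
        for statement in statements:
            snak = statement.get("mainsnak", {})
            datavalue = snak.get("datavalue", {})
            is_item = (snak.get("snaktype") == "value"
                       and datavalue.get("type") == "wikibase-entityid")
            if is_item:
                tokens.extend([prop, datavalue["value"]["id"]])
            else:
                # Dates, strings, "no value" snaks, etc.: keep the property only.
                tokens.append(prop)
    return tokens
```

Because the tokens are identifiers such as P31 (instance of) and Q5 (human) rather than words, the same features apply no matter which language edition the article belongs to.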
- Python coding -- the code for this API will be in Python so at least some prior experience will be necessary.
- Jupyter notebooks: these will likely only be used for the initial application but are also a very useful medium for sharing code and analyses. If you have prior experience with Jupyter notebooks, great! If not, we can help you learn.
- Basic understanding of machine learning. We can incorporate more or less machine learning work in this project as needed though.
Each applicant will submit a Jupyter notebook that demonstrates an ability to work with the Python code and features that comprise the topic classification model as well as the ability to do some basic model evaluation. Note that, unlike many Outreachy tasks, we are not asking each applicant to claim a task but instead to all work independently on the same task. Feel free to help each other out though! This task is described here: T246013
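As one example of what basic model evaluation could look like for a multi-label topic model, here is a minimal sketch that computes micro-averaged precision, recall, and F1 over sets of topic labels. The choice of metric and the helper name `micro_prf` are assumptions for illustration; the task itself may ask for something different.

```python
def micro_prf(true_labels, predicted_labels):
    """Micro-averaged precision, recall, and F1 for multi-label output.

    Both arguments are equal-length sequences of label collections,
    one entry per article.
    """
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # topics correctly predicted
        fp += len(pred - truth)   # topics predicted but not true
        fn += len(truth - pred)   # true topics that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Micro-averaging pools the counts across all articles and topics, so frequent topics dominate the result; per-topic scores would be needed to see how rare topics fare.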
Research paper with more background on topic classification models at Wikimedia: https://dl.acm.org/doi/10.1145/3274290
Overview of existing ORES topic models: https://www.mediawiki.org/wiki/ORES#Topic_routing