Overview
Finding topically similar articles on Wikipedia is useful for both researchers and editors. For example, given Tim Cook as an input, one might consider Marissa Mayer to be a good topically similar article because both individuals were CEOs of major technology platforms. Matching is the process of identifying similar articles that differ in some key way (e.g., gender or article quality).
For researchers who want to isolate the impact of, e.g., someone's gender on the quality of their article, one approach is to build a sample of 10,000 articles about men and then, for each article, find an article about someone who is similarly notable but identifies as a woman. In this case, Marissa Mayer may be a good match for Tim Cook. For editors who want a good example of a high-quality article about a similar topic, e.g., to identify potential sections or templates to add, Steve Jobs might be a better match, as he was also a CEO of Apple and has a substantially longer article.
This specific matching challenge has been carefully studied in Controlled Analyses of Social Biases in Wikipedia Bios by Field et al., who then designed an approach for doing this on Wikipedia using the category system.
Task
This research will implement Field et al.'s approach, with the goal of building an API so that it can easily be used by researchers and editors in their work:
- Read Controlled Analyses of Social Biases in Wikipedia Bios
- Implement the tf-idf pivot algorithm in PAWS (Jupyter Notebook) on a very small language edition (so as not to exceed memory constraints), using the categorylinks table.
- [Stretch] Showcase approach as Toolforge app.
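To make the second step more concrete, here is a minimal sketch of category-based matching with plain tf-idf weights and cosine similarity. It is a simplification and not Field et al.'s exact method: the pivot-slope normalization from the paper is left out, and the function names, the `excluded` parameter, and the toy category data are all hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def idf_weights(article_categories):
    """Inverse document frequency per category: rarer categories weigh more."""
    n_articles = len(article_categories)
    doc_freq = Counter(
        cat for cats in article_categories.values() for cat in set(cats)
    )
    return {cat: math.log(n_articles / df) for cat, df in doc_freq.items()}


def tfidf_vector(categories, idf):
    """An article lists a category at most once, so tf is 1 and weight = idf."""
    return {cat: idf[cat] for cat in set(categories) if cat in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[cat] for cat, weight in u.items() if cat in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_match(target, candidates, article_categories, excluded=frozenset()):
    """Return (score, title) of the candidate most similar to the target.

    `excluded` holds categories tied to the attribute that should differ
    (an assumption of this sketch), so they do not count against a match.
    """
    idf = idf_weights(article_categories)

    def vec(title):
        cats = [c for c in article_categories[title] if c not in excluded]
        return tfidf_vector(cats, idf)

    target_vec = vec(target)
    scored = [(cosine(target_vec, vec(c)), c) for c in candidates if c != target]
    return max(scored) if scored else None


# Hypothetical toy data; real input would come from the categorylinks table.
toy = {
    "Tim Cook": ["American chief executives", "Apple Inc. executives", "Living people", "Men"],
    "Marissa Mayer": ["American chief executives", "Yahoo! people", "Living people", "Women"],
    "Ada Lovelace": ["English mathematicians", "Women"],
    "Steve Jobs": ["American chief executives", "Apple Inc. executives", "Men"],
}

score, title = best_match(
    "Tim Cook", ["Marissa Mayer", "Ada Lovelace"], toy, excluded={"Men", "Women"}
)
print(title)
```

Restricting `candidates` to articles in the comparison group (here, articles about women) is what turns a plain similarity search into matching. A real implementation would likely precompute the idf weights once rather than on every call.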
Rationale
If the approach generates reasonable results, it could be worth investing further resources to build a fully functional API that could be used by editors. Showcasing a fully functional pipeline would allow researchers to more easily incorporate it into their work.
Recommended Skills
- Python: the pipeline and any API would be built in Python
- Bonus: experience with the Flask (API) and SQLite (potential database backend) Python libraries will help with showcasing the tool, as will JavaScript for the front end, though this can easily be learned or adapted from elsewhere.
- Algorithms / data structures: for understanding the algorithm described in Field et al.
Acceptance Criteria
- The output of this task will be a PAWS notebook showcasing the pipeline for processing category data, the matching algorithm, and a few examples of outputs.
- The stretch goal (optional) would take this PAWS notebook and extend it into an interactive web API hosted on Toolforge.
- NOTE: it is not expected that this tool can be implemented for all of English Wikipedia given its size, but smaller languages like Simple English Wikipedia should work.
Process
- If you are interested in this task and it is not assigned to anyone, you may begin work on it. Please leave a comment on the task and tag @Isaac so that he is aware.
- If you have made some progress on the task (draft code implementing algorithm in PAWS notebook) and would like to continue, share a link to your current draft and let @Isaac know so that he can assign the task to you and help you to plot out the next steps.
- Generally, @Isaac will be able to answer any questions about the task and will try to respond quickly when clarification is necessary, but response times may be slow if help is needed for more general debugging etc.
Resources
- Code from Field et al.
- This code contains the pivot matching algorithm but it looks like you'll still need to develop code for extracting category information from the category dumps (below).
- Categorylinks table
- mwsql library for easily processing categorylinks table
- PAWS infrastructure: Wikimedia-hosted Jupyter notebooks that provide local access to Wikimedia data
- Examples of processing relevant data on PAWS: https://public.paws.wmcloud.org/User:Isaac_(WMF)/Basic%20Data%20Access%20on%20PAWS.ipynb#SQL-Dumps
- Algorithm description: Controlled Analyses of Social Biases in Wikipedia Bios
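As a starting point for the category extraction mentioned above, the following sketch groups categorylinks rows into a per-article list of categories. It assumes the rows have already been parsed (for example with mwsql) into `(cl_from, cl_to, cl_type)` tuples, where `cl_from` is the page ID, `cl_to` is the category title, and `cl_type` is one of `page`, `subcat`, or `file`. The table schema has changed over time, so check the column names in the current dump before relying on this. The function name, the `hidden_categories` parameter, and the sample rows are hypothetical.

```python
from collections import defaultdict


def build_article_categories(rows, hidden_categories=frozenset()):
    """Group (cl_from, cl_to, cl_type) rows into {page_id: [category, ...]}."""
    article_categories = defaultdict(list)
    for page_id, category, link_type in rows:
        # Keep only links from regular pages; skip subcategory and file links.
        if link_type != "page":
            continue
        # Optionally drop maintenance or tracking categories supplied by the caller.
        if category in hidden_categories:
            continue
        article_categories[page_id].append(category)
    return dict(article_categories)


# Hypothetical rows standing in for parsed dump output.
sample_rows = [
    (1, "American_chief_executives", "page"),
    (1, "Living_people", "page"),
    (2, "American_chief_executives", "page"),
    (3, "Chief_executives_by_nationality", "subcat"),
]

print(build_article_categories(sample_rows))
```

The result is keyed by page ID, so mapping IDs to article titles would require joining against the page table separately.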