Summary
The Machine Learning team at the Wikimedia Foundation builds ethical ML solutions to help Wikimedia communities and teams in the pursuit of open knowledge. In this pursuit, there is now a need for new technical infrastructure to scale these solutions even further. The current system, ORES, has enabled ML at the Foundation for roughly six years. The proposed infrastructure, Lift Wing, aims to enable more widespread participation and collaboration with volunteers and communities by lowering the bar to contribute models to the system.
Working in this direction, this project aims to have a Google Summer of Code intern design and train three ML models over the course of the internship, recreating the performance of three currently deployed ORES models.
Motivation
The motivation behind this project is to begin recreating models that are available on ORES for future deployment on Lift Wing. Beyond that, the project will act as a catalyst and proof of concept for volunteer contribution to, and the accessibility of, Machine Learning projects at the Foundation. It aims to lower the barriers to participating in ML at WMF by reducing the time and effort new contributors must spend learning technologies unique to Wikimedia, allowing them to jump straight into building the models that communities require, using libraries and packages they are already familiar with.
How does this help communities?
The direct impact of this project is twofold:
- Allow communities to utilize new models that are Lift Wing deployable
- Act as proof-of-concept for lowering the bar for participation in ML at WMF
Description
The models in ORES are built on scikit-learn and utilize custom-written libraries such as revscoring and mwparserfromhell. The goal of this project is to train ML models that perform as well as or better than the existing models, without using revscoring or the ORES infrastructure.
To state this precisely: to successfully complete the project, the intern must achieve equivalent or better performance on the same task, given the original data used to train the models, while ensuring that all libraries they use are open-source and industry-standard (e.g. TensorFlow, PyTorch, scikit-learn, Hugging Face), with the exception of mwparserfromhell, which is required to parse the wikitext included in the data.
The intern will be expected to submit three Jupyter notebooks, one per model, containing the code for data loading, data preprocessing, exploratory data analysis, feature engineering, model building, and model validation. Each notebook should carry appropriate documentation and comments, and the repository should include a separate README.
(The specific models to recreate are left to the intern's choice and can be discussed with the mentors.)
Mentors
@calbon Director, Machine Learning, WMF
@Chtnnh Google Summer of Code '20 intern
You can reach out to the mentors by commenting on this task (preferred) or via Zulip (chtnnh)
Microtasks
Completing the following microtasks will help applicants prepare their GSoC application and, if selected, their internship.