Retraining models from ORES to be deployable on Lift Wing
Open, Needs Triage, Public

Description

IMPORTANT: Make sure to read the GSoC participant instructions and communication guidelines thoroughly before commenting on this task. This space is for project-specific questions, so avoid asking questions about getting started, setting up Gerrit, etc. When in doubt, ask your question on Zulip first!

Summary

The Machine Learning team at the Wikimedia Foundation aims to build ethical ML solutions that help Wikimedia communities and teams in the pursuit of open knowledge. To scale these solutions further, a new technical infrastructure is now needed. The current system, ORES, has been enabling ML at the Foundation for ~6 years. The proposed infrastructure, Lift Wing, aims to enable more widespread participation and collaboration with volunteers and communities by lowering the bar to contribute models to the system.

Working in this direction, this project aims to have a Google Summer of Code intern design and train three ML models over the course of the internship, each of which recreates the performance of a currently deployed ORES model.

Motivation

The motivation behind this project is to begin recreating the models that are available on ORES so that they can be deployed on Lift Wing in the future. In addition, this project will act as a catalyst and proof-of-concept for volunteer contribution to, and accessibility of, the Machine Learning projects at the Foundation. It aims to lower the barriers to participating in ML at WMF by reducing the time and effort new contributors have to spend learning technologies unique to Wikimedia, allowing them to jump straight to building the models that communities require with libraries and packages they already know.

How does this help communities?

The direct impact of this project is twofold:

  1. Allow communities to utilize new models that are Lift Wing deployable
  2. Act as proof-of-concept for lowering the bar for participation in ML at WMF

Description

The models in ORES are built on scikit-learn and use custom-written libraries such as revscoring, mwparserfromhell and more. The goal of this project is to train ML models that perform as well as or better than the existing models without using revscoring or the ORES infrastructure.

To state this precisely: to successfully complete the project, the intern must achieve equivalent or better performance on the same task, given the original data used to train the models, while ensuring that all the libraries they use are open source and industry standard (e.g. TensorFlow, PyTorch, scikit-learn, Hugging Face, etc.). The one exception is mwparserfromhell, which is required to parse the wikitext that is included in the data.
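One way to make "equivalent or better" checkable inside a notebook is a small comparison helper like the sketch below. It is only an illustration: the metric names, the numbers, and the `meets_baseline` function are hypothetical and not taken from ORES, and it assumes every metric is "higher is better". Which metrics actually count should be agreed with the mentors.

```python
def meets_baseline(candidate: dict, baseline: dict, tolerance: float = 0.0) -> bool:
    """Return True if the candidate matches or beats every baseline metric.

    Assumes all metrics are "higher is better". A metric that is missing
    from the candidate counts as a failure.
    """
    return all(
        candidate.get(name, float("-inf")) >= value - tolerance
        for name, value in baseline.items()
    )


# Placeholder numbers for illustration only, not real ORES statistics.
baseline = {"roc_auc": 0.90, "pr_auc": 0.40}
candidate = {"roc_auc": 0.91, "pr_auc": 0.40}

print(meets_baseline(candidate, baseline))
```

A non-zero `tolerance` could be used if the mentors accept small differences that fall within run-to-run noise.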

The intern will be expected to submit three Jupyter notebooks, one for each model. Each notebook should contain the code for data loading, data preprocessing, exploratory data analysis, feature engineering, model building and model validation, with appropriate documentation and comments within the notebook, plus a separate README for the repository.

(The specific models to recreate are left as a choice to the intern and can be discussed)

Mentors

@calbon Director, Machine Learning, WMF
@Chtnnh Google Summer of Code '20 intern

You can reach out to the mentors by commenting on this task (preferred) or via Zulip (chtnnh)

Microtasks

Completion of the following microtasks will help the aspirant prepare for their GSoC application and for their internship, if selected.

  1. Understand the ORES architecture at a high level. Documentation can be found here
  2. Understand the revscoring architecture at a high level. Documentation can be found here
  3. Identify models to recreate by going through the list of current models here
  4. Submit proposal!

Event Timeline

@Chtnnh, I am Apoorv Garg, a B.Tech sophomore in Information Technology. I came across this project through Google Summer of Code and I would like to contribute to it. I have the skills required and a little experience in machine learning. So far I have gone through this link: https://www.mediawiki.org/wiki/ORES. Thank you.

@Apoorv-Nsut Hello! Thank you for your interest in the project. I hope the microtasks are clear. The goal of the microtasks is to get contributors familiar with the models that they will be retraining.

You can reach out to the mentors with any questions you have here on the task or via Zulip.

Once you are familiar with the models, you can pick three of them after consulting with the mentors and create an outline of how you will be retraining the selected models.

All the best! Hope to see you contributing.

Hi @Chtnnh,
I am Sourabh, a B.Tech sophomore in Computer Science. I'm an aspiring participant of GSoC '21 and this project piqued my interest. I have the skills required and some experience in ML.

Hello @Sourabh112112!

Sounds great, why don't you go ahead and try your hand at the microtasks and let me know if you have any questions. All the best!

Hey mentor @Chtnnh, I am Paritosh Singh, a sophomore in B.Tech CS at NIT Kkr, and I have previous experience working with the ML team on model building. I have gone through the microtasks and would love to take up the project for GSoC '21. I believe I have the skills required, along with good experience in ML and in working with the team.

Hey @Psingh07

Yes, I encourage you to apply and would also like to remind you that if you have any questions during the process you can send them across here or via Zulip.

All the best!

@Chtnnh, Sir, I have gone through the microtasks and learned about ORES and revscoring in detail. I have also gone through the research on the topic "Automated classification of edit quality". Kindly guide me through the next step.

Hello @Apoorv-Nsut

First things first, no need to call me sir 😄

Secondly, great job on completing the microtasks and reading up on "Automated classification of edit quality". I think you are well suited to begin looking at the models themselves and understanding their specific ML related features. For example, what kind of model is it, how was it tuned, etc. This will help you choose the models you will propose to retrain.

Note to all participants: We recommend choosing models from three different model classes, i.e. articlequality, draftquality, drafttopic, editquality

Hope this helps. You can reach out here if you have any further doubts!

@Chtnnh, I have completed the first 3 microtasks and now I'm reading more about the drafttopic model at https://github.com/wikimedia/drafttopic.

@Chtnnh, I have a doubt about your statement

Note to all participants: We recommend choosing models from three different model classes, i.e. articlequality, draftquality, drafttopic, editquality

So, do we have to select any three of them, or do we have to select just one of them for now?

@Sourabh112112

Great job on completing the microtasks!

What I meant by

Note to all participants: We recommend choosing models from three different model classes, i.e. articlequality, draftquality, drafttopic, editquality

was that as part of the internship you are required to retrain three models, and we recommend that you select at most one model from each of the four classes mentioned above.

For example, I would select one articlequality, one draftquality and one drafttopic model, instead of selecting three articlequality models, or two draftquality models and one editquality model.

I hope this statement is clear now. Let me know if you have any other questions!

Hello @Chtnnh, my name is Umang Gupta, a second-year undergrad student. I have little knowledge of ML, but I am a quick learner and want to contribute to this project. Please guide me on how to do so.

I'd suggest reading the task description, there's the microtask section, as well as the comments above.

Hey Chanitanya,

I am Anubhav Sharma, a third-year undergrad research student at IIIT Hyderabad. I have gone through all the microtasks and have read up on the models. I have experience working with existing models, both statistical and neural. I wanted to ask: can we propose better architectures for the existing models so that their performance can be improved? For example, by replacing some of the RNN-based approaches with the latest transformer-based approaches.

Hello @Iamumangg! I recommend going through the task description and posting your progress updates and questions here. All the best!

Hello @Anubhav-sharma13! Welcome to Wikimedia. To answer your question, yes you can definitely propose better architectures for the models. Although speaking from experience, a neural based approach usually performs similar to a statistical approach on this task as it is not complex enough for neural models to have a significant advantage. Regardless, if you want to try building some lightweight neural architectures that can improve the performance, you are more than welcome to do so. All the best!

Let me know if you have any questions or updates.

Hello @Chtnnh, my name is Nwobodo Leonard, a 3rd-year student of Mechanical Engineering from Nigeria. I am interested in this project and would like to participate. I am going through the microtask of familiarizing myself with the ORES architecture and revscoring. So far it is not making much sense, but I am determined to keep trying. I have some fundamental knowledge of Python and machine learning. Can I ask all of my questions here or on Zulip?

@Chtnnh, how can I get more information on the models to recreate? I have gone to this page, https://ores-support-checklist.toolforge.org/, but I don't understand what I should be looking for or how to identify a model that needs to be recreated.

@NnaKene For more information on the models, you can go through their GitHub repositories at https://github.com/wikimedia.

@NnaKene Hello Nwobodo. I think Zulip would be a good place to ask your questions if they are related to getting started or some technical details about the microtasks themselves.
