Summary
The Machine Learning team at the Wikimedia Foundation builds ethical ML solutions to help Wikimedia communities and teams in the pursuit of open knowledge. In this pursuit, there is now a need for new technical infrastructure to scale these solutions even further. The current system, ORES, has enabled ML at the Foundation for roughly six years. The proposed infrastructure, Lift Wing, aims to enable more widespread participation and collaboration with volunteers and communities by lowering the bar to contribute models to the system.
Working in this direction, this project aims to have a Google Summer of Code intern design and train three ML models over the course of the internship, recreating the performance of three currently deployed ORES models.
Motivation
The motivation behind this project is to begin recreating models that are available on ORES for future deployment on Lift Wing. Beyond that, the project will act as a catalyst and proof of concept for volunteer contribution to, and the accessibility of, Machine Learning projects at the Foundation. It aims to lower the barriers to participating in ML at WMF by reducing the time and effort new contributors must spend learning technologies unique to Wikimedia, allowing them to jump straight into building the models that communities require, using libraries and packages they are already familiar with.
How does this help communities?
The direct impact of this project is twofold:
- Allow communities to utilize new models that are Lift Wing deployable
- Act as proof-of-concept for lowering the bar for participation in ML at WMF
Description
The models in ORES are built on scikit-learn and utilize custom-written libraries such as revscoring and mwparserfromhell. The goal of this project is to train ML models that perform as well as or better than the existing models, without using revscoring or the ORES infrastructure.
To state this precisely: to successfully complete the project, the intern must achieve equivalent or better performance on the same task, given the original data used to train the models, while ensuring that all libraries they use are open-source and industry-standard (e.g. TensorFlow, PyTorch, scikit-learn, Hugging Face), with the exception of mwparserfromhell, which is required to parse the wikitext included in the data.
The intern will be expected to submit three Jupyter notebooks, one per model, containing the code for data loading, data preprocessing, exploratory data analysis, feature engineering, model building, and model validation. Each notebook should carry appropriate documentation and comments, and the repository should include a separate README.
(The specific models to recreate are left to the intern's choice and can be discussed with the mentors.)
Mentors
@calbon Director, Machine Learning, WMF
@Chtnnh Google Summer of Code '20 intern
You can reach out to the mentors by commenting on this task (preferred) or via Zulip (chtnnh)
Microtasks
Completing the following microtasks will help applicants prepare their GSoC application and, if selected, their internship.