
GSOC 2021 Proposal : Retraining models from ORES to be deployable on Lift Wing
Closed, Declined · Public

Description

Proposal for : https://phabricator.wikimedia.org/T278261

Profile

Name: Apoorv Garg
IRC nickname on Freenode: Apoorv-Nsut
Proposal pdf: T278261_Proposal_Apoorv
Web Profile: Linkedin
Location: Uttar Pradesh, India
Time Zone: UTC+05:30
Working hours: 3 PM to 8 PM. (UTC+05:30)

Synopsis

Anyone can edit articles on Wikipedia. To keep article quality high, automated review is needed, and ORES provides it. Since ORES is unique to Wikimedia, new contributors must invest extra time and effort in learning ORES and revscoring before diving into the models.
The project aims to build models based on open-source libraries and current machine learning technologies, with the following objectives:

  1. Allow communities to utilize new models that are deployable on Lift Wing
  2. Act as a proof of concept for lowering the bar to participation in ML at WMF
  3. Achieve equivalent or better performance

Article quality: This model categorizes an article into one of six classes ['Stub', 'Start', 'C', 'B', 'GA', 'FA'], which describe the quality of the article.
Articlequality/enwiki uses a GradientBoosting model from the revscoring library with deviance as the loss function.

Draft topic: This model predicts the topic of a new article draft.
Drafttopic/enwiki uses a GradientBoosting model from the revscoring library with deviance as the loss function.

Edit quality: This model classifies whether an edit is damaging, made in good faith, or likely to be reverted.
Damaging: predicts whether or not an edit causes damage.
Editquality/enwiki/damaging uses a GradientBoosting model from the revscoring library with deviance as the loss function.
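For illustration, a gradient-boosted binary classifier in the same spirit as these revscoring models can be sketched with scikit-learn. The features and labels below are synthetic stand-ins, not the real edit features:

```python
# Hedged sketch: a gradient-boosted binary classifier in the spirit of the
# editquality/enwiki/damaging model. Features and labels are synthetic; the
# real model uses revscoring's edit features.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                  # toy stand-in for edit features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # 1 = "damaging" (toy label)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Log loss (historically named "deviance" in scikit-learn) is the default
# loss for binary classification with GradientBoostingClassifier.
clf = GradientBoostingClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```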

I am in contact with @Chtnnh through Zulip and Phabricator.

Technical approach

[Figure: technical approach diagram (GS_.jpg, 291×711 px)]

The models will be implemented in Python.
Each model's work is divided into four parts:
Data loading: Data from Wikipedia articles will be extracted through widely used libraries such as BeautifulSoup and Requests, and stored in a buffer.
Data preprocessing: The extracted data is natural language and contains a lot of noise, which should be cleaned up by eliminating stopwords and restoring words to their root form (stemming). Once the data has been cleaned, it must be vectorized to pass through the model. A local vocabulary (Word2Vec) will be created and stored as a dictionary for future use. All these steps can be performed with the nltk library.

A bag-of-words pipeline will be used: Tokenization → Stopword removal → Stemming → TF-IDF vectorization

Feature engineering: Raw data will be transformed into meaningful features. The one-hot encoding method will be applied, which converts categorical data to a numerical format and makes it possible to group categorical data without losing information. There is, however, no standard theory for finding the best feature set; when a new feature is defined, the only validation method is empirical testing.
Model implementation: We have a multiclass classification problem in two models (article quality and draft topic) and a binary classification problem in one (edit quality/damaging). For text classification, neural network models can be used to achieve better results. Although they require more computation, the results they achieve compensate for the computing cost.
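The one-hot encoding step described under feature engineering can be sketched as follows (the category values are illustrative):

```python
# Illustrative one-hot encoding of a categorical feature with scikit-learn.
# The quality labels here mirror the article-quality classes.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

quality = np.array([["Stub"], ["Start"], ["GA"], ["Stub"]])
encoder = OneHotEncoder(handle_unknown="ignore")
onehot = encoder.fit_transform(quality).toarray()  # one column per category

print(encoder.categories_[0])  # categories are sorted alphabetically
print(onehot)                  # each row has exactly one 1
```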
Furthermore, the focus of the implemented models will be on:

  1. Hyperparameter tuning: the random search method or a gradient-based optimization technique will be preferred.
  2. Performance analysis charts: the matplotlib library will be used to showcase the model's performance metrics.
  3. Cross-validation/early stopping: either technique will be used to prevent overfitting.
  4. Statistical analysis: a confusion matrix will be used to further examine the accuracy of the retrained model.
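Points 1, 3, and 4 above can be combined in a short scikit-learn sketch; the dataset and parameter ranges here are illustrative, not the project's real configuration:

```python
# Hedged sketch: random-search hyperparameter tuning with cross-validation,
# followed by a confusion matrix on held-out data. Dataset is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_dist = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_dist,
    n_iter=4,       # sample 4 of the 8 possible combinations
    cv=3,           # 3-fold cross-validation per candidate
    random_state=0,
)
search.fit(X_train, y_train)

cm = confusion_matrix(y_test, search.predict(X_test))
print(cm)  # rows = true class, columns = predicted class
```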

The model's pipeline will be presented graphically to show the flow of data and the model's architecture. This will aid the coding and thought process and enable new developers to quickly grasp the model flow.

The building process will be systematically documented, and a separate README for the repository will also be prepared.

Timeline

Period | Description | Task
May 17 - May 23 | Community bonding period | Interact with other Wikimedia community members to gain insight into how the community operates and what the community's dos and don'ts are.
May 24 - May 30 | Community bonding period | Discuss the ORES-trained models in depth, the Wikipedia template, and the previously read research papers with the mentors exhaustively.
May 31 - June 6 | Community bonding period | Present my ideas for contributing to the stakeholder by suggesting machine learning models for better and more efficient performance for the pre-decided models.
June 7 - June 13 | Week 1 | M1 M2 M3: Present the models' pipelines graphically to show the flow of data and each model's architecture. This will aid the coding and thought process and enable new developers to quickly grasp the model flow.
June 14 - June 20 | Week 2 | M1 M2 M3: Implement data loading and extraction for all three retrained models, using standard libraries such as BeautifulSoup and Requests. Regular expressions will be used to further clean the data and extract the raw data from the Wikipedia template.
June 21 - June 27 | Week 3 | M1: Carry out data preprocessing and preparation throughout the week to obtain the best predictions possible, using the nltk library. Save the Word2Vec dictionary with the pickle library. Train and tune the neural-based model using RandomizedSearchCV and improve the retrained model's performance. Evaluate different features to achieve higher performance. Evaluate model configurations using repeated stratified k-fold cross-validation with three repeats and 10 folds. Save callbacks and model checkpoints to obtain the optimal model.
June 28 - July 4 | Week 4 | M1: Test the model and build a confusion matrix to analyze its performance. Document the model architecture and prepare a separate README for the repository. Get it reviewed by the mentors and make changes, if necessary.
July 5 - July 11 | Week 5 | M2: Develop data preprocessing specific to the draft topic model. Train and tune the new model with the random search method and improve the retrained model's performance. The neural-based model will be implemented with cosine similarity as the loss function and Adam as the optimizer.
July 12 - July 16 | Phase 1 evaluation | Review the work done since the beginning of the program. Submit evaluations of mentors.
July 17 - July 23 | Week 6 | M2: Test the model and build a confusion matrix to analyze its performance. Document the model architecture and prepare a separate README for the repository. Get it reviewed by the mentors and make changes, if necessary.
July 24 - July 30 | Week 7 | M3: Develop data preprocessing specific to the edit quality model. Train and tune the new model with the random search method and improve the retrained model's performance. The neural-based model will be implemented with binary cross-entropy as the loss function and Adam as the optimizer.
July 31 - August 6 | Week 8 | M3: Test the model and build a confusion matrix to analyze its performance. Document the model architecture and prepare a separate README for the repository. Get it reviewed by the mentors and make changes, if necessary.
August 7 - August 13 | Week 9 | M1 M2 M3: To ensure adequate model performance, test and compare the predictions of the retrained models with the help of mentors and members of the English Wiki group. Complete and finalize all aspects of the models.
August 14 - August 15 | Work completed after Phase 1 | Self-evaluate the work and select the top-performing models.
August 16 - August 23 | Final evaluation | Assist with integration and implementation of the models and ensure they are published into production. Mentors submit final student evaluations.
August 24 | Future with Wikimedia | Expand model retraining to other wikis and assist new developers in doing so. Improve the models' efficiency and reliability by adding new, appropriate features. Examine the retrained models and solicit input from the group on the wiki.
August 31 | | Final results of Google Summer of Code 2021 announced.
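The repeated stratified k-fold scheme mentioned for Week 3 (10 folds, three repeats) can be sketched with scikit-learn; the classifier and data here are placeholders:

```python
# Sketch of repeated stratified k-fold cross-validation: 10 folds x 3 repeats
# yields 30 accuracy estimates. Data and model are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, random_state=1)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(len(scores), round(scores.mean(), 3))
```

Averaging the 30 scores gives a more stable performance estimate than a single train/test split.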

Deliverables

The Rapid Application Development (RAD) software model will be followed for the project.
Jupyter notebooks will be maintained with proper inline comments and documentation.

  • M1: Flowchart of the model pipeline: the interaction among modules, the flow of data, and the model's architecture
  • M1: Jupyter Notebook and model.h5/.pkl
  • M1: Tuning report and documentation
  • M1: README.md
  • M2: Flowchart of the model pipeline: the interaction among modules, the flow of data, and the model's architecture
  • M2: Jupyter Notebook and model.h5/.pkl
  • M2: Tuning report and documentation
  • M2: README.md
  • M3: Flowchart of the model pipeline: the interaction among modules, the flow of data, and the model's architecture
  • M3: Jupyter Notebook and model.h5/.pkl
  • M3: Tuning report and documentation
  • M3: README.md

Tools

NumPy, matplotlib, mwparserfromhell, nltk, pickle, TensorFlow/Keras, scikit-learn, and PyTorch

Participation

  • I'll create a new git repository with two branches. Code will be committed to the dev branch regularly and, once checked and verified, merged into the master branch.
  • During my working hours (3:00 PM to 8:00 PM UTC+05:30), I will be available on IRC to collaborate with the mentors.
  • For bug and subtask management, I'll use Phabricator and Zulip.
  • During non-working hours, I can be reached via Gmail.

About Me

I am currently enrolled at the Netaji Subhas University of Technology in Delhi, pursuing a B.Tech in Information Technology. I am a dedicated learner who works with full zeal and enthusiasm. I prioritize my commitments and balance each and every aspect scrupulously. This is my first time participating in the GSoC program; I heard about it through our professors and seniors. Since I won't have any other obligations this summer, GSoC will be my top priority.

Wikipedia's vision of making content accessible in any natural language inspires and excites me. I assume that contributing to Wikimedia will have a positive effect on the learning community. Considering all the relevant learnings I expect to gain from this project, I contemplate this experience as an extensive skill-enhancing experience for my succeeding career. Making this project happen will be one of the greatest accomplishments I wish to achieve.

Past Experience

I've worked with C++, Java, Python, HTML, CSS, JavaScript, and Node.js, among other languages. I have experience with MySQL, among other databases. Git is my preferred version control system, and macOS is the operating system I use the most.

  • Individual Project -
    • AI Image Captioning Bot - Built a web bot that takes an image as input and predicts a caption as output. The ResNet50 architecture was used along with transfer learning. A caption generator module was built and embedded in the website through Flask.
    • Covid detection using X-rays - An X-ray dataset was preprocessed in one Jupyter notebook and a CNN-based model was built in another, making use of a generator function to load the large dataset.
    • Sentiment analysis - A multiclass Naive Bayes classifier was applied to a movie review dataset. Natural language processing (NLP) was used to vectorize the dataset and to build a local vocabulary.
  • Group Project -
    • CanSat - A CanSat is a type of sounding rocket payload used to teach space technology. Along with a group of nine members, I built a CanSat. I was assigned to build the GUI and the flight software module.

Microtasks carried out

  1. ORES documentation: https://ores.readthedocs.io/en/latest/
  2. Revscoring documentation: https://revscoring.readthedocs.io/en/latest/
  3. Identify models to recreate: https://ores-support-checklist.toolforge.org/

Event Timeline

Apoorv-Nsut updated the task description.

@Apoorv-Nsut Good work on the proposal, just try not to spam with edits. I would have recommended finalizing the proposal to a fair degree before creating the Phabricator task. Nevertheless, try to reduce spam going forward.

@Chtnnh I am really sorry for this. I didn't know it would spam your mail; I will keep it in mind. I accidentally created the task early and tried deleting it, but there was no option. I won't repeat the mistake. Sorry for the inconvenience.

@Chtnnh @calbon, this is the first draft of my proposal. Please comment.

Hey @Apoorv-Nsut

Thanks for showing your interest in participating in Google Summer of Code with the Wikimedia Foundation! Please make sure to upload a copy of your proposal on Google's program site as well, in whatever format is expected of you, and include in it a link to this public proposal on Phabricator before the deadline, i.e. April 13th. Good luck :)

Hello Apoorv! Great job on the proposal so far, just some feedback:

  1. It would be great if you could add a section briefing the readers on your technical approach. Preferably, keep this before the timeline.
  2. Try and make the timeline more detailed, add reasoning for why you have designed it in such a manner.
  3. Details about your technical approach would be great. Feel free to add it to any relevant section.

All the best!

Thank you for your feedback, I have made the necessary changes in my proposal.

GSoC application deadline has passed. If you have submitted a proposal on the GSoC program website, please visit https://phabricator.wikimedia.org/project/view/5104/ and then drag your own proposal from the "Backlog" to the "Proposals Submitted" column on the Phabricator workboard. You can continue making changes to this ticket on Phabricator and have discussions with mentors and community members about the project. But, remember that the decision will not be based on the work you did after but during and before the application period. Note: If you have not contacted your mentor(s) before the deadline and have not contributed a code patch before the application deadline, you are unfortunately not eligible. Thank you!

Hello @srishakatux !

This specific project did not require the applicants to submit a code patch before the application deadline. Although it did require them to do research, similar to the project I mentored in Outreachy Round 21. All the submitted proposals are based on the research conducted by the applicant. In addition, all the applicants have been in contact with either Chris or myself and hence all of them are eligible.

@Apoorv-Nsut Kindly follow the other instructions that Srishti has given in her previous comment. Thank you.

@Chtnnh, I have already moved my proposal from backlog to the Proposals submitted column as instructed by the mentor @srishakatux.
Thank you

@Apoorv-Nsut We are sorry to say that we could not allocate a slot for you this time. Please do not consider the rejection to be an assessment of your proposal. We received over 100 quality applications, and we could only accept 10 students. We were not able to give a slot to every applicant who would have deserved one, and these were some very tough decisions to make. Please know that you are still a valued member of our community and we by no means want to exclude you. Many students whom we did not accept in 2020 have become Wikimedia maintainers, contractors, and even GSoC students and mentors this year!

Your ideas and contributions to our projects are still welcome! As a next step, you could consider finishing up any pending pull requests or inform us that someone has to take them over. Here is the recommended place for you to get started as a newcomer: https://www.mediawiki.org/wiki/New_Developers.

If you are still eligible for GSoC next year, we look forward to your participation.