Proposal for : https://phabricator.wikimedia.org/T278261
Profile
Name: Apoorv Garg
IRC nickname on Freenode: Apoorv-Nsut
Proposal pdf: T278261_Proposal_Apoorv
Web Profile: LinkedIn
Location: Uttar Pradesh, India
Time Zone: UTC+05:30
Working hours: 3 PM to 8 PM (UTC+05:30)
Synopsis
Anyone can edit articles on Wikipedia. To ensure high-quality articles, automated scoring is needed, which is provided by ORES. Since ORES is unique to Wikipedia, new contributors must devote extra time and effort to learning ORES and revscoring before diving into the models.
The project aims to build models that are based on open-source libraries and the latest machine learning technologies with the following objectives :
- Allow communities to utilize new models that are Lift Wing deployable
- Act as a proof-of-concept for lowering the bar for participation in ML at WMF
- Achieve equivalent or better performance than the existing models
- M1 : Model 1 [Article quality] enwiki : https://github.com/wikimedia/articlequality
Article quality: This model categorizes an article into one of 6 classes ['Stub', 'Start', 'C', 'B', 'GA', 'FA'], which define the quality of the article.
Articlequality/enwiki uses the GradientBoosting model from the revscoring library with deviance as the loss function.
- M2 : Model 2 [Draft topic] enwiki : https://github.com/wikimedia/drafttopic
Draft topic: This model predicts the topic of a new article draft.
Drafttopic/enwiki uses the GradientBoosting model from the revscoring library with deviance as the loss function.
- M3 : Model 3 [Edit quality/damaging] enwiki : https://github.com/wikimedia/editquality
Edit quality: This model classifies whether edits are damaging, good-faith, or reverted.
Damaging: predicts whether or not an edit causes damage.
Editquality/enwiki/damaging uses the GradientBoosting model from the revscoring library with deviance as the loss function.
I am in contact with @Chtnnh through Zulip and Phabricator.
Technical approach
The models will be implemented in Python.
Each model implementation is divided into 4 parts:
Data loading: Data from Wikipedia articles will be extracted through widely used libraries such as Beautiful Soup and Requests and stored in a buffer.
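As a minimal sketch of this step, the snippet below builds a request URL for the standard MediaWiki Action API and pulls the raw wikitext out of the JSON it returns. The endpoint and parameter names are from the public Action API; the sample response here is a trimmed, illustrative stand-in so the sketch runs offline, and in practice the request itself would be made with the Requests library.

```python
# Sketch of the data-loading step: fetch an article's wikitext through the
# MediaWiki Action API. Uses only the stdlib; the `requests` library would
# perform the actual HTTP call in the project.
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def build_revision_url(title: str) -> str:
    """Build an Action API URL that returns the latest revision's wikitext."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)

def extract_wikitext(api_response: str) -> str:
    """Pull the raw wikitext out of a buffered API JSON response."""
    data = json.loads(api_response)
    pages = data["query"]["pages"]
    page = next(iter(pages.values()))   # the API keys pages by page id
    return page["revisions"][0]["slots"]["main"]["*"]

# Offline demonstration with a trimmed sample response (illustrative only):
sample = json.dumps({
    "query": {"pages": {"9228": {"revisions": [
        {"slots": {"main": {"*": "'''Earth''' is the third planet..."}}}
    ]}}}
})
print(build_revision_url("Earth"))
print(extract_wikitext(sample))
```

The extracted wikitext would then be buffered and handed to the preprocessing stage described next.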
Data preprocessing: The extracted data is natural language, which contains a lot of useless information. It will be cleaned up by eliminating stopwords and restoring words to their root form (stemming). Once the data has been cleaned, it must be vectorized before it can be fed to the model. A local vocabulary (Word2Vec) will be created and stored as a dictionary for future use. All these steps can be performed with the NLTK library.
A bag-of-words pipeline will be used: Tokenization —> Stopword removal —> Stemming —> TF-IDF vectorization
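The pipeline above can be sketched end to end in plain Python. This is only an illustration of the four stages: the tiny stopword list and suffix-stripping stemmer are stand-ins for NLTK's stopword corpus and PorterStemmer, and the TF-IDF weights are computed by hand with the standard tf · log(N/df) formula.

```python
# Dependency-free sketch of the bag-of-words pipeline:
# tokenization -> stopword removal -> (crude) stemming -> TF-IDF.
import math
import re
from collections import Counter

STOPWORDS = {"the", "is", "a", "an", "of", "and", "to", "in"}  # toy list

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())

def stem(word: str) -> str:
    # Toy stemmer: strip a few common suffixes (PorterStemmer in practice).
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list[str]:
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]

def tfidf(docs: list[str]) -> list[dict[str, float]]:
    """One sparse TF-IDF vector (term -> weight) per document."""
    token_docs = [preprocess(d) for d in docs]
    n = len(token_docs)
    df = Counter(t for doc in token_docs for t in set(doc))
    vectors = []
    for doc in token_docs:
        tf = Counter(doc)
        vectors.append({
            t: (count / len(doc)) * math.log(n / df[t])
            for t, count in tf.items()
        })
    return vectors

docs = ["The article is editing quickly", "The editors edited the article"]
vectors = tfidf(docs)
print(vectors[0])   # terms appearing in every document get weight 0
```

Note how "editing" and "edited" collapse to the same stem, so the two documents share vocabulary entries, which is exactly what stemming buys the model.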
Feature engineering: Raw data will be transformed into something meaningful. The one-hot encoding method, which changes categorical data to a numerical format, will be applied, enabling categorical data to be grouped without losing any information. There is no standard theory for finding the best feature set; when a new feature is defined, the only validation method is empirical testing.
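A minimal sketch of the one-hot step, using the M1 quality classes as example labels. In the project this would likely be scikit-learn's OneHotEncoder or keras.utils.to_categorical; the plain-Python version below just shows why no information is lost: every label maps to a distinct indicator vector.

```python
# One-hot encode categorical labels into binary indicator vectors.
def one_hot_encode(labels):
    """Return (ordered categories, one indicator vector per label)."""
    categories = sorted(set(labels))              # fixed, reproducible order
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for label in labels:
        vec = [0] * len(categories)
        vec[index[label]] = 1                     # single hot position
        vectors.append(vec)
    return categories, vectors

# A few article-quality classes from the M1 model as sample labels:
classes = ["Stub", "Start", "GA", "Stub"]
cats, encoded = one_hot_encode(classes)
print(cats)      # ['GA', 'Start', 'Stub']
print(encoded)   # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Since the mapping from label to vector is a bijection, grouping or aggregating the encoded rows preserves the original categorical information.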
Model implementation: We have a multiclass classification problem in two models (article quality and draft topic) and a binary classification problem in one model (edit quality/damaging). For text classification, neural network models can be used to achieve better results. Although they require more computation, the results achieved compensate for the computing cost.
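The shape of such a classifier can be sketched with a NumPy forward pass: one hidden ReLU layer over the feature vector, then a softmax over the output classes (6 for M1's quality classes, 2 for M3's damaging/not-damaging). The layer sizes here are illustrative, and in the project the model would be built and trained with TensorFlow/Keras or PyTorch rather than by hand.

```python
# Minimal numpy sketch of a feed-forward classifier head over text features.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)     # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(x, w1, b1, w2, b2):
    """One hidden ReLU layer followed by a softmax output layer."""
    h = np.maximum(0, x @ w1 + b1)           # hidden activations
    return softmax(h @ w2 + b2)              # per-class probabilities

n_features, n_hidden, n_classes = 50, 16, 6  # 6 quality classes for M1
w1 = rng.normal(0, 0.1, (n_features, n_hidden)); b1 = np.zeros(n_hidden)
w2 = rng.normal(0, 0.1, (n_hidden, n_classes)); b2 = np.zeros(n_classes)

x = rng.normal(size=(4, n_features))         # a batch of 4 feature vectors
probs = forward(x, w1, b1, w2, b2)           # shape (4, 6), rows sum to 1
```

For M3 the same head would have 2 output units (or a single sigmoid unit) trained with binary cross-entropy, as noted in the timeline.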
Furthermore, the focus of the implemented model will be on -
- Hyperparameter tuning: Random search method or Gradient-based optimization technique will be preferred.
- Performance analysis chart: matplotlib library will be used to showcase the performance metric of the model.
- Cross-validation/Early stopping: Either technique will be used to prevent overfitting.
- Statistical analysis: A confusion matrix will be used to further observe the accuracy metric of the re-trained model.
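As an illustration of the statistical-analysis step, the snippet below builds a confusion matrix for the binary damaging model (M3) in plain Python and derives accuracy from it. The labels and predictions are fabricated sample data; in practice sklearn.metrics.confusion_matrix would produce the same counts.

```python
# Confusion matrix for a binary classifier: rows = true label,
# columns = predicted label; diagonal entries are correct predictions.
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def accuracy(matrix):
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Fabricated sample labels for illustration:
y_true = ["damaging", "ok", "ok", "damaging", "ok"]
y_pred = ["damaging", "ok", "damaging", "damaging", "ok"]
m = confusion_matrix(y_true, y_pred, ["damaging", "ok"])
print(m)             # [[2, 0], [1, 2]]
print(accuracy(m))   # 0.8
```

Beyond accuracy, the off-diagonal cells show exactly which class the model confuses, which is the point of using the matrix for the retrained models.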
The model's pipeline will be presented graphically to show the flow of data and the model's architecture. This will aid the coding and thought process and enable new developers to quickly grasp the model flow.
The building process will be systematically documented, and a separate README for the repository will also be prepared.
Timeline
| Period | Description | Task |
| May 17 - May 23 | Community bonding period | Interacting with other Wikimedia community members to gain insight into how the community operates and what the community's dos and don'ts are. |
| May 24 - May 30 | Community bonding period | Will discuss the ORES-trained models in depth, the Wikipedia template, and the previously read research papers exhaustively with the mentors. |
| May 31 - June 6 | Community bonding period | Present my ideas for contributing to the stakeholders by suggesting machine learning models for better and more efficient performance for the pre-decided models. |
| June 7 - June 13 | Week 1 | M1 M2 M3: The models' pipelines will be presented graphically to show the flow of data and the models' architecture. This will aid the coding and thought process and enable new developers to quickly grasp the model flow. |
| June 14 - June 20 | Week 2 | M1 M2 M3: Data loading and extraction will be implemented for all three retrained models. Standard libraries such as Beautiful Soup and Requests will be used in the process. Regular expressions will be used to further clean the data and extract the raw data from the Wikipedia template. |
| June 21 - June 27 | Week 3 | M1: Throughout the week, data preprocessing and preparation will be carried out to obtain the best prediction possible using the NLTK library. The Word2Vec dictionary will be saved with the help of the pickle library. Training and tuning the neural-based model using RandomizedSearchCV and improving the performance of the retrained model. Different features will be evaluated to achieve higher performance. Model configurations will be evaluated using repeated stratified k-fold cross-validation with three repeats and 10 folds. Callbacks and model checkpoints will be saved to obtain the optimal model. |
| June 28 - July 4 | Week 4 | M1: Testing the model and building a confusion matrix to analyze its performance. The model architecture will be documented, along with a separate README for the repository. Getting it reviewed by the mentors and making changes, if necessary. |
| July 5 - July 11 | Week 5 | M2: Data preprocessing specific to the draft topic model will be developed. Training and tuning the new models with the random search method and improving the performance of the retrained model. The neural-based model will be implemented with cosine similarity as the loss function and Adam as the optimizer. |
| July 12 - July 16 | Phase 1 Evaluation | Review the work that was done since the beginning of the program. Submit evaluations of mentors. |
| July 17 - July 23 | Week 6 | M2: Testing the model and building a confusion matrix to analyze its performance. The model architecture will be documented, along with a separate README for the repository. Getting it reviewed by the mentors and making changes, if necessary. |
| July 24 - July 30 | Week 7 | M3: Data preprocessing specific to the edit quality model will be developed. Training and tuning the new models with the random search method and improving the performance of the retrained model. The neural-based model will be implemented with binary cross-entropy as the loss function and Adam as the optimizer. |
| July 31 - August 6 | Week 8 | M3: Testing the model and building a confusion matrix to analyze its performance. The model architecture will be documented, along with a separate README for the repository. Getting it reviewed by the mentors and making changes, if necessary. |
| August 7 - August 13 | Week 9 | M1 M2 M3: To ensure adequate model performance, the predictions of the retrained models will be tested and compared with the help of mentors and members of the English Wiki community. Completing and finalizing all aspects of the models. |
| August 14 - August 15 | Wrap-up | Work completed after the Phase 1 evaluation will be self-evaluated, and the top-performing models will be selected. |
| August 16 - August 23 | Final Evaluation | Assist with the integration and deployment of the models and ensure that they are published to production. Mentors submit final student evaluations. |
| August 24 | Future with Wikimedia | Expanding model retraining to other wikis, as well as assisting new developers in doing so. Improving the models' efficiency and reliability by adding new, appropriate features. Examining the retrained models and soliciting input from the community on the wiki. |
| August 31 | Final results of Google Summer of Code 2021 announced | |
Deliverables
The Rapid Application Development (RAD) software model will be followed for the project.
Jupyter notebooks will be maintained with proper inline comments and documentation.
- M1: Flowchart: Pipeline of Model: Describing the interaction among the modules, Flow of data, the architecture of the model
- M1: Jupyter Notebook and model.h5/pkl
- M1: Tuning Report and documentation
- M1: Readme.md
- M2: Flowchart: Pipeline of Model: Describing the interaction among the modules, Flow of data, the architecture of the model
- M2: Jupyter Notebook and model.h5/pkl
- M2: Tuning Report and documentation
- M2: Readme.md
- M3: Flowchart: Pipeline of Model: Describing the interaction among the modules, Flow of data, the architecture of the model
- M3: Jupyter Notebook and model.h5/pkl
- M3: Tuning Report and documentation
- M3: Readme.md
Tools
NumPy, Matplotlib, mwparserfromhell, NLTK, pickle, TensorFlow/Keras, scikit-learn, and PyTorch
Participation
- I'll build a new git repository with two branches. Code will be submitted to the dev branch regularly, and once checked and verified, it will be merged with the master branch.
- During my working hours (3:00 pm to 8:00 pm UTC +5:30), I will be available on IRC to collaborate with the mentors.
- For bug and subtask management, I'll use Phabricator and Zulip.
- During non-working hours, I can be contacted via Gmail.
About Me
I am currently enrolled at the Netaji Subhas University of Technology in Delhi, pursuing a B.Tech in Information Technology. I am a dedicated learner who works with full zeal and enthusiasm. I prioritize my commitments and balance each and every aspect scrupulously. This is my first time participating in the GSoC program, which I heard about through my professors and seniors. Since I won't have any other obligations during this summer, GSoC will be my top priority.
Wikipedia's vision of making content accessible in any natural language inspires and excites me. I believe that contributing to Wikimedia will have a positive effect on the learning community. Considering all the relevant learnings I expect to gain from this project, I see it as an extensive skill-enhancing experience for my future career. Making this project happen will be one of the greatest accomplishments I wish to achieve.
Past Experience
I've worked with C++, Java, Python, HTML, CSS, JavaScript, and Node.js, among other languages. I have experience with MySQL, among other databases. Among version control systems (VCS), Git is my preferred tool. macOS is the operating system I use the most.
- Individual Project -
- AI Image Captioning Bot - Built a bot that takes an image as input and predicts a caption as output. The ResNet50 architecture was used along with transfer learning. A caption-generator module was built and embedded in a website through Flask.
- Covid detection using X-rays - An X-ray dataset was preprocessed in one Jupyter notebook, and a CNN-based model was built in another, making use of a generator function to load the large dataset.
- Sentiment analysis - A multiclass Naive Bayes classifier was applied to a movie review dataset. Natural language processing (NLP) was used to vectorize the dataset as well as to build a local vocabulary.
- Group Project -
- CanSat - A CanSat is a type of sounding-rocket payload used to teach space technology. Along with a group of 9 members, we built a CanSat. I was assigned to build the GUI and the flight software module.
Microtask carried out
- ORES documentation: https://ores.readthedocs.io/en/latest/
- Revscoring documentation: https://revscoring.readthedocs.io/en/latest/
- Identify models to recreate: https://ores-support-checklist.toolforge.org/
