
Proposal : Retraining models from ORES to be deployable on Lift Wing
Open, Needs Triage, Public

Description

Profile Information

Name - Anubhav Sharma
IRC nickname on Freenode - anubhav_sharma
PDF Proposal - T279961_Anubhav
Resume - Anubhav_resume.pdf
Web Profile - anubhav-sharma13
Location - Hyderabad, India
Typical working hours - 10:00–20:00 (UTC+5:30)

Synopsis

  • ORES, the machine learning service currently deployed at Wikimedia, uses its models to help maintain safer data for Wikimedia’s users. The existing ORES infrastructure, which has enabled ML at the Wikimedia Foundation for the last six years, needs to be made more scalable and accessible. The “drafttopic” and “articletopic” models perform a multilabel classification task with 64 categories, whereas the “editquality” model classifies whether an edit should be reverted or not (based on whether it is ‘goodfaith’ or ‘damaging’).
  • This proposal describes the implementation of models deployable on the Lift Wing architecture for enwiki, where the “editquality”, “drafttopic” and “articletopic” models will work without depending on revscoring. The work plan follows a dual pipeline: for each model I will try multiple approaches, chosen on the basis of the available hardware, with additional features on top of the existing revscoring features.
  • Upon successful completion of this project, enwiki will have reliable, scalable models for “articletopic”, “editquality” and “drafttopic” that aim to outperform the existing models with the help of newer features and state-of-the-art (SOTA) NLP techniques.
  • This enhanced architecture will affect the millions of people using the wiki, giving them a better experience. Since the automated annotation of edits and articles will be more accurate, it will also reduce the need for human intervention.
  • Mentors - @Chtnnh @calbon
  • Have you contacted your mentors already? - Yes, I have been in regular contact with @Chtnnh.

Technical Description

  • The models will be coded in the Python programming language and the code will be delivered as Jupyter Notebooks.
  • The workflow requires an extractor class that fetches the data needed both to create the features and to serve as input to the models.
  • After getting the data, the work plan splits into two plans: Plan A & Plan B.
    • Plan A - In this plan, I will limit myself to retraining the models with statistical approaches and will spend more time on extracting features rather than on training the models.
      • This plan will be used if the hardware isn’t equipped to handle neural-network approaches (lacks enough memory or a GPU) or if Plan B isn’t getting good results.
    • Plan B - In this plan, the main target is to replace the existing models with deep learning approaches. It requires more execution time than Plan A (but this is compensated by the shorter time required for feature extraction).
  • Feature Extraction - The ORES architecture uses multiple features from revscoring’s feature modules (such as pronoun counts per gender for drafttopic, or counts of occurrences of words from a {bad words} set for editquality), and I will try to reuse them.
  • Enhancements - I will use several other approaches to enhance the models’ performance, described below:
    • Change in word embeddings - I will change the embeddings from word2vec (context independent) cite to pretrained BERT/fastText cite embeddings, or fine-tuned BERT embeddings, depending on the available resources.
    • Ensembling - Another technique I will follow is ensembling. It will be used in Plan A to enhance performance while staying within limited hardware capability. This requires an odd number of classifiers so that a majority consensus is always possible. Besides XGBoost, I am going to use the following algorithms:
      • CatBoost - A well-performing algorithm whose built-in tools for interpreting model features can be useful for feature engineering. cite
      • LightGBM - A lightweight algorithm with lower memory usage and better efficiency. It will be added to boost our results without putting much extra load on the existing system. cite
    • Classifier - I will change the core classifier from statistical classifiers (XGBoost) to modern deep learning models (for example, Transformer models like BERT). cite
    • NER-based features - Named entities are a great feature for classification tasks, as they carry information about the nouns being discussed in a particular text cite. Named-entity occurrence counts will be appended as a feature vector at the end of the embeddings.
    • GNN - I will use a Graph Neural Network to encode correlations among categories (drafttopic and articletopic) by generating latent representations of the categories. This approach worked for me in my publication at NTCIR-15. cite
    • Inclusion of the article introduction - The introduction is of great importance for a text: it summarises the information available in the entire text. So I will encode the entire introduction of an article (articletopic). If it exceeds BERT’s limit of 512 tokens, I will use Reformer cite to encode the longer text.
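As a rough illustration of the extractor idea and the revscoring-style statistical features described above, here is a minimal sketch; the class name, word list and sample text are all hypothetical and not part of revscoring's actual API:

```python
import re

# Hypothetical, minimal stand-in for a revscoring-style feature extractor.
# The word list and sample text are illustrative only.
BAD_WORDS = {"stupid", "idiot", "nonsense"}

class RevisionFeatureExtractor:
    """Extracts simple statistical features from a revision's text."""

    def __init__(self, bad_words=BAD_WORDS):
        self.bad_words = bad_words

    def extract(self, text):
        # Tokenise crudely, then count occurrences of "bad words".
        tokens = re.findall(r"[a-z']+", text.lower())
        n_tokens = len(tokens)
        n_bad = sum(1 for t in tokens if t in self.bad_words)
        return {
            "n_tokens": n_tokens,
            "n_bad_words": n_bad,
            "bad_word_ratio": n_bad / n_tokens if n_tokens else 0.0,
        }

extractor = RevisionFeatureExtractor()
features = extractor.extract("This edit is nonsense, truly stupid nonsense.")
# features["n_bad_words"] → 3, features["n_tokens"] → 7
```

A real extractor would fetch the revision text from the MediaWiki API before computing such statistics.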
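The majority-vote ensembling over an odd number of classifiers can be sketched as follows; the prediction lists are purely illustrative placeholders for the outputs of the XGBoost, CatBoost and LightGBM models:

```python
from collections import Counter

# Minimal majority-vote ensemble over an odd number of classifiers.
def majority_vote(*prediction_lists):
    assert len(prediction_lists) % 2 == 1, "need an odd number of classifiers"
    # For each data point, take the most common label across classifiers.
    return [Counter(preds).most_common(1)[0][0]
            for preds in zip(*prediction_lists)]

# Illustrative binary editquality predictions: 1 = "damaging", 0 = not.
xgb_preds = [1, 0, 1, 1]
cat_preds = [1, 1, 0, 1]
lgbm_preds = [0, 0, 1, 1]

ensembled = majority_vote(xgb_preds, cat_preds, lgbm_preds)  # → [1, 0, 1, 1]
```

The odd classifier count guarantees a strict majority exists for every binary prediction.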
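The NER-based feature idea, appending entity occurrence counts to the end of an embedding, might look like the sketch below; the entity labels, embedding values and example entities are placeholders rather than the output of any particular NER tagger:

```python
# Sketch: append per-entity-type counts to a text embedding as extra features.
ENTITY_TYPES = ["PERSON", "ORG", "LOC"]

def ner_count_vector(entities):
    """Count occurrences of each entity type from (text, label) pairs."""
    counts = {t: 0 for t in ENTITY_TYPES}
    for _, label in entities:
        if label in counts:
            counts[label] += 1
    return [counts[t] for t in ENTITY_TYPES]

def build_feature_vector(embedding, entities):
    # Entity counts are appended at the end of the embedding.
    return embedding + ner_count_vector(entities)

embedding = [0.12, -0.30, 0.88, 0.05]  # placeholder text embedding
entities = [("Marie Curie", "PERSON"), ("Warsaw", "LOC"), ("Sorbonne", "ORG")]
features = build_feature_vector(embedding, entities)
# 4 embedding dims + [1, 1, 1] entity counts → 7-dim feature vector
```

In practice the embedding would come from BERT/fastText and the entities from an NER tagger run over the same text.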

Deliverables

  1. editquality.ipynb - Jupyter Notebook containing the source code for the “editquality” model, with modules implementing the individual functionalities for:
    1. data_preprocessing & data_loading
    2. feature_extraction & feature_engineering
    3. training & testing
  2. editquality_config.txt - the features and configurations used for the best model.
  3. editquality_report.pdf - a complete analysis of the model’s performance, with suitable visualisations.
  4. editquality_analysis.py - receives all the predicted values, data points and features used, and produces the visuals (histograms, prediction graphs, correlation matrix, etc.). All performance evaluation of the editquality model takes place here.
  5. editquality.md - a markdown file with complete instructions to run the code; it will also contain small example data points to explain the data format.
  6. articletopic.ipynb - Jupyter Notebook containing the source code for the “articletopic” model, with modules implementing the individual functionalities for:
    1. data_preprocessing & data_loading
    2. feature_extraction & feature_engineering
    3. training & testing
  7. articletopic_config.txt - the features and configurations used for the best model.
  8. articletopic_report.pdf - a complete analysis of the model’s performance, with suitable visualisations.
  9. articletopic_analysis.py - receives all the predicted values, data points and features used, and produces the visuals (histograms, prediction graphs, correlation matrix, etc.). All performance evaluation of the articletopic model takes place here, including analysis of the multilabel distribution and label frequencies.
  10. articletopic.md - a markdown file with complete instructions to run the code; it will also contain small example data points to explain the data format.
  11. drafttopic.ipynb - Jupyter Notebook containing the source code for the “drafttopic” model, with modules implementing the individual functionalities for:
    1. data_preprocessing & data_loading
    2. feature_extraction & feature_engineering
    3. training & testing
  12. drafttopic_config.txt - the features and configurations used for the best model.
  13. drafttopic_report.pdf - a complete analysis of the model’s performance, with suitable visualisations.
  14. drafttopic_analysis.py - receives all the predicted values, data points and features used, produces the visuals for analysis and also examines the label distribution.
  15. drafttopic.md - a markdown file with complete instructions to run the code; it will also contain small example data points to explain the data format.
  16. README.md - a general introduction to the project and the overall file structure.
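As a hint of the kind of evaluation the *_analysis.py scripts could compute, here is a minimal sketch of a binary confusion matrix and accuracy for editquality-style predictions; the label values are illustrative:

```python
# Sketch: 2x2 confusion matrix and accuracy for binary predictions.
def confusion_matrix(y_true, y_pred):
    matrix = [[0, 0], [0, 0]]  # rows: true label, cols: predicted label
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Illustrative ground truth and predictions (1 = "damaging", 0 = not).
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)            # [[2, 0], [1, 2]]
accuracy = (cm[0][0] + cm[1][1]) / len(y_true)   # 0.8
```

The actual scripts would build on this with histograms, per-label breakdowns and correlation matrices as described above.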

Project Timeline

Dates/Days | Description | Progress
May 17 – May 23 | Week 1 of Community Bonding period | Time spent getting to know the Wikimedia community in depth, exploring its norms and standards, and getting acquainted with the different teams.
May 24 – May 30 | Week 2 of Community Bonding period | Most issues regarding the working environment will be resolved by collaborating with the mentors and other teams. The hardware needs to be tested before committing to a particular pipeline, which will be done in this period.
May 31 – June 6 | Week 3 of Community Bonding period | Time spent analysing and discussing how to make my approach more efficient on the existing hardware. This period is crucial for deciding whether my pipeline will follow Plan A or Plan B. I will also study the label distribution of the editquality dataset (to adjust my techniques if the data is heavily skewed), and spend time explaining the problem statement to other folks from the Wikimedia community, so that it is easier for them when I later ask them to evaluate my results.
June 7 – June 13 | Week 1 of coding | Plan A - implement data_preprocessing, data_loading & feature_extraction (extracting the existing revscoring features) for editquality and align them for training. Plan B - do all the preprocessing, then run the model without any specific features to get baseline results (the baseline neural network classifier with pretrained embeddings for editquality).
June 14 – June 20 | Week 2 of coding | Plan A - implement feature_engineering for editquality. Create a small dataset (~10% of the test set) to test the efficiency of each new feature added (including the NER word-embedding feature), then test the classifiers (models with different classifying algorithms) on the complete test set. A report analysing the model’s performance is produced. Plan B - fine-tune embeddings and add features to the neural network for the editquality task; test on the test set and fine-tune to get the best model. A report analysing the model’s performance is produced.
June 21 – June 27 | Week 3 of coding | Plan A - before beginning the multilabel categorisation, I will do an extensive study of the distribution of data points & labels: the ratio of data points with multiple labels, the average number of data points per label (to adjust my training strategy) and the average number of labels per data point. Then I will do the preprocessing, extract the features and format them for the drafttopic model. Plan B - after analysing the data properly, process it and feed it to the baseline neural network (the baseline neural network classifier with pretrained embeddings for drafttopic).
June 28 – July 4 | Week 4 of coding | Plan A - based on the prior week’s study, mine more features and test them on the test data (the NER word-embedding feature). If the features meet certain criteria (e.g. for K-fold training, the folds must be split so that no fold is left without a data point for some label), a particular training method will be used for the different classifier algorithms of the drafttopic model. Plan B - evaluate on the test set in accordance with the analysed output and fine-tune to get the best model. Debug the code thoroughly, as the GNN for label correlations needs to be trained next week.
July 5 – July 11 | Week 5 of coding | Plan A - analyse overall performance, fix all leftover bugs and clear the remaining documentation requirements, including reports (for both drafttopic and editquality). Plan B - train the GNN for drafttopic and use its output as features in the drafttopic classifier model; then analyse performance and complete the documentation (for both drafttopic and editquality).
July 12 – July 16 | Phase 1 Evaluations |
July 17 – July 23 | Week 6 of coding | First, make the amendments suggested in the Phase 1 evaluations. Plan A - repeat the data analysis for articletopic before diving into the implementation, followed by the same data treatment and feature extraction used previously for drafttopic. Plan B - use visuals and statistical methods to estimate data features before loading the data for the articletopic baseline model; then perform data_preprocessing and data_loading to get baseline results (the baseline neural network classifier with pretrained embeddings for articletopic).
July 24 – July 30 | Week 7 of coding | Plan A - engineer features according to the performance of the existing features, then train according to the data distribution. Plan B - fine-tune embeddings and add features to the neural network for the articletopic task. This time, try K-fold training depending on the label distribution, then fine-tune to get the best model. Debug the code thoroughly to ensure a smoother transition to the GNN module.
July 31 – August 6 | Week 8 of coding | Plan A - analyse overall performance, fix all leftover bugs and clear the remaining documentation requirements for articletopic. Plan B - train the GNN for articletopic and use its output as features in the articletopic classifier model; then analyse performance and complete the documentation for articletopic.
August 7 – August 15 | Week 9 of coding | For both plans this is the final week before submission, so I will ensure the models are released in a usable state, with all necessary comments and a simple class structure to increase the readability of the code.
August 16 – August 23 | Final Evaluations | Submit the final source code with all documentation.

MicroTasks

  1. ORES Documentation - Read through the ORES documentation and understood how it works.
  2. Revscoring Documentation - Read through the revscoring documentation and understood the methodologies it uses to perform ML-related tasks.
  3. Identify models to recreate - Selected the three models that I will be recreating.

As Ronald Coase said, “Torture the data, and it will confess to anything.” In that spirit, I looked at various articles and edits in search of techniques to make my features more feasible.

Participation

Following are the channels that I will be using for communicating :

  1. Phabricator - for documentation and information sharing
  2. IRC - for general queries
  3. Zulip - for task related queries
  4. Teams - For meetings
  5. Mail - For official communication

For source code sharing and development, I can work with any version control system (VCS), but GitHub is preferable.

About Me

  • I am Anubhav Sharma, a third-year undergraduate researcher in Computer Science and Engineering at IIIT Hyderabad, India. I have been an algorithmic programmer and NLP researcher for the past two years. I have undertaken many NLP-related projects, including a Transformer-based fake news detection network, query-based summarisation and building a search engine from scratch (available on GitHub). I am currently working in the field of fact alignment for the Hindi language. I also have a publication at NTCIR-15 [4] in the field of fine-grained categorisation of Wikipedia articles. All of my research work so far aligns with the IndicWiki Project, and the idea of open source in general excites me a lot.
  • How did you hear about this program? From my seniors at college and then from the Internet.
  • Will you have any other time commitments, such as school work, another job, planned vacation, etc, during the duration of the program? The time frame for the project is from mid-May to mid-August. I will be on summer break from the end of April until mid-August, so I will be able to give the project my full attention, and I will strive to excel and contribute to it.
  • What does making this project happen mean to you? - The idea of learning as Jimmy Wales saw it when he started Wikipedia, how important education is for everyone, makes me ponder how big a change my efforts can bring about. Wikimedia is another such brilliant organisation, giving me a platform for what I would describe as “work for all is work for yourself”. This project means an immense amount to me as an opportunity, in terms of how impactful it will be for people across the globe.

Past Experience

I have been actively working on the IndicWiki Project of the Information Retrieval and Extraction Lab, IIIT Hyderabad {link} for the past year. It is a project in which we try to enrich Wikipedia content in Indian languages (currently Telugu and Hindi). This project aligned with my ideology of working for a cause and has transformed me into a better developer; the learning and habits I developed through it are difficult to pen down. Since this project is also closely linked to the WMF and shares a similar cause, it will again be an inspiring experience to contribute.
On the technical front, I am an NLP researcher and am up to date with the state-of-the-art techniques used in deep learning for NLP. I have a strong preference for Python, but language selection depends heavily on the downstream task (for competitive programming I prefer C++; for lower-level programming involving OS-related tasks I prefer C). I prefer Linux (currently on Ubuntu), and my preferred editors are VS Code and Vim. For the other projects I have done, please refer to my GitHub.

Event Timeline

GSoC application deadline has passed. If you have submitted a proposal on the GSoC program website, please visit https://phabricator.wikimedia.org/project/view/5104/ and then drag your own proposal from the "Backlog" to the "Proposals Submitted" column on the Phabricator workboard. You can continue making changes to this ticket on Phabricator and have discussions with mentors and community members about the project. But, remember that the decision will not be based on the work you did after but during and before the application period. Note: If you have not contacted your mentor(s) before the deadline and have not contributed a code patch before the application deadline, you are unfortunately not eligible. Thank you!

Hello @srishakatux !

This specific project did not require the applicants to submit a code patch before the application deadline. Although it did require them to do research, similar to the project I mentored in Outreachy Round 21. All the submitted proposals are based on the research conducted by the applicant. In addition, all the applicants have been in contact with either Chris or myself and hence all of them are eligible.

@Anubhav-sharma13 Kindly follow the other instructions that Srishti has given in her previous comment. Thank you.

What's "Lift Wing"? You might want to include a sentence to describe it, or a link.