
Proposal : Retraining models from ORES to be deployable on Lift Wing
Closed, Declined | Public

Description

Profile Information

Name: Paritosh Singh
Proposal PDF: T278261_Proposal_Paritosh
IRC nickname on Freenode: lskPari
Web Profile: Sharingan-Coder01
Location: Sharjah, AE
Typical working hours: 14:00 - 23:00 (UTC+4)

Synopsis

  • Machine learning evolves quickly, and production systems need to keep pace with the best available tooling. As a regular volunteer contributor to the Wikimedia community for the past few months, I have come to understand why the community needs this project and why @calbon and the rest of the ML team plan to move Wikimedia from ORES to Lift Wing.

ORES currently supports the ‘articlequality’, ‘draftquality’, ‘drafttopic’, and ‘editquality’ ML models for various wiki communities, and these models are built with a custom library.
The first task for this project will be to select models from enwiki (English Wiki) and retrain them using widely used, industry-standard libraries instead of the custom-built library currently in use.
The models to be retrained during the project are articlequality/enwiki.py | draftquality/enwiki.py | editquality/enwiki.py | articlequality/nlwiki.py
This project also covers the transition from the ORES infrastructure to ‘Kubeflow’. Upon successful completion, each existing model will be replaced by one built on the new infrastructure that achieves equivalent or better performance on the original training data.

  • Mentor(s): @calbon, @Chtnnh
  • Have you contacted your mentors already? Yes, I have been in contact with both the mentors.

Technical Background/Deliverables

The models are implemented in Python, so I will build three Jupyter notebooks, one for each type of model, each containing the code for:

1)Data loading
2)Data preprocessing/Analysis
3)Feature engineering 
4)Model building 

Request --> Features --> Extracted features --> Model --> JSON response
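
The flow above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the feature names, the `toy_model` rule and the response layout are all hypothetical and are not taken from ORES, revscoring or Lift Wing.

```python
import json

# Hypothetical feature definitions: name -> function of the revision text.
FEATURES = {
    "chars": len,
    "words": lambda text: len(text.split()),
    "headings": lambda text: sum(
        1 for line in text.splitlines() if line.startswith("==")
    ),
}


def extract_features(text):
    """Features -> extracted features: compute each value for one text."""
    return {name: fn(text) for name, fn in FEATURES.items()}


def toy_model(values):
    """Stand-in for a trained classifier: a fixed linear rule, not a real model."""
    score = 0.001 * values["chars"] + 0.5 * values["headings"]
    return {"prediction": "Start" if score >= 1.0 else "Stub", "score": score}


def score_request(request):
    """Request -> features -> extracted features -> model -> JSON response."""
    values = extract_features(request["text"])
    result = toy_model(values)
    return json.dumps(
        {"rev_id": request["rev_id"], "features": values, "score": result},
        sort_keys=True,
    )
```

Calling `score_request({"rev_id": 1, "text": "..."})` returns a JSON string; a real model would replace `toy_model` while the surrounding steps keep the same shape.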

My focus in the re-trained models will be on the following:
-> A modified implementation of:

   1)Hyperparameter tuning
   2)Fitness metric generation
   3)Model training/Cross validation
   4)Fitness statistic generation
Scoring currently performs the tasks listed above; the new implementation will aim for equivalent or better fitness under the new architecture.
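
As a rough illustration of how hyperparameter tuning, cross validation and a fitness statistic fit together, here is a minimal standard-library sketch. It assumes a toy one-dimensional k-nearest-neighbours classifier and plain accuracy as the fitness statistic; the real project would use an established ML library and the existing models' own metrics, so every name below is hypothetical.

```python
import random
from collections import Counter


def knn_predict(train, x, k):
    """Predict the majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def k_fold_splits(rows, n_folds, seed=0):
    """Yield (train, test) pairs; each row is held out exactly once."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test


def cv_accuracy(rows, k, n_folds=5):
    """Fitness statistic: mean held-out accuracy over the folds."""
    scores = []
    for train, test in k_fold_splits(rows, n_folds):
        hits = sum(knn_predict(train, x, k) == y for x, y in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)


def tune(rows, grid, n_folds=5):
    """Hyperparameter tuning: score every k in the grid, return the best."""
    results = {k: cv_accuracy(rows, k, n_folds) for k in grid}
    best = max(results, key=results.get)
    return best, results
```

`tune(rows, [1, 3, 5])` takes a list of `(feature, label)` pairs and returns the best `k` together with the score for every candidate, which is the kind of table a comparison against the existing models would need.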

-> Use of the Deltas library, which provides utilities for generating deltas (a.k.a. sequences of operations) representing the difference between two sequences of comparable tokens.
-> Libraries that support faster and more accurate model training will be used.
-> Scoring currently uses the extract_from_text function, which extracts a set of values from a text and returns a cache containing just those values. Although simple in purpose, this function relies heavily on the ORES pipeline and is complex; it can be simplified in the re-trained model to work efficiently with Lift Wing.
-> A better way to handle language assets can be found. (This might exceed the scope of this project.) I would love to work on this in the future.
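
To make the idea of a delta concrete, the sketch below computes a sequence of operations between two token lists. It deliberately uses Python's standard `difflib.SequenceMatcher` as a stand-in and is not the Deltas library's API; the helper names `token_operations` and `tokens_added` are hypothetical.

```python
from difflib import SequenceMatcher


def token_operations(old_tokens, new_tokens):
    """Return (op, old_slice, new_slice) tuples describing how old becomes new."""
    matcher = SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    return [
        (tag, old_tokens[i1:i2], new_tokens[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


def tokens_added(old_tokens, new_tokens):
    """Collect the tokens that an edit introduced (inserted or replaced in)."""
    added = []
    for tag, _, new in token_operations(old_tokens, new_tokens):
        if tag in ("insert", "replace"):
            added.extend(new)
    return added
```

For example, comparing `"the quick brown fox".split()` with `"the quick red fox jumps".split()` yields the operations equal, replace, equal, insert, and `tokens_added` returns `["red", "jumps"]`.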

A Jupyter notebook that semi-automates re-training for the different classes of models will be created; it will act as a standard for re-training models across different communities.
The notebook will serve as a go-to guide for retraining models and will contain the code for data loading and preprocessing. It will show how to extract feature values and how to build custom features in the improved re-trained model.
The notebook will include a generalized method for extracting datasources and debugging, and will show how to handle some common errors during data extraction. This provides a framework for re-training various models across multiple communities, with code that developers can apply directly when re-training models for their own wiki.
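
One way the notebook could handle common extraction errors is to collect failures per revision instead of aborting the whole run. The sketch below assumes an in-memory `REVISIONS` dictionary in place of a real wiki API, and the helper names and error types are hypothetical choices for illustration.

```python
# Hypothetical in-memory stand-in for revision text fetched from a wiki.
REVISIONS = {101: "== Intro ==\nSome text.", 102: ""}


def extract_from_store(rev_id):
    """Extract a few feature values for one revision, raising on bad input."""
    text = REVISIONS[rev_id]  # raises KeyError if the revision is missing
    if not text:
        raise ValueError("empty revision text")
    return {"chars": len(text), "words": len(text.split())}


def extract_all(rev_ids, extract_one):
    """Apply extract_one to every revision; record failures instead of aborting."""
    rows, errors = {}, {}
    for rev_id in rev_ids:
        try:
            rows[rev_id] = extract_one(rev_id)
        except (KeyError, ValueError) as exc:
            errors[rev_id] = f"{type(exc).__name__}: {exc}"
    return rows, errors
```

Running `extract_all([101, 102, 103], extract_from_store)` returns the successfully extracted rows alongside a dictionary of per-revision error messages, which can then be reviewed or retried.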

The build process will be systematically documented, and a separate README for the repository will also be prepared.
-> The retrained models must run effectively on KFServing, so only Python libraries supported by KFServing will be used.

Files to be submitted:

1)articlequality/enwiki.ipynb → Notebook1 containing re-trained model
2)Test and comparison for the re-trained model : File to compare the predictions from the existing models to the retrained model. Report on model performance testing.
3)draftquality/enwiki.ipynb → Notebook2 containing re-trained model.
4)Test and comparison for the re-trained model :  File to compare the predictions from the existing models to the retrained model. Report on model performance testing.
5)editquality/enwiki.ipynb → Notebook3 containing re-trained model.
6)Test and comparison for the re-trained model : File to compare the predictions from the existing models to the retrained model. Report on model performance testing.
7)articlequality/nlwiki.ipynb → Retraining the Dutch wiki articlequality model, to attempt retraining for a community other than English wiki and get a better picture of the challenges one might face while retraining models for different communities. Jupyter Notebook [4].
8)Jupyter notebook [5] for semi-automating the process of re-training the model for different classes of model.
9)Proper inline documentation in all code.
10)Blog posts on portfolio website regarding progress and learning experience.
11)Readme file describing the project for repository.
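
The test-and-comparison files listed above could start from something like the following sketch, which summarizes how a retrained model's predictions relate to the existing model's on the same labelled data. The function names and the choice of plain accuracy and agreement rate are assumptions for illustration; the actual report may use other fitness statistics.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Share of predictions that match the reference labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def compare_models(old_preds, new_preds, labels):
    """Summarize how the retrained model differs from the existing one."""
    if not labels or not (len(old_preds) == len(new_preds) == len(labels)):
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(old_preds, new_preds))
    # Count each (existing, retrained) pair where the two models disagree.
    disagreements = Counter((o, n) for o, n in pairs if o != n)
    return {
        "old_accuracy": accuracy(old_preds, labels),
        "new_accuracy": accuracy(new_preds, labels),
        "agreement": sum(o == n for o, n in pairs) / len(pairs),
        "disagreements": dict(disagreements),
    }
```

The `disagreements` counts show which classes the two models confuse with each other, which is usually a more useful starting point for review than a single accuracy figure.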

Timeline:

| Days/Dates | Description | Milestone Accomplished |
| --- | --- | --- |
| 17 May - 23 May | Week 1 of Community Bonding period | This time will be spent in interacting with the analytics team at Wikimedia and learning the common norms and practices followed in the community. |
| 24 May - 30 May | Week 2 of Community Bonding period | Having already contributed to Wikimedia for several months by this point, this time can be spent partially by setting up the prerequisites for the project, and discussing the workflow of the project with mentors. |
| 31 May - 6 Jun | Week 3 of Community Bonding period | Under the guidance of the mentors, this time can be utilized to start preliminary work on the project. |
| 07 Jun - 13 Jun | Week 1 of Coding | Implementation of data loading and extraction across all three models to be retrained. |
| 14 Jun - 20 Jun | Week 2 of Coding | Data analysis, preprocessing, and preparing will be done throughout the week in order to get the best prediction while using new libraries. |
| 21 Jun - 27 Jun | Week 3 of Coding | Training and debugging the new models and improving fitness of the retrained model. |
| 28 Jun - 04 Jul | Week 4 of Coding | Testing and comparison of model prediction of retrained models with the one that exists currently: to check for results of the work and how the re-trained models could improve. Review the features with mentors to improve the model and achieve better fitness more efficiently. |
| 05 Jul - 11 Jul | Week 5 of Coding | Commencement of retraining articlequality model from nlwiki (Dutch Wiki), starting from data preprocessing and analysis. |
| 12 Jul - 16 Jul | Evaluation with mentor | Review work done since the beginning of the program. Submit evaluations on mentors. |
| 17 Jul - 23 Jul | Week 6 of Coding | Training and debugging of the retrained model for articlequality/nlwiki.ipynb. Testing and comparison of model prediction of retrained articlequality/nlwiki.ipynb alongside mentors and members of the Dutch wiki community to ensure satisfactory model fitness. |
| 24 Jul - 30 Jul | Week 7 of Coding | Starting the build for a semi-automated process for retraining models by using the knowledge earned while retraining the three mentioned models from enwiki. |
| 31 Jul - 06 Aug | Week 8 of Coding | Training/Debugging the Jupyter Notebook containing semi-automated process for model retraining to extract best fitness possible with the help of mentors. |
| 07 Aug - 13 Aug | Week 9 of Coding | Testing and fitness check for semi-automated process built for model retraining. Reviews and extraction of best fitness possible alongside mentors. |
| 14 Aug - 15 Aug | | Work done since Phase 1 Evaluations will be self reviewed, and selection of top performing models will be done. |
| 16 Aug - 23 Aug | Final Evaluation | Ensure models are released into production, assist integration and deployment. Document all the changes made. Blog post about learning experience will be made and submission of final evaluation on mentors will be done. |
| 24 Aug Onwards | Future with Wikimedia | Working on expanding model retraining across other wikis, and assisting new developers at doing so too. Work with @calbon & @Chtnnh to expand work in automating the model retraining process. Add new, relevant features to the new model to improve model performance and reliability. Review the retrained models and collect feedback from wiki community. |

Benefits to Community:

Upon successful completion of this project:
-> Wikimedia community developers will have a framework for re-training the models that currently run on ORES across various communities.
-> The semi-automated process delivered by the project will speed up the retraining of models.
-> Developers will have an easier, time-saving process for model building and maintenance.
-> Model building will move from custom-built libraries to better-known, widely used libraries, so future contributors will not have to learn a custom library before building models.

Since WMF is moving from ORES to Lift Wing in the near future, the models will be retrained following the architecture Lift Wing will use, so that the project remains usable in the long run with very few changes.
This project will stand as an example of ML best practice. Better models directly affect wiki editors and contributors from multiple regions, and indirectly give millions of wiki readers a better experience. This project will pave the way for retraining further models from different wikis.

Participation

I plan to communicate mainly through the following channels:
Phabricator for documented information and updates on the project. I am available on IRC for general queries. Zulip and personal email can be used for task-specific queries. I am reachable during all working hours for team meetings and official conversations regarding progress, updates and further planning for the project.
I will keep updating the source code through commits, as I find this the best way to share code. If that does not turn out to be the best option, services like Codeshare could help.

About Me

I am Paritosh Singh, a sophomore pursuing a B.Tech in Computer Science and Engineering at NIT Kurukshetra, one of India’s premier institutes, with a deep interest in programming, development and data analytics.
I am a quick and enthusiastic learner, a team player with considerable leadership qualities, and a diligent, resourceful and pragmatic worker. I enjoy devising better problem-solving methods for challenging tasks and learning new technologies and tools when the need arises. I am skilled in Python, C++, ML, DSA and public speaking, and I have native or bilingual proficiency in English and Hindi.
I am an open source enthusiast and I truly believe in its capability to influence and contribute to the world. Collaborating with a huge and diverse community of programmers, editors and volunteers from across the globe has helped me improve my soft skills and has given me insight into the workings of an organization.
In my 4 months with Wikimedia, I have come to truly believe in its vision, "Imagine a world where we can all share freely in the sum of all knowledge". I hope to be a long-time contributor in Wikimedia’s spirit of free knowledge and collaborative code.

Apart from academics, I am also an avid reader and regularly contribute with write ups to my college magazine Eunoia. I wish to increase my working and personal life proficiency to be an efficient and experienced professional and an ideal global citizen.

I came to know about this program through my college technical magazine. I had always wanted to contribute to an open-source organization, so I started volunteering with the ML team at WMF.

The time frame for GSoC’21, as announced, is from mid-May to late August. I will be on summer break during this period, so I can guarantee 100% dedication in the given time frame.
I might have a minor college engagement during the last 2 weeks of the project, but I will strive not to let that block my work on the project in any way.

I am eligible to participate in GSoC.

What does making this project mean to me?

I understand the role that Wikimedia plays, and has been playing, in shaping how knowledge is shared around the world. Having actively worked in open source, specifically with Wikimedia, for a few months now, I can be sure that I am on the right path to achieve my goals. I understand the responsibility being put on me and how completing this project will play a big role in the changes being brought to Wikimedia.
It would help me realize that collaboration can lead to great things. The project will help developers directly and millions of Wikipedia users indirectly every day, and this gives me the satisfaction of collaborating for the greater good.

Past Experience

I decided to contribute to open source platforms in order to increase my knowledge and work experience in my chosen career field and to bring about change in a large community. Having actively worked in open source, specifically with Wikimedia, for a few months now, I can be sure that I am on the right path to achieve my goals.
I have previously worked extensively on the development of the nlwiki articlequality model, and have contributed to the articlequality model-building process in T223782. I will also receive assistance from @Halfak during this project.

Any Other Info

RELATED TASKS: Prepare 4 ORES English models for Lift Wing originally created by @calbon T272874
Microtasks Completed:

-> Understand the ORES architecture at a high level.
-> Understand the revscoring architecture at a high level. 
-> Identify models to recreate by going through the list of current models.

Event Timeline

Hey @Psingh07

Thanks for showing your interest in participating in Google Summer of Code with the Wikimedia Foundation! Please make sure to upload a copy of your proposal on Google's program site as well, in whatever format is expected of you, and include this public Phabricator proposal in it before the deadline, i.e. April 13th. Good luck :)

Hey @Gopavasanth !

Thanks for the reminder. I have already shared my proposal draft on Google's program site . Looking forward to valuable feedback from the community and the mentors.

Hey Paritosh! Great job on the proposal so far. Would it be possible for you to elaborate on how you will be semi-automating the retraining process?

Hey @Chtnnh Thanks for the review. I have elaborated on what the jupyter notebook for semi-automating the retraining process will contain. I have also added how it is going to help the future developers. I hope this makes the proposal more refined. I would love to know your thoughts on the updated proposal.

GSoC application deadline has passed. If you have submitted a proposal on the GSoC program website, please visit https://phabricator.wikimedia.org/project/view/5104/ and then drag your own proposal from the "Backlog" to the "Proposals Submitted" column on the Phabricator workboard. You can continue making changes to this ticket on Phabricator and have discussions with mentors and community members about the project. But, remember that the decision will not be based on the work you did after but during and before the application period. Note: If you have not contacted your mentor(s) before the deadline and have not contributed a code patch before the application deadline, you are unfortunately not eligible. Thank you!

Hello @srishakatux !

This specific project did not require the applicants to submit a code patch before the application deadline. Although it did require them to do research, similar to the project I mentored in Outreachy Round 21. All the submitted proposals are based on the research conducted by the applicant. In addition, all the applicants have been in contact with either Chris or myself and hence all of them are eligible.

@Psingh07 Kindly follow the other instructions that Srishti has given in her previous comment. Thank you.

@Psingh07 We are sorry to say that we could not allocate a slot for you this time. Please do not consider the rejection to be an assessment of your proposal. We received over 100 quality applications, and we could only accept 10 students. We were not able to give all applicants a slot that would have deserved one, and these were some very tough decisions to make. Please know that you are still a valued member of our community and we by no means want to exclude you. Many students who we did not accept in 2020 have become Wikimedia maintainers, contractors and even GSoC students and mentors this year!

Your ideas and contributions to our projects are still welcome! As a next step, you could consider finishing up any pending pull requests or inform us that someone has to take them over. Here is the recommended place for you to get started as a newcomer: https://www.mediawiki.org/wiki/New_Developers.

If you would still be eligible for GSoC next year, we look forward to your participation.