Proposal (GSoC 2020): Implement articlequality and draftquality model for ptwiki and apply insights to models for bs, uk, hi wikis
Closed, Resolved (Public)

Description

Profile Information

Name: Chaitanya Mittal
IRC nickname on Freenode: chtnnh
Web Profile: https://www.github.com/chtnnh
Resume


Location: Dubai, AE
Typical working hours: 18:00 - 02:00 (UTC+4)

Synopsis

  • Short summary describing your project and how it will benefit Wikimedia projects

The automatic classification system currently in place on ptwiki is naive: it checks a few if-conditions and assigns each article to one of 6 target labels, 2 of which require editor approval. This project will replace it with an ‘articlequality’ model that automatically labels articles by quality and a ‘draftquality’ model that filters out drafts that are spam and/or vandalism.

This proposal elaborates on implementing the ‘articlequality’ and ‘draftquality’ models for the Portuguese wiki, following a design like that of the English wiki models, which is based largely on the work done by Morten Warncke-Wang et al.

Such an implementation would require feature extraction from ptwiki, training various models on these features and fitness testing these models to find the best fit.
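
The three steps named above (feature extraction, training several candidate models, and testing them to pick the best fit) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the actual ORES or articlequality code. The feature set, the toy wikitext samples, the "stub"/"good" labels, and the two deliberately simple candidate models are all assumptions made up for the example.

```python
import re
from collections import Counter

# Hypothetical feature set; the real ptwiki features would come out of the research phase.
FEATURES = ("chars", "refs", "headings", "wikilinks")

def extract_features(wikitext):
    """Turn raw wikitext into a small dict of numeric features."""
    return {
        "chars": len(wikitext),
        "refs": len(re.findall(r"<ref[\s>]", wikitext)),
        "headings": len(re.findall(r"^==+.*==+\s*$", wikitext, flags=re.M)),
        "wikilinks": len(re.findall(r"\[\[", wikitext)),
    }

def vectorize(wikitext):
    """Feature dict -> list of floats in a fixed order."""
    feats = extract_features(wikitext)
    return [float(feats[name]) for name in FEATURES]

class MajorityModel:
    """Baseline candidate: always predicts the most common training label."""
    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.label

class NearestCentroidModel:
    """Candidate: predicts the label whose mean feature vector is closest."""
    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for i, value in enumerate(x):
                acc[i] += value
        self.centroids = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, x):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(x, centroid))
        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))

def accuracy(model, X, y):
    """Fraction of held-out examples the model labels correctly."""
    return sum(model.predict(x) == target for x, target in zip(X, y)) / len(y)

def select_best(candidates, X_train, y_train, X_test, y_test):
    """Train every candidate and return the name of the best one plus all scores."""
    scores = {
        name: accuracy(cls().fit(X_train, y_train), X_test, y_test)
        for name, cls in candidates.items()
    }
    return max(scores, key=scores.get), scores

# Toy data, invented for this sketch only.
paragraph = "Texto com [[ligação]].<ref>Fonte</ref>\n"

def long_article(n):
    return "== Seção ==\n" + paragraph * n

train = [("Curto.", "stub"), ("Pequeno texto.", "stub"), ("Outro esboço [[a]].", "stub"),
         (long_article(5), "good"), (long_article(8), "good")]
test = [("Mínimo.", "stub"), (long_article(6), "good")]

X_train = [vectorize(text) for text, _ in train]
y_train = [label for _, label in train]
X_test = [vectorize(text) for text, _ in test]
y_test = [label for _, label in test]

candidates = {"majority": MajorityModel, "centroid": NearestCentroidModel}
best, scores = select_best(candidates, X_train, y_train, X_test, y_test)
print(best, scores)
```

A real comparison would use far more labeled revisions, scaled features, and cross-validation rather than a single held-out split; the sketch only shows the shape of the extract, train, and compare loop.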

The immediate use cases of this model would be:

  1. Help increase the quality of automated article classification for ptwiki
  2. Streamline work for editors on ptwiki with respect to finalizing articles that need expansion, improvements or articles that can be featured.

The implementation would also pave the way for further work to be done in automating various wiki tasks for ptwiki.

Deliverables

| Days/Dates | Milestone/Deadline/Subtask Accomplished |
| --- | --- |
| Apr 27 - May 17 | Community bonding period: spend time interacting with analytics team at Wikimedia, understand common practices and norms |
| May 18 - May 24 | Preliminary research on features to be extracted from ptwikis |
| May 25 - May 31 | Completion and Integration of extractors for ptwiki |
| Jun 1 - Jun 7 | Testing for Extractors and Implementation of feature_lists |
| Jun 8 - Jun 14 | Testing feature_lists |
| Jun 15 - Jun 19 | Phase 1 Evaluations |
| Jun 22 - Jun 28 | Research various models for implementing articlequality |
| Jun 29 - Jul 5 | Implement top few models to benchmark performance |
| Jul 6 - Jul 12 | Testing and Implementation of top few models |
| Jul 13 - Jul 17 | Phase 2 Evaluations |
| Jul 20 - Jul 26 | Selection of top performing model |
| Jul 27 - Aug 2 | Streamlining footprint of selected model |
| Aug 3 - Aug 9 | Streamlining selected model and completing subtasks. Documenting the process and model for future reference in ORES engineering |
| Aug 10 - Aug 24 | Final Evaluation |

In addition to code, I plan to start a blog on my portfolio website where I will write about my work on this project once every two weeks. This will help with documentation and also give Wikimedia AI projects some exposure.

Participation

I plan to communicate mainly through five channels: Phabricator for documented information, IRC for general queries, Zulip for task-specific queries, and email and team meetings for official communication regarding progress.

As far as source code is concerned, I have learnt that the best way to share code is through commits. But in cases where this is not the best option, services like https://codeshare.io could be handy.

About Me

Hi! I am Chaitanya Mittal, an undergrad in Computer Science and Engineering currently in my first year. I am an algorithmic coder and machine learning enthusiast. I have the distinction of qualifying for the Asia Regionals of the ACM ICPC 2018. I have worked with the Mozilla Foundation and the Mifos Foundation previously, though only for a short period of time. I am an open source enthusiast and truly believe in the power it holds to influence the world.

In particular though, I have fallen in love with Wikimedia's vision, "Imagine a world where we can all share freely in the sum of all knowledge" and the fact that it stays true to that. In the spirit of free knowledge and collaborative code, I believe Wikimedia leads by example.

The time frame for the project is from June to August. My summer break runs from July until the end of August. I will have only minor college engagements during the first two weeks of the project, and I will strive not to let them affect my enthusiasm towards the project in any way.

This proposal has been selected for GSoC 2020

What does making this project happen mean to you?

Having relied on Wikimedia since childhood, without even realizing it, I understand the role that Wikimedia plays and has been playing in shaping how knowledge is shared around the world. The successful completion of this project would directly improve wiki quality for a language with more than 200 million native speakers. To be able to make a small difference in how 200 million people access knowledge would mean the world to me.

It would help a 19-year-old realize that collaboration can lead to great things. This is what making this project happen means to me.

Past Experience

Having actively worked in open source for a year now, I have looked for a welcoming community working towards a cause I could relate to. In this process, I have encountered multiple projects (Mozilla, Mifos), developers and tasks. Although it is difficult for me to describe this experience quantitatively, I can affirm that it has helped me become a better developer. I have helped with some tasks here in the Wikimedia community as well!

T245068 is the first task that I have completed.
T246438, T246663 are tasks I am currently working on with @Halfak and have made significant progress in, as of the writing of this proposal.

At a personal level, I actively program competitively and keep myself up to date on the latest machine learning algorithms being developed. I love both Python and C, although competitive programming does make me use C++ quite often. I am a native Linux and Bash user and prefer coding in vim or Visual Studio Code.

Any Other Info

References: T246663

Related Projects/Microtasks:

  • T246438 could be used as a microtask and the implemented features for text complexity can be utilized for all wikis instead of just enwiki.
  • Convert all extractors for various wikis to generators to handle 0 or more labels per template (currently all expect only 1 label per template)
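
The second item above, turning extractors into generators so that one template can produce zero or more labels, might look roughly like this. It is a sketch only: the template name `Qualidade`, the label names "1" to "6", the regex, and the function name `extract_labels` are hypothetical and are not taken from the existing extractor code.

```python
import re

# Hypothetical template name and label set, chosen only for this illustration.
TEMPLATE_RE = re.compile(r"\{\{\s*Qualidade\s*((?:\|[^{}|]*)*)\}\}", re.IGNORECASE)
VALID_LABELS = {"1", "2", "3", "4", "5", "6"}

def extract_labels(wikitext):
    """Yield every valid quality label found in every matching template.

    A template without a valid parameter yields nothing, and a template
    with several valid parameters yields several labels, so callers no
    longer have to assume exactly one label per template.
    """
    for match in TEMPLATE_RE.finditer(wikitext):
        for param in match.group(1).split("|"):
            value = param.strip()
            if value in VALID_LABELS:
                yield value

# The caller decides how to consume the stream of labels.
print(list(extract_labels("{{Qualidade}}")))
print(list(extract_labels("{{Qualidade|2|3}}")))
print(list(extract_labels("{{Qualidade|x}} texto {{qualidade| 4 }}")))
```

Because the function is a generator, code that previously expected a single return value would wrap the call in `list(...)` or iterate over it, and would handle the empty case explicitly.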

Relevant Links:

This is the original proposal for the Google Summer of Code 2020 and the scope of the project has expanded. The final scope will be included in the reports that follow.

Event Timeline

We've already started the first steps here. @GoEThe, would you be interested in co-mentoring this project?

By "we" let me be clear that @Chtnnh has already started to pick up the preliminary work for this task.

Sure, I would be happy to.

edit: I guess, before signing up, I should know what is the time commitment for this.

Thank you! Typically, mentors are expected to spend 4-5 hours per week for each student. :)

You may also refer to the following:

Chtnnh renamed this task from Proposal (GSoC / Outreachy 2020): Implement articlequality model for ptwiki to Proposal (GSoC 2020): Implement articlequality model for ptwiki. Mar 18 2020, 5:06 PM

@GoEThe Do you think you would be able to commit to mentoring this project? I am asking because I am expected to submit names of potential mentors in my final proposal.

Thank you so much!

Hi, sorry. Things are a bit unstable at the moment. I don't think I can commit for that amount of time.

That's alright 😄

@srishakatux @Pavithraes Do you have any suggestions for me?

@GoEThe, could you recommend anyone else from ptwiki who might be able to help us understand the language and community needs?

@Darwinius said that he might have some time to help. And of course I can answer some questions as they appear, if time is not critical.

@Darwinius Hello! Do you think you would be able to help us out with this proposal?

Answering questions with a 1-2 day lag would be perfectly acceptable. I think we'll primarily need support for local ptwiki and Portuguese language stuff. E.g. we're working on gathering data for the "drafttopic" model now and we'd like to have you check our assumptions on how we're interpreting the meaning of ER6 and ER20 deletion reasons. I expect to see more of that and maybe some help testing the models once we're ready to serve you predictions about some articles/drafts.

@Chtnnh hello! Yes, I hope so. What should I do? How can I help?

Hello Darwin! @Halfak and I would like to submit this proposal to the coming Google Summer of Code program and require a second mentor from the Portuguese wiki community. @GoEThe suggested your name to us. What we would need from you is about 4 hours a week to answer some questions about ptwiki and check any assumptions we may be making while developing this model. We would also require your assistance in testing the models once we're ready. Do you think it would be possible for you to commit your time to this?

The program lasts from June until August.

Chtnnh renamed this task from Proposal (GSoC 2020): Implement articlequality model for ptwiki to Proposal (GSoC 2020): Implement articlequality and draftquality model for ptwiki and apply insights to models for bs, uk, hi wikis. May 7 2020, 11:10 AM
Chtnnh set Due Date to Aug 23 2020, 8:00 PM.
Chtnnh updated the task description.

@Darwinius Hey Darwin! Can we find another way to collaborate and discuss the task? Something like IRC, where you can find me hanging out in the #wikimedia-ai channel under the nick chtnnh. I am also open to anything else that works for you.

Google-Summer-of-Code (2020) is over! I believe you have already documented your project here https://www.mediawiki.org/wiki/Google_Summer_of_Code/Past_projects#2020. If not, I would encourage you to do so. Also, is there anything else remaining in this task to address? If not, please consider closing this task as resolved.

@Chtnnh: Ping. Can you please answer?

@Chtnnh: Hi, could you please answer the last question?

Sorry for the delay in marking the task as resolved.

We have successfully built and deployed articlequality and draftquality models for the Portuguese Wikipedia and had begun work on the models for the Ukrainian and Hindi wikis.

Due to a change in long-term plans with the ML team, the Ukrainian and Hindi wiki models have been put on the backlog for the foreseeable future.

@Chtnnh is now focusing energies towards helping the ML team get Lift Wing to production. Uk and Hi wiki models will be further developed after that goal has been achieved.

@Chtnnh: Thanks for the update, and your work! :)