Productionize Wikidata-based Topic Model on ORES
Open, Needs Triage, Public

Description

NOTE: I have closed this task to new contributors as of Thursday 19 March at 10pm UTC. If you have submitted an initial contribution for the task, you may continue to work on the task through the April 7th deadline.

Brief summary

The Wikimedia Foundation runs a service called ORES that hosts machine learning models that can make predictions about various forms of content on Wikipedia -- e.g., the likelihood that a given edit is vandalism or the quality of a Wikipedia article. One of the newer models can label Wikipedia articles with a set of pre-defined topics -- e.g., the English Wikipedia article for sci-fi author N. K. Jemisin is predicted to be part of the following topics:

  • Culture.Biography.Biography* (the article is a biography)
  • Culture.Biography.Women (the biography is about a woman)
  • Culture.Literature (she's an author).

The challenge is that this model only works for English Wikipedia, and while efforts are being made to expand it to more languages, this is difficult. To overcome this challenge, a separate model was developed to make predictions based not on articles but on Wikidata (loosely a database of facts about concepts that have Wikipedia articles -- e.g., the Wikidata item for N.K. Jemisin). This model can be used to generate topic predictions for Wikipedia articles in any language based on its associated Wikidata item (yay!). We have developed an experimental API but this project will rewrite the code for this model so that it works in the production-level ORES environment. While this is primarily an engineering task, there will also be opportunities for machine learning and data science as desired.

Skills required

  • Python coding -- the code for this API will be in Python so at least some prior experience will be necessary.
  • Jupyter Notebooks: this will likely only be used for this initial application but is also a very useful medium for sharing code and analyses. If you have prior experience with Jupyter notebooks, great! If not, we can help you learn it.
  • Basic understanding of machine learning. We can incorporate more or less machine learning work in this project as needed though.

Possible mentor(s)

@Isaac @Halfak

Microtasks

Each applicant will submit a Jupyter notebook that demonstrates an ability to work with the Python code and features that comprise the topic classification model as well as the ability to do some basic model evaluation. Note that unlike many Outreachy tasks, we are not asking each applicant to claim a task but instead to all work independently on the same task. Feel free to help each other out though! This task is described here: T246013

Further Reading

Research paper with more background on topic classification models at Wikimedia: https://dl.acm.org/doi/10.1145/3274290
Overview of existing ORES topic models: https://www.mediawiki.org/wiki/ORES#Topic_routing

Event Timeline

Isaac created this task. Feb 21 2020, 5:09 PM
Restricted Application added a subscriber: Aklapper. Feb 21 2020, 5:09 PM
srishakatux removed a project: Outreachy Mentors.
srishakatux changed the visibility from "Public (No Login Required)" to "Outreachy Mentors (Project)".
srishakatux moved this task from Backlog to Featured projects on the Outreachy (Round 20) board.
Isaac updated the task description. Feb 24 2020, 4:44 PM
srishakatux removed Isaac as the assignee of this task. Feb 24 2020, 11:26 PM
srishakatux changed the visibility from "Outreachy Mentors (Project)" to "Public (No Login Required)". Mar 5 2020, 6:15 PM
Isaac updated the task description. Mar 5 2020, 9:16 PM

Hello! I'm Hammad, a final year undergrad student from Pakistan, and I was wanting to work on this project for my Outreachy application. Having an affinity for Python development and Machine Learning this project seems almost tailor-made for me. I hope to have a great experience working on this and will be bugging people for help all the way (sorry in advance for that xD). I look forward to working with you all!

Hi, I am Soniya Nayak, third year undergrad from IIT(ISM) Dhanbad, India. I have a fair experience with NLP and deep learning and would like to contribute here for Outreachy 2020!

Hello, I am Naila from Pakistan. I have no experience in machine learning but will try my best to contribute.

Hello, I am Ange from Cameroon. I am an Outreachy applicant for 2020. I have basic knowledge of machine learning but am good at Python, and I would like to contribute to this project.

Hello, I am Mugdha, a CSE student from India, and an Outreachy 2020 applicant. I have worked on several NLP based projects in the past 2 years, and am more than excited to be a part of this group to learn and to contribute using my skills!!

Hello, I am Ida from Cameroon. I have been working with Python for quite a while and have very little knowledge of machine learning, but I will give this my best shot.

Isaac added a comment. Mar 6 2020, 2:01 PM

Hi @Hammad.aamer @Soniya51 @NailaNeena @AngelVicks @Mugs-iiit @Idadelveloper! Thanks for the introductions and glad to see the excitement. The subtask listed under "Microtasks" is ready to be worked on. As I noted in the task description, no one person is claiming this task. You will each create a PAWS notebook that you can then link to in your Outreachy contribution. Feel free to ask questions and help each other -- @Halfak and I will jump in when there are bigger questions and to provide guidance as necessary but hopefully you can work out many of the smaller challenges with each other's help.

Isaac added a comment. Mar 6 2020, 6:55 PM

Just an FYI as well: many of you are students or have jobs and thus plan to work on this task during evenings or weekends. That is perfectly fine obviously and we're glad you're able to make the time to apply. When it's outside of my working hours though (weekdays; Central Time UTC-6), feel free to keep asking questions but I just ask for patience as my response time might be slower. Thanks!

Hi everyone,

This is Agha Saad Fraz from Pakistan, pursuing Software Engineering. I am interested in applying to Wikimedia for Outreachy. I have looked through the ideas, and the one that has attracted me the most is:

Productionize API for Wikidata-based Topic Model

I have selected this idea because it is aligned with my interests. I have been developing and working with machine learning models and APIs for more than a year, using Python and its libraries.

P.S. I am going to work on the Outreachy application task :)

Hi! I am Kainaat. I am currently doing B.tech in CSE and I'm interested in applying to Wikimedia for Outreachy. I found this project "Productionize API for Wikidata-based Topic Model" very interesting and would like to apply for this project in my Outreachy application.

Hi, I am Muskan, an undergraduate in Computer Engineering. I am so excited to contribute to this project.

Hello everyone, my name is Catherine Tushabe and I am an Electrical Engineer from Uganda. I am currently pursuing a post-graduate Diploma in Data Science and I would like to contribute to this project.

Hi, I'm Diksha, an outreachy applicant. I have been following Wikimedia for a year and I'm glad to contribute to this project :)

Hey! I am Srija, an undergraduate student at IIIT Delhi. Looking forward to contributing to this project as Outreachy applicant.

Sargamm added a subscriber: Sargamm. Mar 9 2020, 1:51 PM

Hi everyone! I'm Sargam, an Outreachy applicant from IIIT-Delhi. Hoping to be an active contributor here.

Hi everyone!

My name is Daniel, and I am going to contribute to this project as part of my final application to Outreachy. I am a bit nervous, but very excited to get started. I have some experience in Python, and absolutely love Wikipedia (who doesn't!) so I am excited this aligns with my interests in life long learning, and very much look forward to learning and collaborating. I live in Chicago, IL and things are thawing out here, so I will hopefully be able to come out from hibernation and contribute while at my local Starbucks or one of my favorite neighborhood libraries.

Again, very excited to contribute and learn more about ORES!

Hi all!

My name is G.S.S.N.Himabindu, pursuing Computer Engineering at DTU, India. I have some experience with deep learning. This project aligns very well with my interests. Excited to contribute and learn from this project.

Hi all! My name is Shahlo. I am originally from Turkmenistan, but currently live and work in Indianapolis, Indiana (USA). I don't have a degree in Computer Science or any other discipline related to it. I am self-taught in Python and Ruby. I am sure I don't come anywhere close to those who possess wide experience and understanding of the topics we will be contributing to, so I am already feeling a bit intimidated with the assignment. But I am here to learn, so I plan to do my best!

Hello all! I am Kriti from Hyderabad, India and I am a final-year computer science student. I look forward to contributing to the project.

Isaac added a comment. Mar 11 2020, 3:22 PM

Welcome @Sargamm @Dzdaniel @Himabindu @Sseidmed @Kritihyd -- excited to have you and hear about the diversity of skills and background. Good luck and don't hesitate to ask questions and read through others' past questions as you get started!

Hey all! I need someone's help. I am trying to understand how APIs work. Is it better to add parameters as a separate dictionary and pass it as an argument alongside the URL to .get method or pass the parameters within the URL itself? I can screenshot my problem as well if needed. Apologies, I have never worked on open source or even asked for help online.

Isaac added a comment. Mar 12 2020, 2:37 PM

Hey @Sseidmed if you are using a library that accepts a dictionary of parameters, that is usually going to be best as that library will also hopefully take care of encoding and some of the other things that you can do incorrectly if you try to generate the URL yourself. It is often much easier to understand the code as well, as opposed to trying to figure out everything that is included in a URL.
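
To illustrate the point about encoding, here is a minimal sketch using only Python's standard library. The endpoint and parameter values are assumptions chosen for illustration, not taken from the task. With the `requests` library, passing the same dictionary as `requests.get(url, params=params)` should give comparable encoding without building the URL by hand.

```python
from urllib.parse import urlencode

# Illustrative endpoint and parameters (assumptions for this sketch).
API_URL = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "prop": "pageprops",
    "titles": "N. K. Jemisin",  # contains spaces, which must be encoded
    "format": "json",
}


def build_url(base, query_params):
    """Encode a parameter dictionary and append it to the base URL."""
    return base + "?" + urlencode(query_params)


url = build_url(API_URL, params)
print(url)
```

Keeping the parameters in a dictionary also makes it easy to see, at a glance, exactly what is being asked of the API.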

@Isaac Thanks a bunch! I wasn't sure what the routine practice was in implementing parameters. I was also trying to figure out where one gets the revision ID to pass. I know one way is to go to "View History" on wikipedia article's page and find the latest revision ID. Is there any other way to generate it? I am using list=pagepropnames in my API url, but no revision ID shows up in my query.

Dzdaniel added a comment. Edited Mar 12 2020, 11:23 PM

Hey Shahlo! I also do not have a CS degree, but I will definitely plug away using every available resource. Currently YouTube is helping with a tutorial about Jupyter notebooks. In case anyone is interested : https://www.youtube.com/watch?v=HW29067qVWk&t=130s

Cheers!

Hey! I am Larika, an undergraduate student at IIIT Delhi. Looking forward to contributing to this project as an Outreachy applicant.

Dzekem added a subscriber: Dzekem. Mar 13 2020, 1:53 AM

Hi I am Dzekem Christa from Cameroon and I look forward to having a great time during the contribution.

Hi everyone. I am Shailza and I am a second-year undergrad from India. I have worked with Python previously and would like to contribute to this project. Looking forward to learning a lot in the process.

Hi everyone. I am Ubong Joshua from Nigeria. I am a beginner in Python and still developing my skills. I look forward to contributing to this project and learning a lot as well.

Hi, I am Tanvi, a pre-final year undergraduate student from India. Looking forward to contributing to Wikimedia. I have a question, @Isaac: apart from the microtask T246013, how else can we contribute to this project? I am new to open source and Outreachy, so I don't know much about how to make multiple contributions. Thanks in advance.

Srija616 added a comment. Edited Mar 14 2020, 8:29 AM

Hey all! I have a question about the revid_to_qid function. Is the wikibase item the same as the qid?

Also, I was doing the first task, revid_to_qid. Here the rev_id depends on the URL, right? So given a URL, there can be multiple revids and we can select any revid to get the corresponding qid. Here, again, I am unable to get the language. I have attached a screenshot which shows "language" in "general" but I cannot figure out how to extract it.

@Srija616, as mentioned in the instructions, the revision ID is language-specific, so if the revision is from English Wikipedia, the page props API that is called must be the one associated with English Wikipedia. You have to input the language, not the other way around. See def revid_to_qid(revid, lang): language is an argument. The user has to specify it. I hope this solves your issue!!

Isaac added a comment. Mar 15 2020, 3:27 PM

How else we can contribute to this project?

@Tanvi_Singhal Submitting this notebook is the single way to contribute to this project in this application phase. It is not required, but feel free to submit multiple versions of the notebook if you would like to indicate that you're working on the project but continue to expand on your code and analyses.

Apxxxva784 added a subscriber: Apxxxva784. Edited Mar 15 2020, 7:08 PM

Hey all! I'm Apoorva, a second-year undergrad from DTU, India. I have some experience with Python and I'm new to open source. Looking forward to having a great time contributing to this project and learning in the process!

Hey @Tanvi_Singhal Thanks!

Isaac added a comment. Mar 17 2020, 3:22 PM

Hey everyone -- we've got a lot of contributors, which is absolutely fantastic! To keep the support manageable and the pool from growing too large, I'm planning to close this task to new contributors in two days (on Thursday 19 March at 6pm UTC). If you are interested in submitting an application for this task, you'll need to submit at least an initial contribution through the Outreachy portal by then. You'll still be able to work on the task through the April 7th deadline if you submit that initial contribution. Let me know if you have any questions.

Isaac updated the task description. Mar 17 2020, 3:23 PM
Isaac updated the task description. Mar 19 2020, 10:37 PM

Hi all! I am still stuck on the 1st micro task. I've spent some time reviewing the definitions of revid and qid but still I don't understand what I have to do next. If someone could help I would greatly appreciate it.
Thanks in advance

Hey! The revid is a unique ID that identifies a specific revision of a Wikipedia article; the QID is the ID of the corresponding Wikidata item. You basically have to find the QID using the revid. Use the MediaWiki API: send the revid as a parameter, get back a JSON string, and then parse it to find the required QID. I hope this helps :)

Heyy! A revision ID (revid) is uniquely associated with one revision of a Wikipedia article, whereas a QID is associated with the corresponding Wikidata item. You can refer to https://www.mediawiki.org/wiki/API:Pageprops to understand how to use the MediaWiki pageprops API. You can send the revid as a parameter with the request and get the JSON response, which you can parse to obtain the QID.
You can read https://www.wikidata.org/wiki/Wikidata:Introduction for a brief introduction to Wikidata.
Hope this helps.
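
The steps described in the replies above can be sketched roughly as follows. This is a minimal sketch, not the reference solution for the microtask: it assumes the default JSON layout of the MediaWiki query API (pages keyed by page ID), the helper name `extract_qid` and the User-Agent string are made up for illustration, and the sample response is hand-written in the expected shape rather than real API output.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def extract_qid(response):
    """Return the QID from a parsed pageprops response, or None if absent."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        qid = page.get("pageprops", {}).get("wikibase_item")
        if qid is not None:
            return qid
    return None


def revid_to_qid(revid, lang):
    """Ask the given language's Wikipedia which Wikidata item a revision belongs to."""
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "revids": revid,
        "format": "json",
    }
    # The revision ID is language-specific, so the host depends on `lang`.
    url = "https://{}.wikipedia.org/w/api.php?{}".format(lang, urlencode(params))
    # Placeholder User-Agent; replace with something that identifies your tool.
    request = Request(url, headers={"User-Agent": "revid-to-qid-sketch/0.1"})
    with urlopen(request, timeout=10) as resp:
        return extract_qid(json.load(resp))


# Hand-written sample in the shape of a pageprops response (not real API output).
sample = {
    "query": {
        "pages": {
            "123": {
                "pageid": 123,
                "title": "Example",
                "pageprops": {"wikibase_item": "Q42"},
            }
        }
    }
}
print(extract_qid(sample))
```

Separating the parsing from the network call makes it possible to check the parsing logic against a saved response without hitting the API each time.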

Sumit added a subscriber: Sumit. May 7 2020, 6:51 AM

I'm curious to know what kind of features is the Wikidata topic models API using? Is it the same features as the original topic model developed for English Wikipedia or something different?

Halfak added a comment. May 7 2020, 8:49 PM

The gist is to convert a wikidata item to a sequence of properties and values. E.g.

P31 Q3
P21 Q6581097
...

Then we learn word embeddings by using the properties and values as "words". Otherwise we follow a similar process for building the Wikidata topic model.
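
As a rough illustration of the conversion described above, the sketch below flattens the `claims` portion of a Wikidata item into a sequence of property and value tokens. It is not the model's actual feature extraction code, which is not shown in this thread. It assumes the standard Wikidata JSON layout (`mainsnak` / `datavalue` / `value` / `id`), and the choice to emit only the property for statements whose value is not an item is an assumption of this sketch. The sample reuses the two property-value pairs from the example above.

```python
def item_to_tokens(claims):
    """Flatten Wikidata claims into a list of property and value tokens."""
    tokens = []
    for prop, statements in claims.items():
        for statement in statements:
            value = statement.get("mainsnak", {}).get("datavalue", {}).get("value")
            tokens.append(prop)
            # Only item-valued statements carry a QID; for other value types
            # (strings, dates, ...) this sketch emits the property alone.
            if isinstance(value, dict) and "id" in value:
                tokens.append(value["id"])
    return tokens


def make_statement(prop, qid):
    """Build a minimal item-valued statement in the Wikidata JSON shape."""
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": prop,
            "datavalue": {
                "value": {"entity-type": "item", "id": qid},
                "type": "wikibase-entityid",
            },
        }
    }


# Hand-built claims mirroring the two pairs in the example above.
claims = {
    "P31": [make_statement("P31", "Q3")],
    "P21": [make_statement("P21", "Q6581097")],
}
tokens = item_to_tokens(claims)
print(" ".join(tokens))
```

A token sequence like this could then be fed to any embedding method that treats tokens as "words"; which method the topic model uses is not specified here.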