
Outreachy Application Task: Simple example of topic classification
Closed, Resolved · Public

Authored By Isaac, Feb 24 2020, 4:43 PM

Description

NOTE: I have closed this task to new contributors as of Thursday 19 March at 10pm UTC. If you have submitted an initial contribution for the task, you may continue to work on the task through the April 7th deadline.

Overview

Create your own PAWS notebook (see set-up below) that completes the template provided in this notebook (https://paws-public.wmflabs.org/paws-public/User:Isaac_(WMF)/Outreachy_Example_Wikidata-based_Topic_Classification.ipynb). The full Outreachy project will involve more comprehensive coding than what is being asked for here (and some opportunities for additional explorations as desired), but this task will introduce some of the steps, APIs, and concepts that will be used for that full task.

Set-up

  • Make sure that you can log in to the PAWS service with your wiki account: https://paws.wmflabs.org/paws/hub
  • Using this notebook as a starting point, create your own notebook and complete the functions / analyses. All PAWS notebooks have the option of generating a public link, which can be shared back so that we can evaluate what you did. Use a mixture of code cells and markdown to document what you find and your thoughts.
  • As you have questions, feel free to add comments to this task (and answer other people's questions if you can help).

Event Timeline


@Idadelveloper Could you provide details of how to reproduce that error, or share a screenshot of it? You usually get this error when you try to retrieve data that does not exist. Maybe check the JSON structure?

I figured out the cause of the JSON decode error: it turns out the dictionary of my properties is empty. I do not know why.

@Idadelveloper Wikidata items don't always have values for everything. You'll need to accommodate such cases.
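A defensive pattern for this might look like the sketch below. The nested structure mirrors the entity JSON returned by the Wikidata API, but the miniature `entity` dict and the helper `get_value_ids` are made up for illustration, not part of the task template:

```python
# Hypothetical sketch: defensively reading claims from a Wikidata entity dict.
# The sample data is fabricated; real entity JSON is much larger.
entity = {
    "claims": {
        "P31": [
            {"mainsnak": {"snaktype": "value",
                          "datavalue": {"value": {"id": "Q5"}}}}
        ],
        "P570": [
            {"mainsnak": {"snaktype": "somevalue"}}  # statement with no datavalue
        ],
    }
}

def get_value_ids(entity, prop):
    """Return the QIDs for a property, skipping statements without a value."""
    ids = []
    for statement in entity.get("claims", {}).get(prop, []):
        mainsnak = statement.get("mainsnak", {})
        datavalue = mainsnak.get("datavalue")  # may be absent entirely
        if datavalue is not None:
            value = datavalue.get("value")
            if isinstance(value, dict) and "id" in value:
                ids.append(value["id"])
    return ids

print(get_value_ids(entity, "P31"))   # ['Q5']
print(get_value_ids(entity, "P570"))  # [] -- statement exists but has no value
print(get_value_ids(entity, "P21"))   # [] -- property absent entirely
```

Using `.get()` with a default instead of square brackets means a missing key returns an empty result rather than raising a `KeyError`.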

Hello... How would we know if the values we get after computing the cosine similarity are accurate representations of similarity? Also, can we try comparing it using other methods of measuring similarity?

I agree it gives a relative comparison, but is there any dataset of wikipedia articles that are grouped by similarity? If there isn't any, can we try to run this code on a variety of articles and perform some form of similarity-based clustering to come up with the grouping? I looked up for such datasets but couldn't find any.

> Hello.... How would we know if the values we get after performing the cosine similarity are accurate representations of similarity? Also can we try and compare it using any other methods of measuring similarity?

You could definitely try different distance metrics and evaluate what differences if any you observe between metric results.

> I agree it gives a relative comparison, but is there any dataset of wikipedia articles that are grouped by similarity? If there isn't any, can we try to run this code on a variety of articles and perform some form of similarity-based clustering to come up with the grouping? I looked up for such datasets but couldn't find any.

That's interesting! Will try checking that out too

Hello guys. I clicked on the 'Public Link' of my notebook to generate the public link. That should be the last step in finishing the microtask, right?

Hello...is there only one task for us to carry out?

Hi, where do we have to submit this task for review?

Thanks all for continuing to support each other over the weekend while I was away.

> Hello guys. I clicked on the 'Public Link' of my notebook to generate the public link. That should be the last step in finishing the microtask right?

> Hi, where we have to submit this task for any reviews?

@Idadelveloper @Dikshagupta99 You can include the public link when you record your contribution on the Outreachy site. If you have specific questions, I and others can help, but unfortunately we don't have the capacity to review all submissions in full and provide feedback ahead of the deadline.

> Hello...is there only one task for us to carry out?

@Idadelveloper yes, just this task. The last component is relatively open-ended though, so feel free to expand it as you see fit.

I'd already completed its microtask but when I headed over to gerrit I couldn't find any issues relevant to python programming (though I could just be missing them)

@Hammad.aamer we have not set up any tasks yet on Gerrit for this project so you weren't missing anything. The project itself will require extensive coding and getting to know how the ORES python framework works but you don't need to learn about that yet.

Has anyone completed the second function? I'm facing some problems.

image.png (1×1 px, 295 KB)

How do I fix this error? I'm trying to extract data using the MediaWiki API.

> how to fix this error? I'm trying to extract data using MediaWiki API.

@Dikshagupta99 looks like the variable p in your code is a list, not a dictionary, so you can't access it with p[i]['mainsnak']: list indices must be integers, not strings like 'mainsnak'.

I'd suggest you examine the type of the variables you're using as well as the json schema and proceed accordingly.
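To illustrate with a hypothetical fragment of such a response (the snippet below is fabricated; real Wikidata claims are far larger, but the list-vs-dict pitfall is the same):

```python
import json

# In the entity JSON, the value under each property in 'claims' is a *list*
# of statements, so a string key fails until you index into the list first.
snippet = json.loads('{"claims": {"P31": [{"mainsnak": {"snaktype": "value"}}]}}')

p = snippet["claims"]["P31"]
print(type(p))            # <class 'list'> -- a list, not a dict
# p["mainsnak"]           # TypeError: list indices must be integers or slices
print(p[0]["mainsnak"])   # index into the list first, then use the string key
```

Printing `type(...)` (or just the raw JSON) at each level is usually the fastest way to see where the schema differs from what you assumed.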

Is the output of function 2 corresponding to 'Q2427544' correct?

image.png (172×1 px, 37 KB)

> Is the output of function 2 corresponding to 'Q2427544' correct?
>
> image.png (172×1 px, 37 KB)

Hi, the task also says that multiple values of a property should be represented like this: (P1412, Q7976), (P1412, Q1860)

In the second function, to gather Wikidata claims for a Wikidata item, do we also need to include tuples of properties that appear under another property? For example, in https://www.wikidata.org/wiki/Q2427544 the property 'point in time' appears under 'Hugo Award for Best Novel', which is the value for the property 'award received'.

> In the second function to gather wikiData claims for a wikiData item, do we need to also include tuples of properties that appear under another property? for eg- in https://www.wikidata.org/wiki/Q2427544 the property, 'point in time' that appears under 'Hugo Award for Best Novel' which is the value for the property, 'award received'

@Sargamm nope -- just the top-level properties and values are required. You can ignore properties/values under references or other qualifiers like the award received example you gave. Good question.
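For illustration, a hedged sketch of what "top-level only" might look like, emitting one (property, value) tuple per value so a repeated property appears multiple times, e.g. (P1412, Q7976), (P1412, Q1860). The `entity` dict and the helper `top_level_claims` are fabricated miniatures, not real data for Q2427544:

```python
# Sketch of "function 2": collect top-level (property, value) tuples from an
# entity's claims. Qualifiers and references are deliberately ignored.
entity = {
    "claims": {
        "P1412": [
            {"mainsnak": {"datavalue": {"value": {"id": "Q7976"}}}},
            {"mainsnak": {"datavalue": {"value": {"id": "Q1860"}}}},
        ],
        "P166": [
            {"mainsnak": {"datavalue": {"value": {"id": "Q255032"}}},
             "qualifiers": {"P585": []}},  # qualifier: skipped on purpose
        ],
    }
}

def top_level_claims(entity):
    """One (P, Q) tuple per top-level value; qualifiers/references ignored."""
    tuples = []
    for prop, statements in entity.get("claims", {}).items():
        for statement in statements:
            datavalue = statement.get("mainsnak", {}).get("datavalue")
            if datavalue and isinstance(datavalue.get("value"), dict):
                qid = datavalue["value"].get("id")
                if qid:
                    tuples.append((prop, qid))
    return tuples

print(top_level_claims(entity))
# [('P1412', 'Q7976'), ('P1412', 'Q1860'), ('P166', 'Q255032')]
```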

The Outreachy application asks us to fill a timeline for the internship. How do we fill that out?

> The Outreachy application asks us to fill a timeline for the internship. How do we fill that out?

@Mugs-iiit Yeah, I don't expect that any of you will have the information that would be required to fill that out, so I won't be judging you on it. I'm much more interested in the notebook code and the analysis on similarity at the end. Regarding the timeline, it's fair to expect a few weeks to understand how ORES is set up and plan the work, followed by several weeks of implementation. That still leaves a few weeks at the end that are more flexible. I'd ask that you use that question to give us a better idea of how you would want to use that time -- e.g., whether you'd be more interested in optimizing the engineering, improving the actual machine learning model, doing data science work to evaluate the model's performance, or something else.

How do I get page properties or even the page URL using revision ID?

> Make sure that you can login to the PAWS service with your wiki account: https://paws.wmflabs.org/paws/hub

Hey, I am not able to login to the PAWS service with my wiki account. I tried the option for resetting password but I didn't receive any mail regarding it. What should I do? Is it really required to complete the task?

In the third function, to convert the claims in a wikiData item into a document embedding, what exactly is the embeddings argument being passed to the function? how are the corresponding embeddings of the properties being accessed?

> How do I get page properties or even the page URL using revision ID?

You could start by understanding the sample code given and how the requests.get() method works. You may use the MediaWiki API, which accepts a revision ID as a parameter. The JSON corresponding to the URL is returned by using the get() method followed by json(); you can further manipulate it to get the desired result.
Hope this helps!
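As a sketch of the steps above (no network call is made here; the parameter names are from the standard MediaWiki action API, but the helper `revid_query_params` is hypothetical):

```python
from urllib.parse import urlencode

# Build a MediaWiki API request that looks up page info (including the page
# URL) from a revision ID. In the notebook you would pass `params` to
# requests.get(API_URL, params=params) and then call .json() on the response.
API_URL = "https://en.wikipedia.org/w/api.php"

def revid_query_params(revid):
    """Parameters asking for page info, including the full URL, for a revid."""
    return {
        "action": "query",
        "prop": "info",
        "inprop": "url",   # include the full page URL in the response
        "revids": revid,
        "format": "json",
    }

params = revid_query_params(979988901)
print(API_URL + "?" + urlencode(params))
# data = requests.get(API_URL, params=params).json()
# The page info then lives under data['query']['pages'].
```

The revision ID `979988901` is just a placeholder; substitute any oldid from an article's history page.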

> Hey, I am not able to login to the PAWS service with my wiki account.

@Pihu98: Hi, what happens at which steps? Are you logged in on https://meta.wikimedia.org/ ? Please always provide clear steps and error messages. Thanks a lot! :)

Hi @Aklapper !
Yes, I am logged in to https://meta.wikimedia.org/. When I try to login to PAWS service, It says:
"Incorrect username or password entered. Please try again. Usernames are case-sensitive. See phab:T165795 for more details."

Though the username and password I entered were correct, I still tried the "Reset Password" option. But I didn't get any mail about resetting my password.

Thanks for your time! :)

Hi, Can someone please provide me with some revids. I want to check if my function1 is working fine with all revids or not.

This comment was removed by Sargamm.

> Hi, Can someone please provide me with some revids. I want to check if my function1 is working fine with all revids or not.

Hi, you could search for the same topic on both Wikipedia and Wikidata, and switch to their permanent links if required (provided in the left column); the URLs will end with their respective revids and QIDs.

@Pihu98 Sorry to hear about the challenges with PAWS. If you are still having trouble getting it to work, if you can get a local installation of Jupyter notebooks working, you are welcome to work on the task there and then upload the notebook to Github or some other hosting service and provide a link to the notebook there in your contribution.

> Hi, Can someone please provide me with some revids. I want to check if my function1 is working fine with all revids or not.

@Dikshagupta99 you can also go to the history tab for any Wikipedia article and if you click on the date associated with any of the edits in the history, you'll see a URL parameter called oldid=. The number there is a revision ID. See: https://en.wikipedia.org/wiki/Wikipedia:Revision_id

Hi everyone! I am Janvi Talreja, an Outreachy 2020 applicant, and I would like to work on this project! I have experience working with machine learning models and I also know NLP. Excited to work with this community.

And if I have completed this task and recorded contributions too, do I have to submit my contributions for review now, or after the application period is over?

What exactly is the datatype of embeddings in task 3? Is it an array of tuples, or something else? How do I access the corresponding embedding of a particular property?

> what exactly is the datatype of embeddings in task 3? Is it an array of tuples or what? How to access the corresponding embeddings of a particular property?

It's a NumPy array. You can access it like this:

model = fasttext.load_model('model.bin')
type(model.get_word_vector('P123'))  # numpy.ndarray

Could someone help me with this error?

Screenshot from 2020-03-12 15-55-01.png (1×2 px, 442 KB)

> Could someone help me with this error?

@Dibyaaaaax have you made sure that you have the model.bin file available locally in your PAWS instance? fastText does not have particularly descriptive errors unfortunately, but from experience, that is usually the problem. If you don't have the model.bin file, there are instructions in the template I provided for this task that tell you how to get access to it.

> And If I have completed this task and recorded contributions too. Do I have to submit my contributions for review now only or after the application period gets over?

@Dikshagupta99 unfortunately I don't have time to review everyone's full notebook and provide feedback. Feel free to continue to ask specific questions though. You can record contributions at any moment (and I encourage that because it helps me to see how many interested applicants there are) and when the application period is over, I will be reviewing the final submissions.

@Isaac can we contribute in any other way (other than the PAWS notebook) during the contribution period?
Or are we to record different contributions for each update to this notebook itself?

This comment was removed by Sargamm.

> can we contribute in any other way(other than the paws notebook) during the contribution period? Or are we to record different contributions for each upgrade on this notebook itself?

@Soniya51 the PAWS notebook is the only way to contribute for this project during this application period. Feel free to record multiple contributions as you go. The last part is intended to be rather open-ended.

@Dikshagupta99 can you share a screenshot of what comes up after opening the link? Also, you could try downloading VLC player, as it may support it.

Capture.JPG (602×1 px, 75 KB)

After opening the link, this is what comes up:

image.png (685×1 px, 118 KB)

And even after opening the file via VLC player, this error comes up:
image.png (496×717 px, 31 KB)

Hi @Dikshagupta99 . Two things:

  • It looks like you were able to successfully download the file to your computer? If that's the case, it is a binary file that can be read by the fastText library. An example of how to do this is in the PAWS notebook I created for this task. No other software on your computer is going to be able to interpret it though so don't worry if your computer cannot identify what type of file it is.
  • If you are having trouble even downloading the file, you can use this Google-Drive-hosted version as well (same file, just different location): https://drive.google.com/file/d/1YAniioZAtMHMMRWbA7HrbuSZVUi5QEpQ/view?usp=sharing

Hey @Tanvi_Singhal, great work! As far as I know from working on this project, cosine similarity values tell you how similar two documents are to each other. For example, I read the articles https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_the_United_States and https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Poland. If you noticed, by the end of 15th March Poland counted approx 120-130 coronavirus cases, whereas the USA had only 22-30.
So I think the result coming out is fine. I'm just telling you my point of view; I'm not sure though.
Could this be the possible reason, @Isaac?

Hi @Tanvi_Singhal @Dikshagupta99, I think you're right: the cosine similarity is very close to 1, and a possible reason is that the two entities share the same data attributes (as you mentioned). I was also checking similarity between document embeddings. For example, I compared Greenland with a variety of other entities, ranging from some unrelated politicians and Greenland's Prime Minister to its continent (North America), a neighboring country (Canada), and its capital city (Nuuk). What I found was that even though Greenland's PM and Greenland are somewhat related, their cosine similarity was just 0.366, while the cosine similarity with its continent was 0.726, with its neighboring country 0.923, and with its capital 0.955. With some other country in another continent, like Tanzania, the cosine similarity was 0.76. So what I conclude is that similar attributes lead to higher cosine similarity: all of these were geographic regions, which was not the case when the country was compared with personalities.

Am I going right @Isaac ?

Hi, can someone please help me fix this issue?

image.png (638×1 px, 75 KB)

and is the fasttext module successfully installed?
image.png (580×1 px, 86 KB)

Hey, sorry for deleting the comment. I thought that was part of the task, so we had to figure it out on our own. Also, @Dikshagupta99, Poland and the USA are highly related to each other, as their value is close to 1. I am resharing the data.

#comparing three pages related to Corona Virus
https://www.wikidata.org/wiki/Q87250695 #poland
https://www.wikidata.org/wiki/Q86694873 #Africa
https://www.wikidata.org/wiki/Q83873577 #USA

  1. cosine of Poland and Africa: 0.19329276862062347
  2. cosine of Africa and USA: 0.16823471008804855
  3. cosine of Poland and USA: 0.9739121575816426

My doubt is why Africa's data differs from that of the other two.

@Sargamm Nice observation!! As @Dikshagupta99 mentioned, one of the attributes could be the death toll. Let's see what other attributes made the value for Africa < 0.2 in both cases.

Hi @Tanvi_Singhal @Sargamm will you please help in fixing the error above?

Yes, @Dikshagupta99 I think fasttext module is successfully installed.

You can try uploading the file to the PAWS directory. Hope it will help! (screenshot: Capture.JPG)

Ohh okay, thank you :)
Actually, it was working when I opened it in Spyder; only in Jupyter was I facing this problem. It is sorted now. Thanks!

@Dikshagupta99 @Tanvi_Singhal @Sargamm glad to see you applying this to a topic of interest! When trying to understand why the cosine similarity is high or low between two entities, remember to think about what properties/values are defining a given Wikidata item's document embedding. Sometimes the document embeddings will reflect what you expect and sometimes they won't, but you should be able to mostly figure out why from the Wikidata items.

@Tanvi_Singhal the probable reason for the difference in Africa's data could be that Africa is a continent while the USA and Poland are countries, so the properties and their values for Africa differ more from those of the USA and Poland than the two countries' do from each other. This is what you mean, right @Isaac?

> and is the fasttext module successfully installed?

@Dikshagupta99: Hi, please always also mention what you already tried to fix a problem, before asking here. There are many results for fasttext "error: invalid command bdist_wheel" in internet search engines. Please see also Feedback,_questions_and_support. Thanks! :)

Hey all! I seem to be getting the hang of this project and the functions, with trial and error. In the second function, we are looking to have properties and values as outputs. What exactly do those properties and values describe about a particular page/article? What kind of information do they give us?

> Hey all! I seem to be getting a hang of this project and the functions, with trial and error. In the second function, we are looking to have properties and values as outputs. What exactly do those properties and values describe about a particular page/article? What kind of information do they give us?

Hey! On the Wikidata page for a particular topic, there is a subheading 'Statements'; under it, the left column holds the properties and the right column the corresponding values. You can hover over the hyperlinks to get the P codes and the Q codes (QIDs); if a value does not show a Q code, then the corresponding Wikidata page does not exist. Hope this helps :)

> Hey all! I seem to be getting a hang of this project and the functions, with trial and error. In the second function, we are looking to have properties and values as outputs. What exactly do those properties and values describe about a particular page/article? What kind of information do they give us?
>
> Hey! on the wikidata page for a particular topic, there will be subheading 'Statements', under that the left column are the properties and the right columns are the corresponding values. you can hover over the hyperlinks to get the P codes and the Q codes (qid), if some value does not show a Q code, then the corresponding wikidata page does not exist. Hope this helps :)

@Sargamm Thanks! But that's not what I am asking. I know how to find those values and properties through the function. I was only wondering what those properties and values stand for.

Hey everyone -- we've got a lot of contributors, which is absolutely fantastic! To keep the support manageable and the pool from growing too large, I'm planning to close this task to new contributors in two days (on Thursday 19 March at 6pm UTC). If you are interested in submitting an application for this task, you'll need to submit at least an initial contribution through the Outreachy portal by then. You'll still be able to work on the task through the April 7th deadline if you submit that initial contribution. Let me know if you have any questions.

> @Sargamm Thanks! But that's not what I am asking. I know how to find those values and properties through the function. I was only wondering what those properties and values stand for

To elaborate on what @Sseidmed said:
If you open any Wikidata item page, you'll find a property-value list called "Statements". The properties can include use, image, subclass of, etc.; i.e., they are used to provide information about an item and have corresponding values.
Properties are denoted by P# (e.g., image is P18) and values are denoted by Q#.
For example, on the Wikidata page for 'Human', you can find a property 'studied by' (P2579) which has the values 'anthropology' (Q23404) and 'human ecology' (Q720858), which suggests that humans are studied by anthropologists and human ecologists.
Hope this helps you get the hang of it!

@Sseidmed are you clear on it now, as @Soniya51 just answered? Just adding to it: the Q# codes are QIDs for the Wikidata entities that the values link to.

I was just testing and playing around with different entities and ended up getting a negative value for the cosine similarity. Has anyone else encountered this? What should it imply?

> I was just testing and playing around with different entities, I ended up getting a negative value for the cosine similarity. Has anyone else encountered it? what should this imply?

Yeah, cosine similarity ranges from -1 to 1; a negative value means the doc embeddings are quite different.

> I was just testing and playing around with different entities, I ended up getting a negative value for the cosine similarity. Has anyone else encountered it? what should this imply?

Hey @Sargamm, as @Soniya51 mentioned, cosine similarities range from -1 to 1. Imagine the two embeddings as two vectors in an n-dimensional space; we are calculating the cosine of the angle between them. For angles from 0 to 180 degrees, cos(theta) ranges from 1 to -1. If both vectors point in the same direction, they are similar: the angle between them is zero degrees and cos(0) = 1.
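The explanation above can be sketched in a few lines (the vectors here are made-up 3-dimensional toys; the task's document embeddings are NumPy arrays with many more dimensions, but the math is identical):

```python
import math

# Cosine similarity from first principles: dot product over the product of
# the two vectors' Euclidean norms.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

a = [1.0, 2.0, 3.0]
print(cosine_similarity(a, [2.0, 4.0, 6.0]))    # ~1.0: same direction
print(cosine_similarity(a, [-1.0, -2.0, -3.0])) # ~-1.0: opposite direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0: orthogonal
```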

> I was just testing and playing around with different entities, I ended up getting a negative value for the cosine similarity. Has anyone else encountered it? what should this imply?
>
> Hey, @Sargamm as @Soniya51 mentioned that cosine similarities range from -1 to 1. Imagine two embeddings as 2 vectors in an n-dimensional space and we are calculating the cosine of the angle between two vectors. From 0 to 180 cos (theta) could range from -1 to 1. If both vectors are aligned in same direction that means they are similar and angle between them is zero degrees and cos(0)=1.

Yeah, I knew the mathematical reasoning behind it; I was just confirming the inference. Thanks!!

Hello,
@Isaac Is it ok if we submit the PAWS notebook with the work still in progress? I am still working on the last sections of the task but want to continue working on the project. Thanks!

@Kritihyd Yes, I will hold off another hour or two but if you are intending to submit a final application for this project, get your initial contribution in now so you can continue to work on it. You'll be able to include your final work in the final application.

Hi @Isaac, do we also have to create PRs for the contributions, or do we just add the public links of our Jupyter notebooks in the contribution URL?

Hey @Dikshagupta99 no need to create any PRs -- just include the public link for your jupyter notebook like you said. Thanks!

Hello @Isaac, do we also have to submit a proposal on Wikimedia Phabricator for this task, as mentioned here (point 11): https://www.mediawiki.org/wiki/Outreachy/Participants#Application_process_steps ?

@Tanvi_Singhal if you would like, feel free to but I won't need it for going through applications. Which is to say, so long as all of that information is in your Outreachy final submission, I don't mind whether it exists on Phabricator or not.