Synchronising Wikidata and Wikipedias using pywikibot - Task 3
Open, Needs Triage, Public

Description

This is the third task for T276329, Synchronising Wikidata and Wikipedias using pywikibot, aimed at getting you familiar with parsing information from Wikipedia articles.

  1. You should already have a Wikimedia account and have set up pywikibot (if not, do Tasks 1 and 2 first).
  2. Set up a script that connects to Wikipedia and loads the contents of one of the pages you identified in Task 1 (just one for now).
  3. Parse through the article text to extract the statements you manually found in Task 1. Use whichever tool you would like for this (e.g., 're', or searching for template parameter names in the infobox, etc.). (Yes - this is the first tricky part!)
  4. Print out the information (extracted from the article, not Wikidata!) alongside the property name (e.g., "P31 = radio telescope"). Code this for at least 6 statements in the current article (see the sketch below for one possible starting point).
  5. Try a few other pages as well - how well does your parsing work, and what changes do you need to make for the other pages?

Bonus: print out the corresponding values from Wikidata as well, if they are available.
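As a rough, untested sketch of how steps 2-4 and the bonus could fit together - the article title, the two properties and the regular expressions below are only illustrative assumptions (infobox parameter names vary between articles), and you would extend this to at least six statements:

```python
import re
import pywikibot

# Connect to English Wikipedia and load one of the articles identified in
# Task 1. 'Lovell Telescope' is only an example - substitute your own page.
site = pywikibot.Site('en', 'wikipedia')
page = pywikibot.Page(site, 'Lovell Telescope')
text = page.text  # raw wikitext, as in Task 2

statements = {}

# P31 ('instance of'): look for "is a [[...]]" in the opening sentence.
match = re.search(r"is an? \[\[([^\]|]+)", text)
if match:
    statements['P31'] = match.group(1)

# P571 ('inception'): guess at a 'built' or 'completed' infobox parameter.
# (Parameter names differ between infoboxes - adjust for your articles.)
match = re.search(r"\|\s*(?:built|completed)\s*=\s*([^\n|]+)", text)
if match:
    statements['P571'] = match.group(1).strip()

# Print what was extracted from the article itself.
for prop, value in statements.items():
    print(f'{prop} = {value}')

# Bonus: print the corresponding values already on Wikidata, if available.
item = pywikibot.ItemPage.fromPage(page)
item.get()
for prop in statements:
    for claim in item.claims.get(prop, []):
        target = claim.getTarget()
        if isinstance(target, pywikibot.ItemPage):
            target.get()
            print(f'Wikidata: {prop} = {target.labels.get("en")}')
        else:
            print(f'Wikidata: {prop} = {target}')
```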

Save your code to a repository, or create a page like https://www.wikidata.org/wiki/User:Mike_Peel/Outreachy_2 (under your username, and change the ending to '3').

Once you are happy, send me a link to your page (by email, on my talk page, or replying to this ticket as you prefer). Make sure to also register it as a contribution on the Outreachy website (https://www.outreachy.org/outreachy-may-2021-internship-round/communities/wikimedia/synchronising-wikidata-and-wikipedias-using-pywiki/contributions/)!

Hints:

Event Timeline

@Mike_Peel Concerning the instruction "load the contents of one of the pages": is this referring to a single article page? Please provide more explanation.

I've modified the task - is that clearer? One page to start with, but then also try some of the other pages to see how your code works on different articles.

Parse through the article text to extract the statements you manually found in Task 1. Use whichever tool you would like for this (e.g., 're', or searching for template parameter names in the infobox, etc.).

@Mike_Peel can we use any parsing tool/library for this?

Yes.

Alright, thanks

This comment was removed by Tru2198.

Hi @Mike_Peel, I have tried implementing the 3rd task: https://www.wikidata.org/wiki/User:Srishti0gupta/outreachy_3

I was able to complete it for properties (P numbers) whose values are Q numbers or images.

However, I feel the strategy I used isn't good, because it won't work whenever a P number's value is not a Q number or an image. Maybe whether the property's value contains a link or is plain text should be the deciding factor, so that after retrieving a link the code can go to that page and extract the title, or otherwise print the plain text.

Please share if there is a better approach, and whether this is sufficient for the task.

Awaiting your feedback.

@Srishti0gupta, wonderful program! I wanted to know how you dealt with parameter values that don't have a label, and also with values without Q numbers, like dates?

Thanks!

Hi @Tru2198, I haven't implemented it for such cases yet, except images.

I believe finding the href HTML tag and then asking for the title of the page at that link will resolve this issue.

And when the programme can't find an href, it must be plain text, which should be extracted from the relevant <div> tags.

If there is any better approach, please share. Because

Hello @Mike_Peel, @MSGJ
I have completed Task 3, though I haven't used regex for the parsing.

Here is the link to my task page.

https://www.wikidata.org/wiki/User:Tru2198/Outreachy_3

Kindly provide suggestions on whether this approach is viable, and any other feedback.

Thank you!

@Srishti0gupta cool approach, though I didn't use re. If I get a green flag on mine, I'll surely help you out. My link is above, though I'm not sure if it's workable. Also, on your page you have mentioned "sparesite()" in the last line; I think it's a spelling error, do check it out!

Hi @MSGJ, I have completed Task 3, which awaits your feedback!
Also, I think this tutorial is ideal for Task 4 as well, which involves adding information to Wikidata.

Thank you!

@Mike_Peel So we have to print from both Wikipedia and wikidata? Maybe I mistakenly did only the bonus part! Thank you

“Print out the information alongside the property name (e.g., "P31 = human").”
Isn't this only in Wikidata, since Wikidata stores the information in the form of properties and QIDs?

So, how should the statements be printed on parsing Wikipedia? I am confused here.

Thank you for your time!

@Mike_Peel
@MSGJ

Please also review my Task 2:
https://www.wikidata.org/w/index.php?title=User:Tru2198/Outreachy_2

Thank you!

Thank you for your time!

For those working on this task, there is a useful tutorial here: https://www.wikidata.org/wiki/Wikidata:Pywikibot_-_Python_3_Tutorial

Hi @MSGJ, Thanks for sharing this. Still going through it and learning.

@Tru2198 and @Srishti0gupta, I have been trying to write a parsing function - could either of you help me with the steps involved? Parsing is something I haven't tried before.
I'd be highly grateful to you.

@Mike_Peel So we have to print from both Wikipedia and wikidata? Maybe I mistakenly did only the bonus part! Thank you

Sorry for the confusion, I've tried to clarify the task today. The main aim of this task is to extract information from the Wikipedia article. @Srishti0gupta @AnkitaxPriya please also check that. :-) You shouldn't need to scrape the HTML code, since you can get page.text for the Wikipedia article through pywikibot (as per task 2).
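(For anyone following along, the page.text route looks roughly like this - the article title is just an example:)

```python
import pywikibot

site = pywikibot.Site('en', 'wikipedia')
page = pywikibot.Page(site, 'Lovell Telescope')  # example title
wikitext = page.text  # raw wikitext of the article - no HTML scraping needed
print(wikitext[:500])  # first 500 characters, just to check it loaded
```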

@AnkitaxPriya what are you looking for in the article? Is it a template parameter, an identifier on an external link, a category, or something in the text itself? If you can be more specific about what you're trying to do then I will try and help.

@Mike_Peel Thanks

@AnkitaxPriya I have re-started working on the code, as I had interpreted it differently earlier; once I get to it myself, I will surely help you.

Thank you for making it clear @Mike_Peel. I will try to work on it now.

@MSGJ I am trying to write Python code to parse the Wikidata site and extract property parameters and their respective Q-values.

Thank you @Srishti0gupta

@AnkitaxPriya You should just be able to use page.get() for this from the Wikidata item, and work through the returned dictionary - however, the main aim of the task is to use page.text from the Wikipedia page and to search through that for the info.
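To illustrate what "work through the returned dictionary" might look like, here is a minimal sketch (the article title is an example, and non-item values such as dates or strings are simply printed as-is):

```python
import pywikibot

site = pywikibot.Site('en', 'wikipedia')
page = pywikibot.Page(site, 'Lovell Telescope')  # example article

# Get the linked Wikidata item; item.get() returns a dictionary with
# 'labels', 'descriptions', 'claims', etc.
item = pywikibot.ItemPage.fromPage(page)
data = item.get()

# 'claims' maps each property ID (e.g. 'P31') to a list of claims.
for prop_id, claims in data['claims'].items():
    for claim in claims:
        target = claim.getTarget()
        if isinstance(target, pywikibot.ItemPage):
            target.get()
            value = target.labels.get('en', target.title())
        else:
            value = target  # dates, quantities, strings, etc.
        print(f'{prop_id} = {value}')
```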

Since we are parsing through the article text on Wikipedia, are we also meant to get the property ID as shown in the example?

"P31 = radio telescope"

If so, how is this possible without accessing Wikidata?

Hello @Aminehassou! What I think we have to do is search for 'specific information' (especially the statements we searched for in Task 1) in the Wikipedia article. And since we know what kind of information we searched for, we can assign the corresponding property ID that is appropriate for it.

Hey, thanks for the response, so in other words, we hardcode the value in since we have already found it during task 1?

No problem. And yes, that's pretty much what we have to do.

Not quite. You hard-code 'P31' ('instance of'), since that is what you are looking for. But then you extract 'radio telescope' directly from the article (e.g., by retrieving the "Lovell Telescope" article and finding "is a [[radio telescope]]" in the first sentence). So you hard-code the question ('what is this an instance of?'), find the answer from the article with Python, and then get your code to print it out in terms of the Wikidata value: "P31 = value".
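In code, that pattern might look something like the following sketch (the regular expression is just one possible way of picking out the first wikilink after "is a"):

```python
import re
import pywikibot

site = pywikibot.Site('en', 'wikipedia')
page = pywikibot.Page(site, 'Lovell Telescope')

# Hard-coded question: P31, 'instance of'.
# Answer extracted from the article text: "... is a [[radio telescope]] ..."
match = re.search(r"is an? \[\[([^\]|]+)", page.text)
if match:
    print(f'P31 = {match.group(1)}')  # e.g. "P31 = radio telescope"
```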

I misunderstood 'value' as the Property ID for that element. Apologies for the misunderstanding, @Aminehassou and @Mike_Peel

Ok, thank you for the clarification.

[When replying, please strip unneeded full quotes to keep things readable. Thanks.]

I submitted my task 3 code through email, looking forward to the feedback!

Greetings everyone, @Mike_Peel @MSGJ . I have implemented this task to the best of my present understanding and I am happy with the results. I look forward to your review: https://www.wikidata.org/wiki/User:Tambe_Tabitha/Outreachy_3 .