Synchronising Wikidata and Wikipedias using pywikibot - Task 6
Closed, ResolvedPublic

Description

This is the sixth task for T276329, Synchronising Wikidata and Wikipedias using pywikibot, aimed at getting you familiar with looping through Wikipedia categories.

  1. You should already have a Wikimedia account and set up pywikibot (if not, do Tasks 1 and 2 first).
  1. Look at the subcategories of https://en.wikipedia.org/wiki/Category:Wikipedia_categories_tracking_data_not_in_Wikidata and pick one (say which one you have picked below, so that two people don't pick the same category! And note that the 'date of birth' and 'date of death' categories are not eligible for this task.)
  1. Write a script that loops through that category, retrieves each article, and finds the relevant ID value that is not on Wikidata.
  1. Save the IDs to Wikidata!
  1. Bonus: check that there is not a Wikidata item that already has the same ID, and if there is, investigate what has happened.
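
The loop-and-save steps above could be sketched roughly as follows. This is only an illustrative sketch, not a reference solution: `extract_template_id` and `sync_category` are hypothetical names, the regex handles only a simple unnamed first parameter, and the `pywikibot.exceptions.NoPageError` import assumes a recent pywikibot release. The `dry_run` flag defaults to printing what would be saved instead of editing.

```python
import re


def extract_template_id(wikitext, template_name):
    """Return the first unnamed parameter of {{template_name|...}}, or None.

    Hypothetical helper: a simple regex that ignores nested templates and
    named parameters such as id=1234.
    """
    pattern = (
        r"\{\{\s*" + re.escape(template_name)
        + r"\s*\|\s*([^|{}=]+?)\s*(?:\||\}\})"
    )
    match = re.search(pattern, wikitext, flags=re.IGNORECASE)
    return match.group(1) if match else None


def sync_category(category_title, template_name, property_id, dry_run=True):
    """Loop through a tracking category and add missing IDs to Wikidata."""
    # Imported here so the helper above can be used without pywikibot set up.
    import pywikibot
    from pywikibot.exceptions import NoPageError

    site = pywikibot.Site("en", "wikipedia")
    repo = site.data_repository()
    category = pywikibot.Category(site, category_title)

    for page in category.articles():
        value = extract_template_id(page.text, template_name)
        if value is None:
            continue  # no usable ID found in the article: skip it
        try:
            item = pywikibot.ItemPage.fromPage(page)
        except NoPageError:
            continue  # the article has no Wikidata item yet
        item.get()
        if property_id in item.claims:
            continue  # a value already exists: leave it for a human to check
        print(page.title(), property_id, value)
        if not dry_run:
            claim = pywikibot.Claim(repo, property_id)
            claim.setTarget(value)
            item.addClaim(claim, summary="Importing ID from English Wikipedia")
```

For example, `sync_category("Category:90minut template with ID not in Wikidata", "90minut", "P3605")` would list candidate edits without saving anything; the category, template and property here are just one example from the tracking categories.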

Save your code to a repository, or create a page like https://www.wikidata.org/wiki/User:Mike_Peel/Outreachy_2 (under your username - and change the ending to '6'.) Add the links to the edits at the end of the code as a comment.

Once you are happy, send me a link to your page (by email, on my talk page, or replying to this ticket as you prefer). Make sure to also register it as a contribution on the Outreachy website (https://www.outreachy.org/outreachy-may-2021-internship-round/communities/wikimedia/synchronising-wikidata-and-wikipedias-using-pywiki/contributions/)!

Hints:

  • You can probably reuse code from earlier tasks to do this.

Event Timeline

Hello Everyone! I want to try and take up the category of "Category:Date of death not in Wikidata" for this task.

Sorry, I should have said that 'Date of death' and 'Date of birth' are not suitable for this task, since they are not IDs, and Pi bot already synchronises them with Wikidata automatically, so the only ones in those categories will be mismatches and complicated format strings. Sorry about that!

I see... Thank you for notifying! In that case, I will be working with Category:ATP template with ID not in Wikidata.

Just for clarification, are we meant to hardcode the IDs? I don't see a way to get the IDs through code unless that specific category has an API tied to it.

You can hard-code the Wikidata property number, but not the ID values - those you should extract from the articles.

What if certain articles don't have the id written anywhere? Do I skip those ones? Or are the articles in the categories hand-picked in a way where every article has to have the id?

The categories should be populated by a template, with an ID value, like {{TemplateName|1234}}. So the task is to find that template, extract the ID value, and then add it to Wikidata. What to do when the template doesn't have the ID, or has a bad ID, is my follow-up question. ;-)
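
One possible way to start on that follow-up question is to classify each extracted value before saving anything. The sketch below is only an assumption about how a script might do this: `check_id_value` is a hypothetical helper, and the format pattern is supplied by the caller (for instance, whatever format the chosen property expects).

```python
import re


def check_id_value(raw_value, format_regex):
    """Classify a template parameter as 'ok', 'missing' or 'bad'.

    format_regex is an assumed pattern for valid IDs, supplied by the
    caller; it is not looked up from Wikidata here.
    """
    if raw_value is None or not raw_value.strip():
        return "missing", None
    value = raw_value.strip()
    if re.fullmatch(format_regex, value) is None:
        return "bad", value
    return "ok", value


# Example with a purely numeric ID format (an assumption for illustration).
for raw in ["1234", "", "12a4"]:
    print(repr(raw), check_id_value(raw, r"[1-9]\d*"))
```

A script could then save only the 'ok' values and log the 'missing' and 'bad' ones for manual investigation.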

hey everyone,
What do we mean by ID value here?

Thanks,
Pushpanjali Kumari

Hey @Mike_Peel @MSGJ, for Category:The Interviews name ID not in Wikidata, what do we add as an ID for pages where two or more such IDs are possible and have already been added to the individual pages? For instance, for Glen and Les Charles, the individual Wikidata pages already have their correct IDs, and the Category:The Interviews name ID not in Wikidata page does not mention the individual pages.

So what is supposed to be the correct ID in this case? Is it going to be the relevant ID of Glen Charles (glen-charles) and Les Charles (les-charles)?

"ID value" here refers to the value of an External Identifier.

@Poornima7 I am still confused. Can you give an example or clarify it further?

Well, if your subcategory is, say, Category:The_Interviews_name_ID_not_in_Wikidata, then you can find the ID by going to the article page in question (here: Glen_and_Les_Charles). Scroll down to the external links section and click on The_Interviews:_An_Oral_History_of_Television. This opens a Wikipedia page of its own. Scroll down to its external links as well, find the official website, and navigate to the URL interviews.televisionacademy.com/interviews/glen-charles.
There you get your Interviews name ID, i.e., glen-charles.

For this property (P5773), the ID is a name rather than a number. The other way is to use templates, but there we have concerns about authenticity.

I tried my best here, though I am still a bit confused myself. Ping me anytime!

@Tru2198 So, is the final task to add P5773, with the ID from https://interviews.televisionacademy.com/interviews/glen-charles, to the identifiers list on the Wikidata page for https://en.wikipedia.org/wiki/Glen_and_Les_Charles?

You changed your subcategory. So I assume, you got the idea, right?

@Tru2198, I guess so. It would be better if you could answer my earlier question:
Is the final task to add P5773, with the ID from https://interviews.televisionacademy.com/interviews/glen-charles, to the identifiers list on the Wikidata page for https://en.wikipedia.org/wiki/Glen_and_Les_Charles?

Yes: you have to add the property, remove any existing erroneous value, and then add the new value, which is a name. You can see how it is arranged by looking at the category of pages that already have the interview ID in their Wikidata items. But I think that in this case, as the page covers two people, you should first check whether the existing value is correct, as the item will have two values for a single property.

All the best!

Hey @Tru2198, I got a bit confused. I thought we had to go to each article and search for all Wikidata properties that could be added to the Wikidata item. Then we would look for the values of those properties in the Wikipedia article.

Is my understanding wrong? If so, please correct me. Thanks!

Hello @Anubhuti! As far as I understand, we need to go through all the articles in the chosen category and find the property associated with that particular category only instead of looking for all the possible Wikidata properties.
For example: Let's say you chose the 90minut template with ID not in Wikidata category. On the top, you will notice the following statement:

This category is for Wikipedia articles using the {{90minut}} template with an ID not in Wikidata property 90minut player ID (P3605).

So you need to find the information associated with the "{{90minut}}" (or something similar, depending on the category) in each article and assign it a property ID of P3605 while adding it to Wikidata.
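
As a rough illustration of reading the template name and property number off that sentence, here is a small sketch. It assumes the description text looks like the sentence quoted above; `parse_tracking_description` is a hypothetical helper, and the actual wikitext of a category page may be formatted differently.

```python
import re

DESCRIPTION = (
    "This category is for Wikipedia articles using the {{90minut}} template "
    "with an ID not in Wikidata property 90minut player ID (P3605)."
)


def parse_tracking_description(text):
    """Return (template_name, property_id) from a tracking category blurb."""
    template = re.search(r"\{\{\s*([^{}|]+?)\s*\}\}", text)
    prop = re.search(r"\((P\d+)\)", text)
    if template is None or prop is None:
        return None
    return template.group(1), prop.group(1)


print(parse_tracking_description(DESCRIPTION))
# → ('90minut', 'P3605')
```

In practice it is probably simpler to read the sentence once and hard-code the template name and property number for the chosen category.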

Hope this helps :)

Yes, this is correct. You only need to find and use the authority control ID corresponding to the tracking category - just one Wikidata property.

You tell me! :-) For a page like Glen and Les Charles, where two IDs are possible, you may want to include both on Wikidata, or get your code to skip that article.

Well, I think that the Wikidata page for Glen and Les Charles is supposed to have the IDs of both Glen Charles and Les Charles. In any case, there is no ID like glen-and-les-charles (going by the naming convention of this ID type). So I guess the Glen and Les Charles Wikidata page will have two values for this ID: one for Les Charles and one for Glen Charles.

Also, the individual IDs are already present on their respective Wikidata pages. So I believe that, since the Glen and Les Charles Wikidata page covers both of them, adding the individual IDs to this page should work.

Hope this is okay!
Thanks
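
The situation described here, where another Wikidata item already holds the same ID, is what the bonus step asks you to detect. A minimal sketch of such a check is given below, assuming the Wikidata Query Service is used through pywikibot's `WikidataSPARQLPageGenerator`; the helper names `build_duplicate_query` and `items_with_value` are hypothetical.

```python
def build_duplicate_query(property_id, value):
    """Build a SPARQL query listing items that already hold this ID value."""
    # Escape backslashes and double quotes so the value is a valid literal.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return 'SELECT ?item WHERE { ?item wdt:%s "%s" }' % (property_id, escaped)


def items_with_value(property_id, value):
    """Return the titles (Q-ids) of items that already use this value."""
    # Imported here so the query builder works without pywikibot set up.
    import pywikibot
    from pywikibot import pagegenerators

    repo = pywikibot.Site("wikidata", "wikidata")
    query = build_duplicate_query(property_id, value)
    generator = pagegenerators.WikidataSPARQLPageGenerator(query, site=repo)
    return [item.title() for item in generator]


print(build_duplicate_query("P5773", "glen-charles"))
# → SELECT ?item WHERE { ?item wdt:P5773 "glen-charles" }
```

If `items_with_value` returns an item other than the one linked to the article, the script could skip the edit and log the case for investigation.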

I think so, yes: you would have to add both glen-charles and les-charles separately on the same Wikidata page.
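
Adding several values for one property, without repeating any that are already on the item, might look like the sketch below. It is an illustration under assumptions rather than a recommended policy: `values_still_missing` and `add_values` are hypothetical helpers, and `item` and `repo` are expected to be a pywikibot `ItemPage` and its data repository.

```python
def values_still_missing(existing_values, found_values):
    """Return found values not yet on the item, de-duplicated, in order."""
    seen = set(existing_values)
    missing = []
    for value in found_values:
        if value not in seen:
            seen.add(value)
            missing.append(value)
    return missing


def add_values(item, repo, property_id, found_values, summary):
    """Add one claim per missing value to a pywikibot ItemPage."""
    # Imported here so the helper above works without pywikibot set up.
    import pywikibot

    item.get()
    existing = [
        claim.getTarget() for claim in item.claims.get(property_id, [])
    ]
    for value in values_still_missing(existing, found_values):
        claim = pywikibot.Claim(repo, property_id)
        claim.setTarget(value)
        item.addClaim(claim, summary=summary)


print(values_still_missing(["glen-charles"], ["glen-charles", "les-charles"]))
# → ['les-charles']
```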

@Anubhuti I believe what you described (searching each article for all the Wikidata properties that could be added) is mostly Task 4.

Thank you all for explaining!

@Mike_Peel are there any community specific questions that we need to answer for our final application?

Not for this project.

Hi everyone,
What do we need to fill in for the Outreachy internship project timeline?

Regards
Pushpanjali Kumari

I put guidance for this at T276329 under the task list.

thank you

Hi everyone,

What is the project contribution deadline? I am a little confused after getting the email about the final application deadline extension.

Regards
Pushpanjali Kumari

It's not entirely clear to me. What I just got told was:

  • May 3 at 7pm UTC - final application deadline
  • May 9 - contribution deadline

So maybe you can continue editing/adding contributions after the application deadline - @srishakatux? But I would prefer if contributions for this project were submitted by the application deadline (3 May).

Hi everyone,

I am also working on https://en.wikipedia.org/wiki/Category:Canada_Soccer_player_ID_not_in_Wikidata

Hi all, I have worked on Category:Guardian_topic_ID_not_in_Wikidata.