
Synchronising Wikidata and Wikipedias using pywikibot
Closed, ResolvedPublic

Description

IMPORTANT: Make sure to read the Outreachy participant instructions and communication guidelines thoroughly before commenting on this task. This space is for project-specific questions, so avoid asking questions about getting started, setting up Gerrit, etc. When in doubt, ask your question on Zulip first!

Approved license

I assert that this Outreachy internship project will be released under either an OSI-approved open source license that is also identified by the FSF as a free software license, OR a Creative Commons license approved for free cultural works.

  • Yes

No proprietary software:

I assert that this Outreachy internship project will forward the interests of free and open source software, not proprietary software.

  • Yes

How long has your team been accepting publicly submitted contributions?

  • None

How many regular contributors does your team have?

  • 1-2 people

Brief summary

Wikidata is a structured data repository linked to Wikipedia and the other Wikimedia projects. It holds structured data about a huge number of concepts, including every topic covered by a Wikipedia article, as well as many scientific papers and other topics. It also includes the interlanguage links between Wikipedia articles in different languages, links from Wikipedia to Commons, and links between other Wikimedia projects.

It was started by importing all Wikipedia interwiki links, and has been steadily expanding since. However, when a new Wikipedia article is started, it is not automatically matched to Wikidata items, nor is a new item created for it. For a limited number of wikis, an automated Python script creates new items, but it can easily create duplicate items. Additionally, there is a lot of information in Wikipedia articles that has not yet been imported into Wikidata.

In this project you will take the existing Python scripts, which use the 'pywikibot' package to edit Wikidata, and expand them to match new articles against existing Wikidata items using ancillary data (such as identifiers that are common in both the Wikipedia article and Wikidata entry). You will also significantly increase the number of properties that are bot-imported from Wikipedia to Wikidata, and explore concepts like importing references for this information as well. This code will then be used live to create new Wikidata items, replacing the existing scripts.
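
To illustrate the matching idea described above, here is a minimal, hypothetical sketch in plain Python (no pywikibot, and not the project's actual code): it matches a new article to an existing item when both carry the same external identifier. The function name, the index, and the sample values are all invented for demonstration.

```python
# Hypothetical sketch of identifier-based matching, not the project's code.

def match_article(article_ids, wikidata_index):
    """Return the QID of an existing item that shares an identifier with
    the article, or None if nothing matches (so a new item is needed)."""
    for prop, value in article_ids.items():
        qid = wikidata_index.get((prop, value))
        if qid is not None:
            return qid
    return None

# Invented index mapping (property, value) pairs to QIDs, e.g. built from
# a query over existing Wikidata items. P214 is the VIAF ID property.
index = {("P214", "113230702"): "Q42"}

# Identifiers extracted from a new article (e.g. from its infobox).
print(match_article({"P214": "113230702"}, index))  # -> Q42: add a sitelink
print(match_article({"P214": "000000"}, index))     # -> None: create a new item
```

The real matching logic would combine several ancillary signals, but the shape of the decision (match found: add sitelink; no match: create item) is the same.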

This project is mentored by Mike Peel. Knowledge of Python is an advantage, although it can be learnt during the project. Knowing multiple human languages is useful to work with multiple Wikipedia language communities, but is not required.

Minimum system requirements

You will need a working Python 3 installation on your computer.

How can applicants make a contribution to your project?

You will start by understanding how Wikidata works, looking through Wikipedia articles and seeing how the information is stored on Wikidata. From there you will identify patterns that can be used to automatically import that data, and ways of matching articles with the Wikidata item if that link did not already exist. You will then code up automated matching functions and test how well they will work with currently unmatched articles. Ultimately, these will be integrated into the live code to keep Wikidata and the different language Wikipedias in sync.

You will need to create an account on Wikipedia (if you don't already have one), and install the pywikibot package (https://www.mediawiki.org/wiki/Manual:Pywikibot). I can provide guidance for each specific starting task, and in general please feel free to ask questions through Outreachy, by email, or at https://www.wikidata.org/wiki/User_talk:Mike_Peel .
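
For reference, installation is usually just a pip install followed by generating a configuration file. Treat the commands below as a sketch; the Pywikibot manual linked above has the authoritative steps for your environment.

```shell
# Install pywikibot from PyPI (recent releases provide the 'pwb' entry point).
pip install pywikibot

# Generate user-config.py interactively (from a source checkout, run
# 'python pwb.py generate_user_files' instead).
pwb generate_user_files
```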

Repository

https://bitbucket.org/mikepeel/wikicode/

Issue tracker

N/A

Tasks

There are six 'starter' tasks that can be done as Outreachy contributions. These aim to guide you through how Wikipedia and Wikidata are structured, and how Pywikibot interacts with them. They get progressively harder, but you don't have to do them in order (except you must do task 1 first!), and you don't have to do all of them.

  1. T278860 Look through a class of articles (e.g., books, authors, games, etc.) and identify how the information in the article matches with the Wikidata properties for that topic
  2. T278863 Set up pywikibot on your computer, and understand how it interacts with Wikidata
  3. T278997 Write a function that reads in a Wikipedia article, parses it for specific information, and prints it to screen
  4. T279288 Write a function that writes new structured data into Wikidata (ideally linked to the function above)
  5. T279289 Write a function that searches Wikidata for a match to a term
  6. T279290 Loop through a Wikipedia tracking category and import IDs into Wikidata
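
To give a flavour of what task 3 involves, here is a minimal, self-contained sketch that pulls one infobox parameter out of raw wikitext with a regular expression. The pywikibot lines are shown only as comments, since they need a configured user-config.py; the template and parameter names are illustrative.

```python
import re

def extract_param(wikitext, param):
    """Return the value of |param= in the wikitext, or None if absent."""
    match = re.search(r"\|\s*%s\s*=\s*([^\|\}\n]+)" % re.escape(param), wikitext)
    return match.group(1).strip() if match else None

# With pywikibot the wikitext would come from the live site, roughly:
#   import pywikibot
#   site = pywikibot.Site("en", "wikipedia")
#   text = pywikibot.Page(site, "Douglas Adams").text
text = "{{Infobox person\n| name = Douglas Adams\n| birth_date = 11 March 1952\n}}"

print(extract_param(text, "birth_date"))  # -> 11 March 1952
```

Real infoboxes are messier (nested templates, references, aliases for parameter names), which is exactly what makes the parsing part of the project interesting.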

These tasks also form the start of the main project, which will integrate these into an automated script that adds sitelinks to new articles to existing items, or creates new items where there is no match, and imports as much data (and references) as possible from Wikipedia articles into Wikidata.

When filling in your application, you will be asked about a timeline for the work during the project. I encourage you to draft a rough timeline yourself, bearing in mind:

  • The aim of the project is to import data from Wikipedia into Wikidata, but this will be done in stages (e.g., different topic areas, like buildings vs. statues; different language wikis; drafting vs. testing on different pages vs. running code)
  • Large runs to import the data will need bot approval (can be 1 week, can be longer if controversial), and you should include time for that (waiting for approval while working on other parts!)
  • Be realistic about what you think you will be able to achieve during the internship - you won't be able to do everything!
  • If you are accepted, we will work together to revise the timeline as the work progresses - it doesn't have to be perfect!

There are no community specific questions to answer in your application for this project.

Benefits

You will learn, or improve your knowledge of, Python coding. You will gain familiarity with how structured data is maintained on Wikidata, and how it relates to Wikipedia articles.

Community benefits

More interwiki links with Wikipedia articles. Fewer duplicate Wikidata items that need to be merged. Significantly more content from Wikipedia available through Wikidata.

Event Timeline

srishakatux changed the visibility from "Public (No Login Required)" to "acl*outreachy-mentors (Project)". Mar 4 2021, 9:33 PM
srishakatux moved this task from Backlog to Featured Projects on the Outreachy (Round 22) board.

@srishakatux Thanks for accepting the project! Just to check, is it necessary for this ticket to be private, or will it be made public again later in the process? I ask because I was hoping to point community members to this ticket if needed during the project, but I can set up a wiki page for that instead if we need this as a private space.

@Mike_Peel Yes, we keep this task private for a few days until Outreachy opens the application period to all. I will make it public and inform you here when that happens. https://www.mediawiki.org/wiki/Outreachy/Mentors#_Before_the_program (Step 2)

Thanks - that's a bit odd, but as long as it's public again when the project launches, that's fine.

A related question, where should I put the full instructions for doing each of the microtasks? Can I set up separate tickets here, is there a way to enter them on the Outreachy website, or should I write them up on-wiki? (Bear in mind that they can be done by multiple applicants by picking different topic areas, so answers need to be separate from the instructions...)

You can set up tickets in Phabricator, multiple or single, depending on the nature of the task. On the Outreachy site, you can update the project description to include the link to the microtasks on Phabricator.

srishakatux changed the visibility from "acl*outreachy-mentors (Project)" to "Public (No Login Required)". Mar 29 2021, 4:18 PM

There was quite a lot of interest in the project today (I think 15 people?!), so I've set up the guidance for the first two tasks, and hopefully have replied to everyone! Will add more tasks shortly.

@srishakatux Just to check, if there are multiple good candidates at the end of the contribution period, is it possible to accept multiple students, or is it only possible to accept one student? I'm not sure what the rules here are, and it's good to set out expectations at the start of the project.

Hi. I am an Outreachy applicant. I am currently unable to create/register a wikimedia account. Where do I go from here?

Hi @Nonye18. You can follow this link to register for a Wikimedia account:
https://www.mediawiki.org/w/index.php?title=Special:CreateAccount&returnto=Help%3ALogging+in
Please share any errors you get, so that I can understand what the issue is.

@Nonye18 also, it looks like you've already created your account if you can comment here, try logging in with Nonye18 !

I think I'm up to date with replies now, if I've missed replying to you then please tell me!

This comment was removed by Tru2198.
Mike_Peel updated the task description. (Show Details)

@Tru2198 It's not too late, you can still contribute if you want. I am double-checking, but most likely there will only be one intern for this project (and there has been a lot of interest in it).

@Mike_Peel Thank you for the heads up! No matter the interns, I am really enjoying the learning in the contribution!

@Mike_Peel I sent you an email containing the link to my first and second task solutions but would like to know if I have to send the code I used in the second task

I double-checked, there is only one intern available for this project. There are also two other Wikimedia Outreachy projects, though - see https://phabricator.wikimedia.org/project/view/5105/ !

Since there have been so many applicants, and only one place is available, if you're interested then I would strongly recommend also applying for T276270 - which is also Python and Wikidata related and also has one place available!

Thanks @Mike_Peel for the advice but I prefer this project to the others,
also I have another open source project I'm working on.

Ifeanyi Eze

@Mike_Peel please check my work: https://www.wikidata.org/wiki/User:Ifeanyi_liam/Outreachy_3. I tried to get the P31 property of the Igbo-Ukwu Wikipedia page, but it threw an error. I really need to understand how the Wikidata item structure works. Any suggestions?

I've replied by email - try wk_items['claims']['P31']
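
For anyone hitting the same error: that suggestion assumes the nested dict layout that pywikibot's ItemPage.get() returns, roughly like the simplified stand-in below (the real 'claims' values are lists of pywikibot.Claim objects, not strings).

```python
# Simplified stand-in for the dict returned by ItemPage.get();
# real 'claims' values are lists of pywikibot.Claim objects.
wk_items = {
    "labels": {"en": "Igbo-Ukwu"},
    "claims": {"P31": ["Q486972"]},  # P31 = instance of
}

# Guard against missing properties instead of indexing directly,
# since not every item has every property.
p31 = wk_items["claims"].get("P31", [])
print(p31)  # -> ['Q486972']
```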

I think I'm up to date with giving feedback now, if I've missed your question/post then please let me know!

Greetings.
@Mike_Peel in the project description on the Outreachy website, you stated that

For a limited number of wikis, an automated python script creates new items, but it can easily create duplicate items ...

Please can you direct me to where I can find this Python script? Is it in the pywikibot repository? I would like to have a look at it and possibly identify avenues for improvement. @MSGJ

Well, that's the main aim of the whole internship. :-) But see https://bitbucket.org/mikepeel/wikicode/src/master/enwp_find_wikidata.py and https://bitbucket.org/mikepeel/wikicode/src/master/wikidata_new_from_wikipedia_query_article.py for the current codes.

This is helpful. Thank you!

Hello!
I had a question about the bot approval process for importing data. Does our system need to stay active while we wait for approval of the import?

Normally you do some test edits, then stop and wait for approval, then do the full run. There is also normally some discussion you need to engage in. For an example, see https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/Pi_bot_17 (for others, see the links under the 'Task' column at https://www.wikidata.org/wiki/User:Pi_bot ).

@Mike_Peel Thank you for the explanation.

I had another question out of curiosity - in the page for Pi_bot (https://www.wikidata.org/wiki/User:Pi_bot), what do the columns with the times refer to? Are those scripts that run the code automatically at those particular time slots?

Yes - there are several shell scripts that cron runs automatically at those times of the day. Click on the times and you'll see the shell scripts, and the python script names that they run.
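
For illustration, crontab entries in that style look like the following; each line runs a wrapper script (which in turn runs the Python bot) at a fixed daily time. Paths and script names here are invented.

```shell
# Format: minute hour day-of-month month day-of-week command
30 2 * * * /home/bot/run_new_articles.sh
0 14 * * * /home/bot/run_import_ids.sh
```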

Thank you for the response and explanation :)

Hi. I’m sending this to all applicants of the 'Synchronising Wikidata and Wikipedias using pywikibot' Outreachy project (T276329). Thanks for all your contributions so far! You can continue sending contributions until the Outreachy deadline. If you haven’t already, please submit your contributions to Outreachy, and you might want to start writing your application if you haven’t already (the deadline is next week!) You can do so here: https://www.outreachy.org/outreachy-may-2021-internship-round/communities/wikimedia/synchronising-wikidata-and-wikipedias-using-pywiki/contributions/ . Also, the project has had a lot of interest, and there is only one place available, so I strongly encourage you to also submit contributions and applications to other Outreachy projects as well to increase your chances of doing an Outreachy internship! (Again, I’m saying this to all applicants.) Thanks.

Hey everyone,
For the Relevant Projects section in the Outreachy final application, what sorts of projects can we include? Which Python concepts should they emphasize?

Thank you for your time!
@Mike_Peel @MSGJ

If you have worked on any projects using Wikidata or bots or both, you should include those. Otherwise, you could include a project that shows your aptitude in Python. Any project with lots of Python code that you wrote should be good.

You could also include projects where you worked with Mediawiki data (dumps), if you have done so.
More is better than less. If you think something is relevant, include it.

What Tambe said. :-) Whatever you write doesn't have to be Wikidata specific, but it helps if there's some sort of a connection to the project.

Mike_Peel claimed this task.