Approved license
I assert that this Outreachy internship project will be released under either an OSI-approved open source license that is also identified by the FSF as a free software license, OR a Creative Commons license approved for free cultural works
- Yes
No proprietary software:
I assert that this Outreachy internship project will forward the interests of free and open source software, not proprietary software.
- Yes
How long has your team been accepting publicly submitted contributions?
- None
How many regular contributors does your team have?
- 1-2 people
Brief summary
Wikidata is a structured data repository linked to Wikipedia and the other Wikimedia projects. It holds structured data about a huge number of concepts, including every topic covered by a Wikipedia article, as well as many scientific papers and other topics. It also includes the interlanguage links between Wikipedia articles in different languages, links from Wikipedia to Commons, and links between other Wikimedia projects.
It was started by importing all Wikipedia interwiki links, and has been steadily expanding since. However, when a new Wikipedia article is started, it is not automatically matched to an existing Wikidata item, nor is a new item created for it. For a limited number of wikis, an automated Python script creates new items, but it can easily create duplicate items. Additionally, there is a lot of information in Wikipedia articles that has not yet been imported into Wikidata.
In this project you will take the existing Python scripts, which use the 'pywikibot' package to edit Wikidata, and expand them to match new articles against existing Wikidata items using ancillary data (such as identifiers that are common in both the Wikipedia article and Wikidata entry). You will also significantly increase the number of properties that are bot-imported from Wikipedia to Wikidata, and explore concepts like importing references for this information as well. This code will then be used live to create new Wikidata items, replacing the existing scripts.
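As a rough illustration of matching on ancillary data, the sketch below compares identifiers found in an article with identifiers already stored on candidate items. It is a minimal offline example and not the project's existing code: the function name `match_by_identifiers`, the in-memory `CANDIDATE_ITEMS` dictionary, the item IDs and the identifier values are all hypothetical. P214 (VIAF ID) and P227 (GND ID) are real Wikidata properties, used here only as examples; in practice the item data would be fetched from Wikidata with pywikibot.

```python
from typing import Dict, List

# Hypothetical candidate items: item ID -> {property ID: identifier value}.
# The item IDs and values are made up for illustration.
CANDIDATE_ITEMS: Dict[str, Dict[str, str]] = {
    "Q_EXAMPLE_A": {"P214": "0000001", "P227": "1111111-1"},
    "Q_EXAMPLE_B": {"P214": "0000002"},
}


def match_by_identifiers(
    article_ids: Dict[str, str],
    items: Dict[str, Dict[str, str]],
) -> List[str]:
    """Return the IDs of items that share at least one identifier value
    (same property, same value) with the article."""
    matches = []
    for item_id, item_ids in items.items():
        for prop, value in article_ids.items():
            if item_ids.get(prop) == value:
                matches.append(item_id)
                break  # one shared identifier is enough for this item
    return matches


if __name__ == "__main__":
    # Identifiers as they might be extracted from a new article (made up).
    found_in_article = {"P214": "0000002"}
    print(match_by_identifiers(found_in_article, CANDIDATE_ITEMS))
```

An exact match on an external identifier is a much stronger signal than a matching title, which is why identifiers are a natural first thing to compare.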
This project is mentored by Mike Peel. Knowledge of Python is an advantage, although it can be learnt during the project. Knowing multiple human languages is useful for working with multiple Wikipedia language communities, but is not required.
Minimum system requirements
You will need a working Python 3 installation on your computer.
How can applicants make a contribution to your project?
You will start by understanding how Wikidata works, looking through Wikipedia articles and seeing how the information is stored on Wikidata. From there you will identify patterns that can be used to automatically import that data, and ways of matching articles with the Wikidata item if that link did not already exist. You will then code up automated matching functions and test how well they will work with currently unmatched articles. Ultimately, these will be integrated into the live code to keep Wikidata and the different language Wikipedias in sync.
You will need to create an account on Wikipedia (if you don't already have one), and install the pywikibot package (https://www.mediawiki.org/wiki/Manual:Pywikibot). I can provide guidance for each specific starting task, and in general please feel free to ask questions through Outreachy, by email, or at https://www.wikidata.org/wiki/User_talk:Mike_Peel .
Repository
https://bitbucket.org/mikepeel/wikicode/
Issue tracker
N/A
Tasks
There are six 'starter' tasks that can be done as Outreachy contributions. These aim to guide you through how Wikipedia and Wikidata are structured, and how Pywikibot interacts with them. They get progressively harder, but you don't have to do them in order (except you must do task 1 first!), and you don't have to do all of them.
- T278860 Look through a class of articles (e.g., books, authors, games, etc.) and identify how the information in the article matches with the Wikidata properties for that topic
- T278863 Set up pywikibot on your computer, and understand how it interacts with Wikidata
- T278997 Write a function that reads in a Wikipedia article, parses it for specific information, and prints it to screen
- T279288 Write a function that writes new structured data into Wikidata (ideally linked to the function above)
- T279289 Write a function that searches Wikidata for a match to a term
- T279290 Loop through a Wikipedia tracking category and import IDs into Wikidata
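For the parsing task (T278997), one possible starting point is to pull a single parameter out of an infobox template in the article's wikitext. The sketch below is an example under stated assumptions rather than a complete parser: `extract_template_param` and the sample wikitext are hypothetical, and a simple regular expression like this will not cope with nested templates or with wikilinks that contain a pipe. With pywikibot, the wikitext would typically come from the `text` attribute of a `pywikibot.Page`; a hard-coded string is used here so the example runs without network access.

```python
import re
from typing import Optional

# Made-up wikitext standing in for the text of a real article.
SAMPLE_WIKITEXT = """{{Infobox book
| name   = Example Book
| author = Jane Doe
| isbn   = 978-0-00-000000-2
}}
'''Example Book''' is a novel by Jane Doe."""


def extract_template_param(wikitext: str, param: str) -> Optional[str]:
    """Return the value of the first '| param = value' line, or None.

    Assumes one parameter per line and no nested templates or piped
    links inside the value.
    """
    pattern = re.compile(
        r"\|\s*" + re.escape(param) + r"\s*=[ \t]*([^|\n}]*)",
        re.IGNORECASE,
    )
    match = pattern.search(wikitext)
    if match is None:
        return None
    value = match.group(1).strip()
    return value or None  # treat an empty parameter as missing


if __name__ == "__main__":
    for name in ("author", "isbn", "publisher"):
        print(name, "->", extract_template_param(SAMPLE_WIKITEXT, name))
```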
These tasks also form the start of the main project, which will integrate them into an automated script that adds sitelinks for new articles to existing items, creates new items where there is no match, and imports as much data (and as many references) as possible from Wikipedia articles into Wikidata.
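The decision at the centre of such a script can be sketched as below. This is a minimal outline, not the existing code: `decide_action`, the action names and the item IDs are hypothetical. The description above covers only two outcomes (link to an existing item, or create a new one); sending articles with several candidate items to manual review is an extra assumption made here, as one way to avoid linking to the wrong item.

```python
from typing import List, Optional, Tuple


def decide_action(candidate_ids: List[str]) -> Tuple[str, Optional[str]]:
    """Choose what to do with a new article, given the candidate items
    that a matching step has found for it."""
    unique = sorted(set(candidate_ids))
    if not unique:
        # No existing item matches: a new item would be created.
        return ("create_item", None)
    if len(unique) == 1:
        # Exactly one match: the sitelink would be added to that item.
        return ("add_sitelink", unique[0])
    # Several different matches (assumption): leave for a human to check.
    return ("needs_review", None)


if __name__ == "__main__":
    print(decide_action([]))
    print(decide_action(["Q_EXAMPLE_A", "Q_EXAMPLE_A"]))
    print(decide_action(["Q_EXAMPLE_A", "Q_EXAMPLE_B"]))
```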
When filling in your application, you will be asked about a timeline for the work during the project. I encourage you to draft a rough timeline yourself, bearing in mind:
- The aim of the project is to import data from Wikipedia into Wikidata, but this will be done in stages (e.g., different topic areas, like buildings vs. statues; different language wikis; drafting vs. testing on different pages vs. running code)
- Large runs to import the data will need bot approval, which can take about 1 week, or longer if the task is controversial. Include time for that in your timeline (you can work on other parts while waiting for approval!)
- Be realistic about what you think you will be able to achieve during the internship - you won't be able to do everything!
- If you are accepted, we will work together to revise the timeline as the work progresses - it doesn't have to be perfect!
There are no community-specific questions to answer in your application for this project.
Benefits
You will learn, or improve your knowledge of, Python coding. You will gain familiarity with how structured data is maintained on Wikidata, and how it relates to Wikipedia articles.
Community benefits
More interwiki links with Wikipedia articles. Fewer duplicate Wikidata items that need to be merged. Significantly more content from Wikipedia available through Wikidata.