Approved license
I assert that this Outreachy internship project will be released under either an OSI-approved open source license that is also identified by the FSF as a free software license, OR a Creative Commons license approved for free cultural works.
- Yes
No proprietary software:
I assert that this Outreachy internship project will forward the interests of free and open source software, not proprietary software.
- Yes
How long has your team been accepting publicly submitted contributions?
- 1 year
How many regular contributors does your team have?
- 1-2 people
Brief summary
Wikidata is a structured data repository linked to Wikipedia and the other Wikimedia projects. It holds structured data about a huge number of concepts, including every topic covered by a Wikipedia article, and many scientific papers and other topics. It also includes the interlanguage links between Wikipedia articles in different languages, links from Wikipedia to Commons, and between other Wikimedia projects.
It was started by importing all Wikipedia interwiki links, and has been steadily expanding since. However, when a new Wikipedia article is started, it is not automatically matched to Wikidata items, nor is a new item created for it. For a limited number of wikis, an automated Python script creates new items, but it can easily create duplicate items.
In this project you will match new articles against existing Wikidata items using ancillary data (such as identifiers that are common to both the Wikipedia article and the Wikidata entry). You will start with existing Python scripts, which use the 'pywikibot' package to edit Wikidata, and significantly expand them to handle more situations automatically. This code will then be used live to create new Wikidata items, replacing the existing scripts.
If there is time, you will also expand it to work with matching categories/articles from other Wikimedia projects, such as Wikimedia Commons or Wikisource, and/or look into creating a Wikidata Game that people can play to add sitelinks in cases where it's less clear for an automated tool.
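As a rough illustration of the identifier-based matching described above, the sketch below compares identifiers found in an article with those recorded on candidate items. The function name, the data layout and the sample values are assumptions made for illustration and are not taken from the existing scripts. P214 (VIAF ID) and P345 (IMDb ID) are real Wikidata properties, but the values and item IDs shown are placeholders.

```python
def match_by_identifiers(article_ids, candidates):
    """Return the IDs of candidate items that share an identifier with the article.

    article_ids: dict mapping a property ID (e.g. "P214") to the value found in the article.
    candidates:  dict mapping an item ID to a dict of property ID -> set of values.
    """
    matches = []
    for qid, claims in candidates.items():
        for prop, value in article_ids.items():
            if value in claims.get(prop, set()):
                matches.append(qid)
                break  # one shared identifier is enough for this item
    return matches


# Placeholder sample data: the identifier values and item IDs are made up.
article_ids = {"P214": "000000001", "P345": "nm0000000"}
candidates = {
    "Q1000001": {"P214": {"000000001"}},
    "Q1000002": {"P214": {"999999999"}},
}
print(match_by_identifiers(article_ids, candidates))
```

In a scheme like this, exactly one match would be a good candidate for an automatic sitelink, while zero or several matches would be the less clear cases that need another approach or human review.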
This project is mentored by Mike Peel. Knowledge of Python is an advantage, although it can be learnt during the project. Knowledge of machine learning techniques might be useful, although the matching can also be done with non-ML approaches. Knowing multiple human languages is useful for working with multiple Wikipedia language communities, but is not required.
Minimum system requirements
You will need a computer with a working Python 3 installation; you can install pywikibot and other useful modules using standard package systems.
How can applicants make a contribution to your project?
You will start by understanding how Wikidata works, looking through Wikipedia articles and seeing how the information is stored on Wikidata. From there you will identify patterns that can be used to match articles with the Wikidata item if that link does not already exist. You will then code up automated matching functions and test how well they work with currently unmatched articles. Ultimately, these will be integrated into the live code to keep Wikidata and the different language Wikipedias in sync.
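One possible way to get a first estimate of how well a matching function works, before trying it on unmatched articles, is to run it on articles whose Wikidata item is already known and count how often it agrees. This is a suggestion rather than the project's prescribed method; the function names, titles and item IDs below are hypothetical.

```python
def evaluate_matcher(matcher, labelled_articles):
    """Count how often `matcher` agrees with already-known article -> item links.

    matcher: callable taking an article title and returning an item ID or None.
    labelled_articles: dict mapping an article title to the item ID it is known to link to.
    """
    correct = wrong = unmatched = 0
    for title, known_qid in labelled_articles.items():
        guess = matcher(title)
        if guess is None:
            unmatched += 1
        elif guess == known_qid:
            correct += 1
        else:
            wrong += 1
    return {"correct": correct, "wrong": wrong, "unmatched": unmatched}


# Toy matcher backed by a lookup table; a real one would query Wikidata.
toy_lookup = {"Article A": "Q1000001", "Article B": "Q1000009"}
known_links = {
    "Article A": "Q1000001",  # the matcher gets this one right
    "Article B": "Q1000002",  # the matcher picks the wrong item
    "Article C": "Q1000003",  # the matcher finds nothing
}
print(evaluate_matcher(toy_lookup.get, known_links))
```

For a bot that edits live data, the "wrong" count matters most: a matcher that leaves articles unmatched is safer than one that links them to the wrong item.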
You will need to create an account on Wikipedia (if you don't already have one), and install the pywikibot package (https://www.mediawiki.org/wiki/Manual:Pywikibot). I can provide guidance for each specific starting task, and in general please feel free to ask questions through Outreachy, by email, or at https://www.wikidata.org/wiki/User_talk:Mike_Peel .
Repository
https://bitbucket.org/mikepeel/wikicode/
Issue tracker
N/A
Tasks
There are three 'starter' tasks that can be done as Outreachy contributions. These aim to guide you through how Wikipedia and Wikidata are structured, and how Pywikibot interacts with them. They get progressively harder, but you don't have to do them in order (except you must do task 1 first!), and you don't have to do all of them.
- T290719 Look through a class of articles (e.g., books, authors, games, etc.) and identify what information is in common between Wikipedia and Wikidata
- T290720 Set up pywikibot on your computer, and understand how it interacts with Wikidata
- T290721 Write a function that searches Wikidata for a match to a term
These tasks also form the start of the main project, which will integrate them into an automated script that adds sitelinks for new articles to existing items, or creates new items where there is no match.
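To give a feel for the third task, here is a minimal sketch of a term search. It calls the Wikibase 'wbsearchentities' API module directly with the Python standard library rather than going through pywikibot, so that it can be tried without a configured bot account; the task itself may expect a pywikibot-based solution. The function names, default values and User-Agent string are assumptions, and the sample response is hand-written rather than real API output.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.wikidata.org/w/api.php"


def build_search_url(term, language="en", limit=5):
    """Build a wbsearchentities request URL for `term`."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_search_response(data):
    """Extract (item ID, label, description) tuples from a decoded API response."""
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in data.get("search", [])
    ]


def search_wikidata(term, language="en", limit=5):
    """Query the live API (needs network access) and return the parsed hits."""
    request = urllib.request.Request(
        build_search_url(term, language, limit),
        headers={"User-Agent": "wikidata-search-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_search_response(json.load(response))


# Hand-written sample in the shape of a wbsearchentities response.
sample_response = {
    "search": [
        {"id": "Q1000001", "label": "Example term", "description": "placeholder item"},
        {"id": "Q1000002", "label": "Example term (album)"},
    ]
}
print(parse_search_response(sample_response))
```

Keeping the URL building and response parsing separate from the network call means both can be checked offline; 'search_wikidata("some term")' then performs the live lookup.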
Application and timeline
The Outreachy positions are assessed solely on the contributions and the application you submit for the project; the best things you can do are to do well with the contributions, and include all relevant information in your application. Contributions are evaluated based on their completeness, coding style, and any additional work beyond the core task. I generally look for applicants who have demonstrated that they understand the tasks and the Wikimedia community.
When filling in your application, you will be asked about a timeline for the work during the project. I encourage you to draft a rough timeline yourself, bearing in mind:
- You should split the timeline into periods, e.g., weekly or two-weekly, and write a short summary of what you expect to be doing in that period.
- The aim of the project is to match all new Wikipedia articles with Wikidata items, but this will be done in stages (e.g., different topic areas, like buildings vs. statues; different language wikis; drafting vs. testing on different pages vs. running code)
- Large runs to add sitelinks will need bot approval (can be 2 weeks, can be longer if controversial), and you should include time for that (waiting for approval while working on other parts!)
- Be realistic about what you think you will be able to achieve during the internship - you won't be able to do everything!
- If you are accepted, we will work together to revise the timeline as the work progresses - it doesn't have to be perfect!
There are no community-specific questions to answer in your application for this project. If you can demonstrate general knowledge of the community, or previous Python coding activities, in your application, that will be really helpful.
Also, please bear in mind that I can only accept one intern for this project, so I would strongly recommend contributing to multiple Outreachy projects (particularly those with few applicants) to increase your chances of getting an internship.
Benefits
You will learn, or improve your knowledge of, Python coding. You will gain familiarity with how structured data is maintained on Wikidata, and how it relates to Wikipedia articles.
Community benefits
More interwiki links with Wikipedia articles. Fewer duplicate Wikidata items that need to be merged.
Questions?
Please feel free to ask questions in this Phabricator task, or in the subtasks. You can also email me if you want (my address is available via Outreachy).