IMPORTANT: Make sure to read the [Outreachy participant instructions](https://www.mediawiki.org/wiki/Outreachy/Participants) and [communication guidelines](https://www.mediawiki.org/wiki/New_Developers/Communication_tips) thoroughly before commenting on this task. This space is for project-specific questions, so avoid asking questions about getting started, setting up Gerrit, etc. When in doubt, ask your question on [Zulip](https://www.mediawiki.org/wiki/Outreach_programs/Zulip) first!
===Approved license
I assert that this Outreachy internship project will be released under either an OSI-approved open source license that is also identified by the FSF as a free software license, OR a Creative Commons license approved for free cultural works.
* Yes
=== No proprietary software:
I assert that this Outreachy internship project will forward the interests of free and open source software, not proprietary software.
* Yes
=== How long has your team been accepting publicly submitted contributions?
* 1 year
=== How many regular contributors does your team have?
* 1-2 people
===Brief summary
Wikidata is a structured data repository linked to Wikipedia and the other Wikimedia projects. It holds structured data about a huge number of concepts, including every topic covered by a Wikipedia article, and many scientific papers and other topics. It also includes the interlanguage links between Wikipedia articles in different languages, links from Wikipedia to Commons, and between other Wikimedia projects.
It was started by importing all Wikipedia interwiki links, and has been steadily expanding since. However, when a new Wikipedia article is started, it is not automatically matched to an existing Wikidata item, nor is a new item created for it. For a limited number of wikis, an automated Python script creates new items, but it can easily create duplicate items.
In this project you will match new articles against existing Wikidata items using ancillary data (such as identifiers that appear in both the Wikipedia article and the Wikidata entry). You will start with existing Python scripts, which use the 'pywikibot' package to edit Wikidata, and significantly expand them to handle more situations automatically. This code will then be used live to create new Wikidata items, replacing the existing scripts.
If there is time, you will also expand the code to match categories/articles from other Wikimedia projects, such as Wikimedia Commons or Wikisource, and/or look into creating a Wikidata Game that people can play to add sitelinks in cases that are less clear-cut for an automated tool.
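As a rough illustration of matching on shared identifiers, here is a minimal sketch. It is not the existing script: the URL patterns, the function names and the in-memory `index` are assumptions made for the example, and the sample values are made up. In practice the lookup would go to Wikidata itself rather than to a local dictionary. P214 (VIAF ID) and P345 (IMDb ID) are real Wikidata properties.

```python
import re

# Assumed URL patterns mapped to Wikidata properties
# (P214 = VIAF ID, P345 = IMDb ID). Extend as needed.
ID_PATTERNS = {
    "P214": re.compile(r"viaf\.org/viaf/(\d+)"),
    "P345": re.compile(r"imdb\.com/(?:title|name)/((?:tt|nm)\d+)"),
}


def extract_identifiers(wikitext):
    """Return the set of (property, value) pairs found in external links."""
    found = set()
    for prop, pattern in ID_PATTERNS.items():
        for match in pattern.finditer(wikitext):
            found.add((prop, match.group(1)))
    return found


def match_by_identifier(wikitext, index):
    """Return an item ID only if all identifiers found agree on one item.

    `index` is a hypothetical dict of (property, value) -> item ID that
    stands in for a real Wikidata lookup.
    """
    candidates = {
        index[key] for key in extract_identifiers(wikitext) if key in index
    }
    # Zero candidates means no match; several mean a conflict to review by hand.
    return candidates.pop() if len(candidates) == 1 else None


# Made-up sample data, for illustration only.
SAMPLE_INDEX = {
    ("P214", "12345"): "Q_EXAMPLE_1",
    ("P345", "tt0000001"): "Q_EXAMPLE_2",
}
article = "External links: [https://viaf.org/viaf/12345 VIAF record]"
print(match_by_identifier(article, SAMPLE_INDEX))  # Q_EXAMPLE_1
```

Returning `None` when identifiers disagree is a deliberate choice here: a conflicting article is better left for a human than linked to the wrong item.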
This project is mentored by Mike Peel. Knowledge of Python is an advantage, although it can be learnt during the project. Knowledge of machine learning techniques might be useful, but the matching can also be done with non-ML approaches. Knowing multiple human languages is useful for working with multiple Wikipedia language communities, but is not required.
===Minimum system requirements
You will need a computer with a working Python 3 installation; you can install pywikibot and other useful modules using standard package systems.
===How can applicants make a contribution to your project?
You will start by understanding how Wikidata works, looking through Wikipedia articles and seeing how the information is stored on Wikidata. From there you will identify patterns that can be used to match articles with the corresponding Wikidata item where that link does not already exist. You will then code up automated matching functions and test how well they work with currently unmatched articles. Ultimately, these will be integrated into the live code to keep Wikidata and the different language Wikipedias in sync.
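One possible way to test a matching function is to run it over a small hand-checked sample and count the outcomes. The sketch below assumes a matcher is any function that takes article text and returns an item ID or `None`; the names `evaluate_matcher` and `toy_matcher` and the sample data are hypothetical.

```python
def evaluate_matcher(matcher, labelled_articles):
    """Count outcomes of `matcher` over (article_text, expected_id) pairs.

    expected_id is None when the article has no existing Wikidata item.
    """
    counts = {"correct": 0, "wrong": 0, "missed": 0}
    for text, expected in labelled_articles:
        predicted = matcher(text)
        if predicted == expected:
            counts["correct"] += 1   # right item, or correctly no match
        elif predicted is None:
            counts["missed"] += 1    # an item exists but was not found
        else:
            counts["wrong"] += 1     # would have linked to the wrong item
    return counts


def toy_matcher(text):
    """Stand-in matcher, used only to make the example runnable."""
    return "Q_A" if "alpha" in text else None


sample = [
    ("alpha article", "Q_A"),
    ("beta article", "Q_B"),
    ("alpha again", "Q_C"),
    ("gamma", None),
]
print(evaluate_matcher(toy_matcher, sample))
# {'correct': 2, 'wrong': 1, 'missed': 1}
```

For a bot that edits live, the "wrong" count matters most: a missed match can be picked up later, while a wrong sitelink has to be found and undone.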
You will need to create an account on Wikipedia (if you don't already have one), and install the pywikibot package (https://www.mediawiki.org/wiki/Manual:Pywikibot). I can provide guidance for each specific starting task, and in general please feel free to ask questions through Outreachy, by email, or at https://www.wikidata.org/wiki/User_talk:Mike_Peel .
===Repository
https://bitbucket.org/mikepeel/wikicode/
===Issue tracker
N/A
===Tasks
There are three 'starter' tasks that can be done as Outreachy contributions. These aim to guide you through how Wikipedia and Wikidata are structured, and how Pywikibot interacts with them. They get progressively harder, but you don't have to do them in order (except you must do task 1 first!), and you don't have to do all of them.
1. T290719 Look through a class of articles (e.g., books, authors, games, etc.) and identify what information is in common between Wikipedia and Wikidata
2. T290720 Set up pywikibot on your computer, and understand how it interacts with Wikidata
3. T290721 Write a function that searches Wikidata for a match to a term
These tasks also form the start of the main project, which will integrate them into an automated script that adds sitelinks for new articles to existing items, or creates new items where there is no match.
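To give a feel for the third task, here is a minimal sketch of a term search. It is only one possible approach: it calls the public `wbsearchentities` API module directly with the standard library rather than going through pywikibot, and the function names and User-Agent string are assumptions made for the example.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.wikidata.org/w/api.php"


def build_search_url(term, language="en", limit=5):
    """Build a wbsearchentities request URL for `term`."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_search_results(payload):
    """Turn the decoded JSON response into (id, label, description) tuples."""
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in payload.get("search", [])
    ]


def search_wikidata(term, language="en", limit=5):
    """Query Wikidata over the network and return candidate items."""
    request = urllib.request.Request(
        build_search_url(term, language, limit),
        # Placeholder User-Agent; replace with one that identifies your tool.
        headers={"User-Agent": "wikidata-match-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_search_results(json.load(response))


if __name__ == "__main__":
    for item_id, label, description in search_wikidata("Douglas Adams"):
        print(item_id, label, description)
```

Keeping URL building and response parsing separate from the network call means both can be checked offline against a saved response. A search on the label alone will often return several candidates, so the results still need to be compared against other data before a sitelink is added.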
When filling in your application, you will be asked about a timeline for the work during the project. I encourage you to draft a rough timeline yourself, bearing in mind:
* The aim of the project is to match all new Wikipedia articles with Wikidata items, but this will be done in stages (e.g., different topic areas, like buildings vs. statues; different language wikis; drafting vs. testing on different pages vs. running code)
* Large runs to import the data will need bot approval (can be 2 weeks, can be longer if controversial), and you should include time for that (waiting for approval while working on other parts!)
* Be realistic about what you think you will be able to achieve during the internship - you won't be able to do everything!
* If you are accepted, we will work together to revise the timeline as the work progresses - it doesn't have to be perfect!
There are no community-specific questions to answer in your application for this project. If you can demonstrate general knowledge of the community, or previous Python coding activities, in your application, that will be really helpful.
=== Benefits
You will learn, or improve your knowledge of, Python coding. You will gain familiarity with how structured data is maintained on Wikidata, and how it relates to Wikipedia articles.
=== Community benefits
More interwiki links with Wikipedia articles. Fewer duplicate Wikidata items that need to be merged.
=== Questions?
Please feel free to ask questions in this Phabricator task, or in the subtasks. You can also email me if you want (my address is available via Outreachy).