IMPORTANT: Make sure to read the [Outreachy participant instructions](https://www.mediawiki.org/wiki/Outreachy/Participants) and [communication guidelines](https://www.mediawiki.org/wiki/New_Developers/Communication_tips) thoroughly before commenting on this task. This space is for project-specific questions, so avoid asking questions about getting started, setting up Gerrit, etc. When in doubt, ask your question on [Zulip](https://www.mediawiki.org/wiki/Outreach_programs/Zulip) first!
===Approved license
I assert that this Outreachy internship project will be released under either an OSI-approved open source license that is also identified by the FSF as a free software license, OR a Creative Commons license approved for free cultural works.
* Yes
===No proprietary software
I assert that this Outreachy internship project will forward the interests of free and open source software, not proprietary software.
* Yes
===How long has your team been accepting publicly submitted contributions?
* None
===How many regular contributors does your team have?
* 1-2 people
===Brief summary
Wikidata is a structured data repository linked to Wikipedia and the other Wikimedia projects. It holds structured data about a huge number of concepts, including every topic covered by a Wikipedia article, as well as many scientific papers and other topics. It also includes the interlanguage links between Wikipedia articles in different languages, links from Wikipedia to Commons, and links between other Wikimedia projects.
It was started by importing all Wikipedia interwiki links, and has been steadily expanding since. However, when a new Wikipedia article is started, it is not automatically matched to an existing Wikidata item, nor is a new item created for it. For a limited number of wikis, an automated Python script creates new items, but it can easily create duplicate items. Additionally, there is a lot of information in Wikipedia articles that has not yet been imported into Wikidata.
In this project you will take the existing Python scripts, which use the 'pywikibot' package to edit Wikidata, and expand them to match new articles against existing Wikidata items using ancillary data (such as identifiers that appear in both the Wikipedia article and the Wikidata item). You will also significantly increase the number of properties that are bot-imported from Wikipedia to Wikidata, and explore concepts like importing references for this information as well. This code will then be used live to create new Wikidata items, replacing the existing scripts.
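The identifier-based matching described above can be illustrated offline with plain Python. Everything here is invented sample data for illustration: the regular expressions, the sample wikitext, and the lookup table are placeholders for what the real scripts would obtain from the live sites via pywikibot or SPARQL queries.

```python
import re

# Map from (property, value) pairs to Wikidata item IDs.
# In the real scripts this lookup would come from a query against
# Wikidata; here it is a hand-made sample table for illustration.
KNOWN_IDENTIFIERS = {
    ("ISBN", "978-0-14-044913-6"): "Q208160",
    ("VIAF", "95218067"): "Q9068",
}

def extract_identifiers(wikitext):
    """Pull candidate external identifiers out of article wikitext."""
    found = []
    for isbn in re.findall(r"ISBN\s*=\s*([0-9X-]{10,17})", wikitext):
        found.append(("ISBN", isbn))
    for viaf in re.findall(r"VIAF\s*=\s*(\d+)", wikitext):
        found.append(("VIAF", viaf))
    return found

def match_article(wikitext):
    """Return the first Wikidata item sharing an identifier, or None."""
    for ident in extract_identifiers(wikitext):
        if ident in KNOWN_IDENTIFIERS:
            return KNOWN_IDENTIFIERS[ident]
    return None

sample = "{{Infobox book | ISBN = 978-0-14-044913-6 }}"
print(match_article(sample))  # matches the sample table entry
```

The key idea is that an identifier shared between the article and an item is strong evidence of a match, whereas matching on titles alone is what leads to the duplicate items mentioned above.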
This project is mentored by Mike Peel. Knowledge of Python is an advantage, although it can be learnt during the project. Knowing multiple human languages is useful to work with multiple Wikipedia language communities, but is not required.
===Minimum system requirements
You will need a working Python 3 installation on your computer.
===How can applicants make a contribution to your project?
You will start by understanding how Wikidata works, looking through Wikipedia articles and seeing how the information is stored on Wikidata. From there you will identify patterns that can be used to automatically import that data, and ways of matching articles with the corresponding Wikidata item if that link does not already exist. You will then implement automated matching functions and test how well they work on currently unmatched articles. Ultimately, these will be integrated into the live code to keep Wikidata and the different language Wikipedias in sync.
You will need to create an account on Wikipedia (if you don't already have one), and install the pywikibot package (https://www.mediawiki.org/wiki/Manual:Pywikibot). I can provide guidance for each specific starting task, and in general please feel free to ask questions through Outreachy, by email, or at https://www.wikidata.org/wiki/User_talk:Mike_Peel .
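As a first concrete step, the "parse an article for specific information" part can be prototyped on a plain wikitext string before pywikibot is involved. The template and field names below are invented sample data, and the parsing is deliberately naive:

```python
import re

def parse_infobox(wikitext):
    """Extract simple key=value parameters from an infobox-like template.

    A deliberately naive sketch: real infoboxes nest templates and
    links, for which a proper parser (such as the mwparserfromhell
    library commonly used alongside pywikibot) is a better fit.
    """
    params = {}
    for key, value in re.findall(r"\|\s*(\w+)\s*=\s*([^|}]+)", wikitext):
        params[key] = value.strip()
    return params

sample = """{{Infobox person
| name = Ada Lovelace
| birth_date = 1815
}}"""
print(parse_infobox(sample))
```

Mapping parameters like these onto Wikidata properties (and deciding when the mapping is trustworthy enough to import automatically) is the core of the import task.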
===Repository
https://bitbucket.org/mikepeel/wikicode/
===Issue tracker
N/A
===Intern tasks
* T278860 Look through a class of articles (e.g., books, authors, games, etc.) and identify how the information in the article matches with the Wikidata properties for that topic
* T278863 Set up pywikibot on your computer, and understand how it interacts with Wikidata
* Write a function that reads in a Wikipedia article, parses it for specific information, and prints it to screen
* Write a function that writes new structured data into Wikidata (ideally linked to the function above)
* Write a function that searches Wikidata for a match to a new article
* Integrate these into an automated script that adds sitelinks to existing items, or creates new items where there is no match
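The task list above can be sketched end to end with an in-memory stand-in for Wikidata. Every name here is hypothetical; the live version would use pywikibot's item-editing methods against the real repository instead of a dictionary:

```python
# In-memory stand-in for the Wikidata repository, keyed by item ID.
# This is a hypothetical sketch of the create-or-link decision only.
wikidata = {
    "Q1": {"labels": {"en": "Douglas Adams"}, "sitelinks": {}},
}
next_id = 2

def find_match(title):
    """Return the ID of an item whose English label equals the title."""
    for item_id, item in wikidata.items():
        if item["labels"].get("en") == title:
            return item_id
    return None

def link_or_create(site, title):
    """Add a sitelink to a matching item, or create a new item."""
    global next_id
    item_id = find_match(title)
    if item_id is None:
        item_id = f"Q{next_id}"
        next_id += 1
        wikidata[item_id] = {"labels": {"en": title}, "sitelinks": {}}
    wikidata[item_id]["sitelinks"][site] = title
    return item_id

print(link_or_create("enwiki", "Douglas Adams"))    # links the existing item
print(link_or_create("enwiki", "Brand-new article"))  # creates a new item
```

In practice `find_match` is the hard part: it would combine labels, identifiers, and other ancillary data, since matching on the title alone is exactly what produces duplicates.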
===Intern benefits
You will learn, or improve your knowledge of, Python coding. You will gain familiarity with how structured data is maintained on Wikidata, and how it relates to Wikipedia articles.
===Community benefits
More interwiki links with Wikipedia articles. Fewer duplicate Wikidata items that need to be merged. Significantly more content from Wikipedia available through Wikidata.