IMPORTANT: Make sure to read the [Outreachy participant instructions](https://www.mediawiki.org/wiki/Outreachy/Participants) and [communication guidelines](https://www.mediawiki.org/wiki/New_Developers/Communication_tips) thoroughly before commenting on this task. This space is for project-specific questions, so avoid asking questions about getting started, setting up Gerrit, etc. When in doubt, ask your question on [Zulip](https://www.mediawiki.org/wiki/Outreach_programs/Zulip) first!
===Brief summary
How do readers explore the content of Wikipedia when learning about a specific topic and going down the [[ https://en.wikipedia.org/wiki/Wiki_rabbit_hole | Wikipedia rabbit hole ]]? Existing tools such as the [[ https://pageviews.toolforge.org | Pageviews tool ]] give a rough proxy for the popularity of individual articles, but they do not provide any information about how readers navigate through the millions of different articles. One valuable resource for this question is the [[ https://diff.wikimedia.org/2018/01/16/wikipedia-rabbit-hole-clickstream/ | Wikipedia clickstream ]] data, a public dataset recording how frequently readers follow certain links. This data is available as [[ https://dumps.wikimedia.org/other/clickstream/readme.html | a monthly updated dump file ]] for 11 languages. In past years, researchers have used these datasets extensively, showing, for example, that [[ https://arxiv.org/abs/1611.02508 | only a small fraction of all links in Wikipedia is actually used ]] by readers. However, the data is currently not easily accessible without substantial technical skills in data wrangling. In this project, starting from the publicly available data, we want to build a simple API that makes it easy to access and analyze how readers navigate Wikipedia articles and topics.
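To give a flavor of what working with the dump looks like, here is a minimal sketch of loading clickstream-format data with pandas. The sample rows are invented for illustration; the column layout (tab-separated `prev`, `curr`, `type`, `n`, no header row) follows the dump's readme, and in practice you would point `read_csv` at a downloaded `.tsv.gz` file instead of an in-memory string.

```python
import io
import pandas as pd

# Hypothetical sample rows in the clickstream dump format:
# one tab-separated row per (prev, curr) pair with a link type and a count.
sample = io.StringIO(
    "other-search\tWiki_rabbit_hole\texternal\t1200\n"
    "Wikipedia\tWiki_rabbit_hole\tlink\t300\n"
    "Wiki_rabbit_hole\tHyperlink\tlink\t150\n"
)

# The dump has no header row; the documented columns are prev, curr, type, n.
df = pd.read_csv(sample, sep="\t", names=["prev", "curr", "type", "n"])

# A typical pre-processing step: keep only navigation that followed an
# internal wikilink, dropping external traffic (search engines etc.).
links = df[df["type"] == "link"]
print(links)
```

The same call works on a real dump file by passing its path (with `compression="gzip"` handled automatically for `.gz` files).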
Specifically, the work will consist of the following phases:
* Becoming familiar with the clickstream data: how to download, pre-process, and filter it, what its structure is, etc.
* Analyzing and visualizing navigation to (incoming) and from (outgoing) a given input article.
* Building a simple interface for people to test the tool -- e.g., similar to the list-building tool to find related articles.
* Augmenting the data with additional features, such as the topics of articles or the structure of the link network (all existing incoming and outgoing links).
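The incoming/outgoing analysis in the second phase can be sketched in a few lines: incoming traffic for an article is the set of rows where it appears as `curr`, and outgoing traffic the rows where it appears as `prev`. The rows below are invented for illustration, and the helper functions (`incoming`, `outgoing`) are hypothetical names, not part of any existing tool.

```python
import pandas as pd

# Hypothetical clickstream rows (prev, curr, type, n), as in the monthly dump.
rows = [
    ("other-search", "Philosophy", "external", 5000),
    ("Epistemology", "Philosophy", "link", 800),
    ("Philosophy", "Logic", "link", 600),
    ("Philosophy", "Ethics", "link", 400),
]
df = pd.DataFrame(rows, columns=["prev", "curr", "type", "n"])

def incoming(df, article, k=10):
    """Top-k sources of traffic into `article`, by click count."""
    return (df[df["curr"] == article]
            .sort_values("n", ascending=False)
            .head(k)[["prev", "n"]])

def outgoing(df, article, k=10):
    """Top-k destinations readers click through to from `article`."""
    return (df[df["prev"] == article]
            .sort_values("n", ascending=False)
            .head(k)[["curr", "n"]])

print(incoming(df, "Philosophy"))
print(outgoing(df, "Philosophy"))
```

Visualization could then be as simple as a bar chart of these top-k counts, or a small graph of the article's neighborhood.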
===Skills required
* Python for modeling and data science
====Nice to have but willingness to learn is sufficient
* Jupyter notebooks for documentation and visualization of data
* Any skills in HTML/CSS/JS and general design will also be useful for building the interface to showcase the model
===Possible mentor(s)
@MGerlach , @Isaac
===Microtasks
See T276315 (Outreachy Application Task: Tutorial for Wikipedia Clickstream data).