How do readers explore the content of Wikipedia when learning about a specific topic and going down the Wikipedia rabbit hole? Existing tools such as the pageviews tool give a rough proxy of the popularity of individual articles; however, they do not provide any information about how readers navigate through the millions of different articles. One valuable resource for this question is the Wikipedia clickstream data, a public dataset recording how frequently readers follow certain links. This data is available as a monthly updated dump file for 11 languages. In past years, researchers have used these datasets extensively, showing, for example, that only a small fraction of all links in Wikipedia is actually used by readers. However, the data is currently not easily accessible without substantial technical skills in data wrangling. In this project, starting from the publicly available data, we want to build a simple API that makes it easy to access and analyze how readers navigate Wikipedia articles and topics.
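To give a feel for the data wrangling involved: each monthly clickstream dump is a tab-separated file with four columns (`prev`, `curr`, `type`, `n`). The sketch below loads a few made-up rows with pandas (an assumption for this illustration; the real dump files are much larger and gzip-compressed) and filters down to article-to-article link clicks.

```python
import io

import pandas as pd

# The clickstream dump is a tab-separated file with four columns:
#   prev - the source of the transition ("other-search", "other-external",
#          or a Wikipedia article title)
#   curr - the target article
#   type - "link", "external", or "other"
#   n    - how often this transition occurred in that month
# A few made-up rows standing in for a real dump file:
sample = io.StringIO(
    "other-search\tWikipedia\texternal\t120\n"
    "Wikipedia\tEncyclopedia\tlink\t45\n"
    "Encyclopedia\tWikipedia\tlink\t30\n"
)

df = pd.read_csv(sample, sep="\t", names=["prev", "curr", "type", "n"])

# Keep only actual article-to-article link transitions.
links = df[df["type"] == "link"]
print(links)
```

The same `read_csv` call works on a downloaded dump file directly (pandas decompresses `.gz` files transparently), so this pattern carries over to the real data.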
Specifically, the work will consist of the following phases:
- Becoming familiar with the clickstream data, i.e. how to download, pre-process, and filter the data, what the data structure is, etc.
- Analyzing and visualizing navigation to (incoming) and from (outgoing) a given input article.
- Building a simple interface for people to test the tool -- e.g., similar to the list-building tool for finding related articles.
- Augmenting the data with additional features, such as the topics of articles or the structure of the link network (all existing incoming and outgoing links).
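The second phase (incoming and outgoing navigation for a given article) could be sketched as below. This is only an illustration assuming pandas and a toy table with the dump's columns; the helper name `top_neighbors` is our own, not part of any existing tool.

```python
import pandas as pd

# Toy clickstream table with the dump's columns (prev, curr, type, n).
df = pd.DataFrame(
    [
        ("other-search", "Cat", "external", 500),
        ("Felidae", "Cat", "link", 120),
        ("Dog", "Cat", "link", 80),
        ("Cat", "Felidae", "link", 60),
        ("Cat", "Kitten", "link", 40),
    ],
    columns=["prev", "curr", "type", "n"],
)

def top_neighbors(df, article, direction="in", k=10):
    """Top-k sources ("in") or targets ("out") of reader navigation
    to/from `article`, restricted to article-to-article link clicks."""
    links = df[df["type"] == "link"]
    if direction == "in":
        sel = links[links["curr"] == article].set_index("prev")
    else:
        sel = links[links["prev"] == article].set_index("curr")
    return sel["n"].sort_values(ascending=False).head(k)

incoming = top_neighbors(df, "Cat", "in")   # who sends readers to "Cat"
outgoing = top_neighbors(df, "Cat", "out")  # where readers of "Cat" go next
```

A simple API endpoint for the tool could essentially wrap a function like this, returning the two ranked lists as JSON.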
Required:
- Python for modeling and data science

Nice to have, but willingness to learn is sufficient:
- Jupyter notebooks for documentation and visualization of data
- Any skills in HTML/CSS/JS and general design will also be useful for building the interface to showcase the tool
See T276315 (Outreachy Application Task: Tutorial for Wikipedia Clickstream data).