Outreachy Project: Build a tool for analyzing and visualizing reader navigation on wikipedia.
Closed, Resolved · Public

Description

IMPORTANT: Make sure to read the Outreachy participant instructions and communication guidelines thoroughly before commenting on this task. This space is for project-specific questions, so avoid asking questions about getting started, setting up Gerrit, etc. When in doubt, ask your question on Zulip first!

Brief summary

How do readers explore the content of Wikipedia when learning about a specific topic and going down the Wikipedia rabbit-hole? Existing tools such as the pageviews-tool give a rough proxy for the popularity of individual articles, but they do not provide any information about how readers navigate between the millions of different articles. One valuable resource for this question is the Wikipedia clickstream data, a public dataset that records how frequently readers follow specific links. This data is available as a monthly updated dump file for 11 languages. In recent years, researchers have used these datasets extensively, showing, for example, that only a small fraction of all links in Wikipedia is actually used by readers. However, the data is currently not easily accessible without substantial technical skills in data wrangling. In this project, starting from the publicly available data, we want to build a simple API that makes it easy to access and analyze how readers navigate Wikipedia articles and topics.

Specifically, the work will consist of the following phases:

  • Becoming familiar with the clickstream data, i.e. how to download, pre-process and filter the data, what the data structure is, etc.
  • Analysis and visualization of navigation to (incoming) and from (outgoing) a given input article.
  • Build a simple interface for people to test the tool -- e.g., similar to the list-building tool to find related articles.
  • Augmenting data with additional features, such as the topics of articles, or the structure of the link-network (all existing incoming and outgoing links)
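The first two phases could be sketched roughly as follows. This is only a minimal illustration, not the project's implementation: it assumes the clickstream dump is a tab-separated file with four columns (`prev`, `curr`, `type`, `n`) and no header row, and the sample rows, counts, and the helper names `load_clickstream` and `top_k` are made up for the example.

```python
import csv
import io
from collections import defaultdict

# Tiny inline sample in the assumed layout: prev <TAB> curr <TAB> type <TAB> n.
# Rows and counts are invented for illustration; a real dump would be read from a file.
SAMPLE = (
    "other-search\tPython_(programming_language)\texternal\t500\n"
    "Guido_van_Rossum\tPython_(programming_language)\tlink\t120\n"
    "Python_(programming_language)\tGuido_van_Rossum\tlink\t80\n"
    "Python_(programming_language)\tCPython\tlink\t60\n"
)


def load_clickstream(lines):
    """Build incoming/outgoing click counts from tab-separated clickstream rows."""
    incoming = defaultdict(dict)  # curr -> {prev: n}
    outgoing = defaultdict(dict)  # prev -> {curr: n}
    # QUOTE_NONE because article titles may themselves contain quote characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip malformed rows (wrong column count or non-numeric count).
        if len(row) != 4 or not row[3].isdigit():
            continue
        prev, curr, _link_type, n = row
        incoming[curr][prev] = int(n)
        outgoing[prev][curr] = int(n)
    return incoming, outgoing


def top_k(counts, k=10):
    """Return the k (title, n) pairs with the highest click counts."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:k]


incoming, outgoing = load_clickstream(io.StringIO(SAMPLE))
article = "Python_(programming_language)"
top_in = top_k(incoming[article])    # sources readers came from
top_out = top_k(outgoing[article])   # targets readers went to
print(top_in)
print(top_out)
```

For a real dump, the `io.StringIO(SAMPLE)` argument would be replaced by an open file handle; the resulting `top_in` and `top_out` lists are the kind of input a visualization of incoming and outgoing navigation could start from.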

Skills required

  • Python for modeling and data science
Nice to have, but willingness to learn is sufficient:
  • Jupyter notebooks for documentation and visualization of data
  • Any skills in HTML/CSS/JS and general design will also be useful for building the interface to showcase the model

Possible mentor(s)

@MGerlach , @Isaac

Microtasks

see T276315 (Outreachy Application Task: Tutorial for Wikipedia Clickstream data)

Event Timeline

MGerlach changed the visibility from "Public (No Login Required)" to "Outreachy Mentors (Project)". Feb 24 2021, 11:53 AM

@MGerlach @Isaac Whenever you are ready (ideally before the end of this week), follow the steps in Step 3 here https://www.mediawiki.org/wiki/Outreachy/Mentors#_Before_the_program to upload a proposal on the Outreachy site. Let me know if you need any help w/ this.

MGerlach renamed this task from "Build a tool for analyzing and visualizing reader navigation on wikipedia." to "Outreachy Project: Build a tool for analyzing and visualizing reader navigation on wikipedia." Mar 3 2021, 11:52 AM
MGerlach updated the task description.

I submitted the project as a proposal on the Outreachy site. @srishakatux Let me know if I missed anything.

@MGerlach Sounds good! I have approved the project.

srishakatux changed the visibility from "Outreachy Mentors (Project)" to "Public (No Login Required)". Mar 29 2021, 4:18 PM

Hi all, the application task is T276315. Don't hesitate to ask questions on that task. I will try to answer open questions, but feel free to help each other out too.

Hello, I am Preeti Sharma, a machine learning enthusiast and blockchain trainee. I have a question before taking up the task: since there is a lot of data in clickstream and JSON, can we just scrape the data first and then start analyzing it?

Is there a specific way to submit our work for review and checks?

@Leovarmak There is an application task for this project T276315.

MGerlach claimed this task.

The work has been successfully completed.
The tool is live: https://wikinav.toolforge.org/