Page MenuHomePhabricator

[long] Understanding data voids on Wikipedia
Open, Needs TriagePublic

Description

Overview

Readers come to Wikipedia as a valuable source of information for a wide variety of reasons. Past research has identified the main reasons as: work/school-related assignment, making a personal decision (e.g., buy a book, choose a travel destination), learning about a current event, reading more about a topic that was referenced in a piece of media (e.g., TV, radio, article, film, book) or that came up in a conversation, randomly exploring Wikipedia for fun, and the topic being personally important (e.g., to learn about a culture).

Hopefully, regardless of the context, the reader finds high-quality content that satisfies their need. On occasion, however, many readers might be interested in a topic but find low-quality content about it. When this is about particularly impactful topics -- e.g., politics -- this can be especially unfortunate. The editor community generally does an excellent job of avoiding these issues -- e.g., the vast organization and work done to improve content on Covid-19 and other current events -- but sometimes the spike in interest happens faster than the editor community can respond, as might be the case for topics that are referenced in a piece of media .

Past research provides some insight into the nature of these low-quality, high-interest areas. Warncke-Wang et. al. evaluated what types of content tend to be misaligned -- i.e. high demand but of low quality. Golebiewski and boyd propose the term data void and describe instances where a lack of content can be particularly problematic. The social media traffic report captures articles that receive a lot of traffic directly from common social media platforms, though this content is not juxtaposed with information about content quality.

Task

You will conduct research to better understand characteristics of data voids on Wikipedia and how they might be addressed (if possible). In particular, you will focus on low-quality content that has existed for a while and received minimal pageviews but then experiences a large spike in interest from readers. This work will likely have the following steps:

  • Identify a basic heuristic for detecting spikes in reader interest in a given article.
  • Collect articles within the past year that were at least six months old and experienced spikes in reader interest.
  • Take a sample of these articles -- at least 50 -- and qualitatively code them for what caused the spike in interest and how predictable it was, how impactful the topic was, and what impact the spike had on editorship and article quality.

This task is considered [long]. In general, it's expected that the task will take a few months of consistent work and is a good fit for someone with some research experience or interest in being involved in research. The actual time needed, however, will depend greatly on your level of experience.

Rationale

Little is known about data voids on Wikipedia but, by definition, they can be highly impactful to readers and likely addressed with concerted action. If this research can identify data voids that may have been preventable or potential approaches to early-warning systems, this could hopefully lead to the design of tools to support editors in addressing these voids.

Recommended Skills

  • This task will require some basic understanding of Python for collecting article data.
  • Additional experience with the following is helpful but not necessary:
    • Wikimedia dumps and APIs, particularly for pageviews and potentially article quality
    • Qualitative coding

Acceptance Criteria

  • The output of this task will be a Meta report describing the research and findings (example). Depending on researcher and mentor interest, this could be expanded into a more formal publication.
  • At each stage of the analysis, the choices should be carefully grounded in past research and validated by the mentor.

Process

  • If you are interested in this task and it is not assigned to anyone, you may begin work on it. Please leave a comment on the task and tag the mentor so that they are aware.
  • The first step is a literature review of existing scholarly work on spikes in reader interest (burstiness) on Wikipedia -- e.g., Miz et al. What is Trending on Wikipedia?.
  • If you have made some progress on the task (a draft literature review and recommendation for approach to identifying articles that received a spike in interest) and would like to complete it, share a link to your current draft and let the mentor know so that they can assign the task to you and help you to plot out the next steps.
  • Generally, <mentor> will be able to answer any questions about the task and try to respond quickly when clarification is necessary.

Resources