
Understanding curious and critical readers (Q2)
Closed, Resolved (Public)

Description

We have sketched a first research proposal around this project to i) scope the problem and ii) reach out to potential collaborators (T288341).

In this quarter we want to

  • put project on meta
  • formalize collaboration
  • announce research project
  • onboard collaborators (moved to next quarter)
  • generate datasets for first exploratory analysis

Event Timeline

Update week 2021-10-25:

  • ongoing discussions to formalize collaboration with potential collaborators
  • starting to explore how to generate relevant datasets

Update week 2021-11-04:

  • following up with formalizing potential collaboration
  • generated datasets for knowledge networks capturing curiosity based on reading sessions from requests from apps
    • spent some time pre-processing this data since app requests are treated differently from desktop/mobile web. for example, namespace_id and page_id are not set; according to data engineering, this is expected and not a bug. in any case, this requires some extra steps such as resolving redirects in the pageview requests
    • exploratory analysis of sessions from 1 month in enwiki yields roughly 2M reading sessions from different unique ids; these reading sessions are much longer (mean=28 pageviews) than for desktop/mobile web since they don't rely on approximate fingerprinting
  • generated datasets for critical reading sessions. as a first step, I identified sessions in which readers view the version_history of an article
    • calls to the version-history are identified via the action=history in the query_uri field
    • I wanted to focus on readers, so some preprocessing needed to be done to filter editors
    • from a first small sample of data, roughly 1 out of 500 pageviews leads to a view of the version-history. while this seems low at first, this number is on the same order of magnitude as the rate at which readers engage with citations (see https://arxiv.org/abs/2001.08614). this still requires further validation, but it is promising for capturing another important way in which readers engage critically with the content.
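
A minimal sketch of the action=history check described above, assuming the query_uri field holds a raw query string such as "?title=Marie_Curie&action=history". The function names are hypothetical and this is not the actual pipeline code:

```python
from urllib.parse import parse_qs


def is_history_request(query_uri: str) -> bool:
    """Return True if the query string carries action=history."""
    # Assumption: the field may or may not start with "?", so strip it.
    params = parse_qs(query_uri.lstrip("?"))
    return "history" in params.get("action", [])


def history_rate(query_uris) -> float:
    """Fraction of requests that are version-history views."""
    uris = list(query_uris)
    if not uris:
        return 0.0
    return sum(is_history_request(u) for u in uris) / len(uris)
```

Parsing the query string instead of substring matching avoids false positives from titles or other parameters that happen to contain the text "action=history".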

Update week 2021-11-15:

  • spent some time figuring out how version-history views are captured in the webrequest logs.
  • for desktop, the signature is in the form https://en.wikipedia.org/w/index.php?title=Marie_Curie&action=history
    • they are captured as normal pageviews (is_pageview=True), i.e. we keep track of the page_title and page_id of the content page
    • the corresponding field action=history is the same across different wikis
  • for mobile (web), calls are handled differently via https://en.m.wikipedia.org/wiki/Special:History/Marie_Curie
    • one of the main limitations is that this signature differs across wikis, e.g. in German the call to version history for the same article is https://de.m.wikipedia.org/wiki/Spezial:Versionsgeschichte/Marie_Curie. this will make it much more cumbersome to capture these events systematically in different languages.
    • however, the number of calls to version-history from mobile is much smaller (less than 10% of what we see in desktop), so in a first approximation, we might focus on the signature from desktop.
  • with this, we can systematically capture the extent to which readers visit the version-history and, more importantly, whether there are specific articles that lead to more/less visits of the version-history.
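
The two signatures above can be sketched as a small URL classifier. This is an illustration only: the alias table is hypothetical and contains just the two examples mentioned in this update, so covering other wikis would need the localized special-page names for each language.

```python
from urllib.parse import parse_qs, unquote, urlsplit

# Hypothetical, incomplete table: localized prefix of the history special page.
HISTORY_PREFIXES = {
    "en": "Special:History/",
    "de": "Spezial:Versionsgeschichte/",
}


def history_title(url: str):
    """Return (signature, title) for a version-history URL, else None."""
    parts = urlsplit(url)
    lang = parts.netloc.split(".")[0]  # e.g. "en" from en.m.wikipedia.org

    # Signature 1: index.php with action=history (same across wikis).
    if parts.path == "/w/index.php":
        params = parse_qs(parts.query)
        if "history" in params.get("action", []) and "title" in params:
            return ("action_history", params["title"][0])
        return None

    # Signature 2: localized special page, as seen on mobile web.
    if parts.path.startswith("/wiki/"):
        page = unquote(parts.path[len("/wiki/"):])
        prefix = HISTORY_PREFIXES.get(lang)
        if prefix and page.startswith(prefix):
            return ("special_page", page[len(prefix):])
    return None
```

For example, `history_title("https://de.m.wikipedia.org/wiki/Spezial:Versionsgeschichte/Marie_Curie")` resolves to the same title as the English mobile URL, but only because "de" is present in the table.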

Update week 2021-11-22:

  • finished pipeline to get all version-history requests in enwiki for a full day applying the following filters/processing steps
    • removing editors
    • removing repeated requests for version-history of the same article by the same reader
    • removing requests to version-history of articles outside main namespace (this is challenging since page title and namespace are not properly recorded for these events in the webrequest table, so this required lots of manual processing)
  • next step: how common is requesting the version-history? are there article properties (quality, reliability, degree of controversy, popularity) which lead to more/fewer requests of version-history?
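
The three filters listed above can be sketched as a single pass over the requests. All field names here are hypothetical, and the sketch assumes the namespace and the editor flag were already resolved upstream (which, as noted, is the hard part in practice):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HistoryRequest:
    reader_id: str     # hypothetical per-reader identifier
    page_title: str
    namespace_id: int  # assumed resolved upstream; 0 = main namespace
    is_editor: bool    # assumed output of an upstream editor filter


def filter_history_requests(requests):
    """Apply the three filters, keeping the first request per reader and page."""
    seen = set()
    kept = []
    for r in requests:
        if r.is_editor:
            continue  # filter 1: remove editors
        if r.namespace_id != 0:
            continue  # filter 3: keep main namespace only
        key = (r.reader_id, r.page_title)
        if key in seen:
            continue  # filter 2: drop repeats by the same reader
        seen.add(key)
        kept.append(r)
    return kept
```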

Update week 2022-01-10:

  • followed up with potential collaborators for signing MOU/NDA to formalize collaboration. received positive response and waiting for necessary information to start process with Legal.

Update week 2022-01-17:

  • preparing the MOUs/NDAs for formalizing the collaboration
  • generated a first dataset for exploratory analysis around critical readers. the dataset captures all interactions of readers with the version-history as well as the article talk page of each article in enwiki in one month.
    • we distinguish whether the reader was logged in or not and whether they attempted to edit a page or not, in order to understand how users who only read the article make use of these features.
    • in addition, I captured several article features to understand whether readers access version-history and talk pages in specific contexts: i) popularity of pages (total pageviews), ii) topic (language-agnostic topic model), iii) quality (language-agnostic quality prediction), iv) reliability (templates from wiki-reliability).
    • analysis in next tasks.
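
As a rough illustration of how the logged-in and edit-attempt distinction could be used, the sketch below counts interactions per feature and reader group. The row layout and group labels are assumptions made for this example, not the schema of the actual dataset:

```python
from collections import Counter


def reader_group(logged_in: bool, attempted_edit: bool) -> str:
    """Hypothetical grouping: separate read-only users from edit attempts."""
    if attempted_edit:
        return "attempted_edit"
    return "logged_in_read_only" if logged_in else "anonymous_read_only"


def count_interactions(rows):
    """Count rows per (feature, reader group).

    rows: iterable of (feature, logged_in, attempted_edit) tuples,
    where feature is "history" or "talk" (assumed labels).
    """
    counts = Counter()
    for feature, logged_in, attempted_edit in rows:
        counts[(feature, reader_group(logged_in, attempted_edit))] += 1
    return counts
```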