
Start research project on navigation paths "How we read wikipedia"
Closed, Resolved (Public)

Description

Problem Statement
In contrast to Wikipedia’s editor population, little is known about its readers, in large part due to the challenges and restrictions of dealing with privacy-sensitive data. Only recently have we started to characterize Wikipedia’s readership. For example, recent studies (Singer, Lemmerich, et al. 2017; Lemmerich et al. 2019) approached the question of why we read Wikipedia in order to identify the motivation, information need, and prior knowledge of different users. Here, we investigate whether and to what degree this is reflected in how we use Wikipedia. That is, instead of looking at page views as isolated events, we consider users’ full reading sessions in order to characterize patterns of navigation within and across Wikimedia projects and, as a result, better understand the context of usage.

Goals
Empirical characterization of navigation paths of users on Wikipedia.

  • Quantify differences across Wikipedia editions, geographical location, mobile/user access, topical content, etc.
  • Identification of navigation patterns related to the motivation (work, learning, etc.), information need (overview, fact, etc.), and the prior experience (familiar, unfamiliar) of the user.

Approach

  • Collect sample data for navigation paths/trees from webrequest logs for different Wikimedia projects; define a consistent methodology for pre-processing and filtering.
  • Exploratory analysis of navigation paths
    • empirical characterization of paths and quantifying their differences across projects, geography, access-method, etc.
    • supervised/unsupervised clustering to identify and quantify prevalence of different types of navigation
  • compile list of use-cases for applicability across departments (e.g. product) and coordinate possible efforts.

Based on the research brief

A summary of the results in the first phase of exploratory analysis can be found on meta

Event Timeline

Restricted Application added a subscriber: Aklapper. Jan 13 2020, 12:47 PM
MGerlach renamed this task from Exploratory research "How we read wikipedia" to Start research project on navigation paths "How we read wikipedia". Jan 13 2020, 4:41 PM
MGerlach updated the task description.

Weekly update: Finished Research Brief sketching project.

Weekly update: Meeting with Bob, Tiziano, and Akhil. Agreed on a preliminary dataset (1 week of logs) to explore choices and parameters for construction of navigation sessions (only consecutive internal clicks) and reading sessions (consecutive page views within a time-window Delta-t).

leila added a comment. Jan 21 2020, 8:40 PM

thanks for the update. see below.

Weekly update: Meeting with Bob, Tiziano, and Akhil. Agreed on a preliminary dataset (1 week of logs) to explore choices and parameters for construction of navigation sessions (only consecutive internal clicks) and reading sessions (consecutive page views within a time-window Delta-t).

  • Make sure you include data when users switch from one WP language to another, as well as from one project to another. If in this first data-set you don't want to include this data to keep it simple early on, it's fine, but let's not miss it in the second data-set. :)
  • When you say navigation sessions (only consecutive internal clicks): how are you planning to handle the case where the user leaves the site and comes back some seconds later? Will you break down the session? discard the session altogether? or will you use some other approach?
  • Make sure you include data when users switch from one WP language to another, as well as from one project to another. If in this first data-set you don't want to include this data to keep it simple early on, it's fine, but let's not miss it in the second data-set. :)

Yes, we explicitly discussed this. Indeed, in the first dataset, it is not included (enwiki). We will definitely include in later datasets. This became even more clear in discussions in the desktop-refresh session at all-hands: while there is some intuition on language switching, we have limited insights about this.

  • When you say navigation sessions (only consecutive internal clicks): how are you planning to handle the case where the user leaves the site and comes back some seconds later? Will you break down the session? discard the session altogether? or will you use some other approach?

In the view of a navigation session, we will consider this case as two separate sessions. As I understand, the underlying idea is that the sessions (even if in very close time-proximity) can be very different in terms of the topic or intention of the session. This is a very conservative way of grouping pageviews into sessions and, thus, yields hopefully more homogeneous navigation paths. There is a hypothesis that this is a better approach in the context of link-prediction. In addition, we circumvent the (seemingly) arbitrary choice for the cutoff of 1 hour (see below).
At the same time, we will consider 'reading sessions', where we would treat the case described above as one single session if the time difference between the two views is smaller than, e.g., 1 hour. At the moment, we are trying to define a clear cutoff for reading sessions. While previous research indicated a clear bimodal distribution of inter-event times, motivating a well-defined cutoff at 1 hour [1], Tiziano's preliminary results do not show such bimodal behaviour. Thus, we are revisiting this issue to make a well-informed decision.

[1] Halfaker, A., Keyes, O., Kluver, D., Thebault-Spieker, J., Nguyen, T., Shores, K., … Warncke-Wang, M. (2015). User Session Identification Based on Strong Regularities in Inter-activity Time. Proceedings of the 24th International Conference on World Wide Web, 410–418. International World Wide Web Conferences Steering Committee.
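The two session definitions discussed above can be illustrated with a minimal sketch. It is not the project's pipeline: the `PageView` record, its `referer` field, and the rule that a click is "internal" when its referer is a page already seen in the current session are assumptions made for the example, and the 1-hour default is only the cutoff under discussion, not a settled value.

```python
from typing import List, NamedTuple, Optional


class PageView(NamedTuple):
    # Hypothetical record; field names are assumptions for illustration.
    timestamp: float         # seconds since an arbitrary epoch
    page: str                # title of the viewed page
    referer: Optional[str]   # internal page the click came from, None if external


def reading_sessions(views: List[PageView], max_gap: float = 3600.0) -> List[List[PageView]]:
    """Split one user's time-ordered pageviews wherever the gap exceeds max_gap."""
    sessions: List[List[PageView]] = []
    for view in views:
        if sessions and view.timestamp - sessions[-1][-1].timestamp <= max_gap:
            sessions[-1].append(view)
        else:
            sessions.append([view])
    return sessions


def navigation_sessions(views: List[PageView]) -> List[List[PageView]]:
    """Start a new session whenever a view was not reached by an internal click
    from a page already in the current session (e.g. the user left and came back)."""
    sessions: List[List[PageView]] = []
    for view in views:
        seen = {v.page for v in sessions[-1]} if sessions else set()
        if view.referer is not None and view.referer in seen:
            sessions[-1].append(view)
        else:
            sessions.append([view])
    return sessions
```

With this sketch, a user who leaves the site and returns a minute later produces two navigation sessions but a single reading session, which is the distinction drawn in the comment above.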

leila added a comment. Feb 3 2020, 4:38 PM

All clear and makes sense to me. thanks!

Weekly updates:

  • during all-hands, I started many interactions with different teams that were interested in this line of research. I am following up this week to discuss possible use-cases of this research:
    • Alex Hollender, Olga Vasileva from Desktop Refresh
    • Margeigh Novotny from Product Design
    • Alex Stinson from Community Strategy
    • Andrew Green from Tech Fundraising

Weekly update:

  • Started analyzing preliminary data on reading and navigation sessions (1 week of logs from enwiki) and defined core metrics to look at:
    • Distribution of session length
    • Distribution of breadth-vs-depth
    • map sessions into topical space via embeddings (e.g. CR5)
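As a rough illustration of the first two metrics, the sketch below computes a session-length distribution and one possible reading of "breadth vs. depth" for a navigation tree. The task does not define breadth or depth; here depth is assumed to be the number of clicks on the longest path from the first page, and breadth the largest number of pages at any single distance from it. The function names and the `(parent, child)` edge format are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Tuple


def length_distribution(sessions: Sequence[Sequence[Hashable]]) -> Dict[int, float]:
    """Fraction of sessions having each length (number of pageviews)."""
    counts = Counter(len(session) for session in sessions)
    total = sum(counts.values())
    return {length: n / total for length, n in sorted(counts.items())}


def depth_and_breadth(edges: List[Tuple[str, str]], root: str) -> Tuple[int, int]:
    """Depth and breadth of a navigation tree given as (parent, child) clicks.

    Assumed definitions: depth = clicks on the longest path from root,
    breadth = maximum number of pages at the same distance from root.
    """
    children: Dict[str, List[str]] = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    depth, breadth = 0, 1
    level, seen = [root], {root}
    while level:
        breadth = max(breadth, len(level))
        next_level = []
        for node in level:
            for child in children.get(node, []):
                if child not in seen:   # guard against revisited pages
                    seen.add(child)
                    next_level.append(child)
        if next_level:
            depth += 1
        level = next_level
    return depth, breadth
```

A session that opens many links from one page scores high on breadth and low on depth, while a chain of successive clicks does the opposite.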

Weekly update 02/17:

  • solved some technical issues on how to access sample-data via spark (standard kernel settings will fail; experiment with custom kernels in absence of clear recommendations; see this related ticket T245097)
  • ongoing analysis on coarse summary statistics of sessions (distribution length, ...) depending on access-methods
  • obtained topics (embedding vectors as well as from Isaac's wikidata-topic model) for all articles

Upcoming:

  • session-statistics as a function of topic
  • distribution of summary statistics of sessions in embedding space (focus, spread)

Weekly update 02/24:

  • continued exploratory analysis of navigation sessions with respect to topics of articles
  • ran Isaac's wikidata-topicmodel to assign topic to every article (independent of language)
  • topical interest (fraction of pageviews for each topic) largely independent of access-method (desktop, mobile, app)
    • looked at session length as a proxy for engagement
      • mobile on average 1 pageview less than desktop; surprisingly the app almost indistinguishable from desktop
      • *average session length* shows strong variation depending on topic of 1st pageview in session: ~2 (Culture.internet culture, geography.maps) to ~4 (Culture.sports, Culture.Arts)
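The per-topic comparison above amounts to grouping sessions by the topic of their first pageview and averaging their lengths. A minimal sketch, assuming sessions are lists of page titles and `topic_of` is a hypothetical page-to-topic mapping (for example, labels produced by a topic model):

```python
from collections import defaultdict
from typing import Dict, List


def mean_length_by_first_topic(
    sessions: List[List[str]], topic_of: Dict[str, str]
) -> Dict[str, float]:
    """Average session length, grouped by the topic of the session's first page."""
    totals = defaultdict(lambda: [0, 0])  # topic -> [sum of lengths, number of sessions]
    for session in sessions:
        if not session:
            continue
        topic = topic_of[session[0]]
        totals[topic][0] += len(session)
        totals[topic][1] += 1
    return {topic: length_sum / n for topic, (length_sum, n) in totals.items()}
```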

Next:

  • obtain topic-vectors for each article to track jump-sizes in navigation sessions

Weekly update 03/02:

  • obtained embedding vectors for all English articles from text of 1st paragraph
  • validation of embedding vectors via comparison to topic-labels from isaac's wikidata-topicmodel
    • precision@k using annoy (i.e. for a randomly selected article, is there an article with the same topic among the k nearest neighbors?): prec@1>0.8 (random baseline <0.1).
    • visual inspection of 2D-reduced embedding using umap shows strong clustering of individual topics
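The precision@k check described above can be sketched as follows. To keep the example dependency-free it uses a brute-force cosine-similarity search in place of annoy's approximate nearest-neighbour index, so it only suits small samples; the function names and input format are assumptions.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def precision_at_k(vectors: List[Sequence[float]], labels: List[str], k: int = 1) -> float:
    """Fraction of articles with at least one same-topic article among
    their k nearest neighbours (the article itself is excluded)."""
    hits = 0
    for i, query in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        ranked = sorted(others, key=lambda j: cosine(query, vectors[j]), reverse=True)
        if any(labels[j] == labels[i] for j in ranked[:k]):
            hits += 1
    return hits / len(vectors)
```

A random baseline can be estimated with the same function by shuffling `labels` before the call.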

Next:

  • measurement of trajectories in embedding space (e.g. average, spread, jump-size distribution)

Weekly updates 03/09:

  • measuring trajectories in topic space
    • jump size distribution
    • spread/topical coherence of sessions
  • Ongoing:
    • comparison to null models (random sequences of pageviews)
      • draw pageviews randomly according to global popularity (i.e. marginal distribution)
      • simulate random walks on the link-network

Weekly updates 03/16:

  • random walk on link-network as null model for navigation for small subsample
  • correlate topical coherence of sessions to responses from reader-survey
  • presented preliminary results in research-meeting
  • Next:
    • random walk on link-network as null model for navigation for full dataset (move code to spark)
    • robustness checks wrt definition of reading sessions (1 hour cutoff?)

Weekly updates 03/16

  • implemented 2 random walk models for navigation on link network as comparison to real data ([i] all links equally likely, [ii] based on transition probabilities from clickstream data)
  • obtained set of novel results on learning during reading session: topical focus follows characteristic U-shape (i.e. during the first half of the reading session the topical focus becomes narrower on average with clear minimum; in the second half the topical spread grows and becomes even larger than just in a random walk in the link network)
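The two random-walk null models mentioned above can be sketched in one function. This is an illustration under assumptions, not the code that was run: `links` is a hypothetical adjacency mapping of the link network, `click_counts` a hypothetical mapping from `(source, target)` pairs to clickstream counts, and a walk simply stops at a page with no outgoing links or no recorded clicks.

```python
import random
from typing import Dict, List, Optional, Tuple


def random_walk(
    start: str,
    links: Dict[str, List[str]],
    length: int,
    rng: random.Random,
    click_counts: Optional[Dict[Tuple[str, str], float]] = None,
) -> List[str]:
    """Simulate a walk of at most `length` pages on the link network.

    click_counts=None: every outgoing link is equally likely (model i).
    Otherwise: links are chosen proportionally to clickstream counts (model ii).
    """
    path = [start]
    while len(path) < length:
        current = path[-1]
        targets = links.get(current, [])
        if not targets:
            break  # dead end: no outgoing links
        if click_counts is None:
            nxt = rng.choice(targets)
        else:
            weights = [click_counts.get((current, t), 0) for t in targets]
            if sum(weights) == 0:
                break  # no observed clicks from this page
            nxt = rng.choices(targets, weights=weights, k=1)[0]
        path.append(nxt)
    return path
```

Running both variants from the first page of each real session, with the same length, yields synthetic sessions whose topical spread can be compared position by position with the observed ones.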

Next:

  • summarizing results and writing meta-page
  • framing follow-up work in collaboration with Bob, Tiziano, Akhil

leila added a comment. Mar 23 2020, 4:19 PM

You can resolve this task, Martin, and create a next one for the next quarter.

MGerlach closed this task as Resolved. Mar 26 2020, 10:32 AM
MGerlach updated the task description.

added summary of results to meta.
closing this task