
Curiosity: exploratory analysis and onboarding collaborators (Q3)
Closed, Resolved (Public)

Description

In https://phabricator.wikimedia.org/T293037 we generated a first dataset for understanding critical readers and kicked off a new formal collaboration.
In Q3, the goals are:

  • exploratory analysis of reader interactions with version history and talk pages
  • onboard collaborators, coordinate start of the collaboration

Event Timeline

Update week 2022-02-07:

  • coordinating first steps and onboarding with collaborators
  • starting exploratory analysis on critical readers by looking at click-through rates (CTR) from articles to their version-history and talk pages (based on 1 month of webrequest logs):
    • version-history and talk-pages are visited in roughly 1 out of 1000 pageviews
    • these visits cannot be attributed only to editors: a substantial fraction of these visits come from readers who do not attempt to edit (though the CTR is lower when restricting to non-editors)
    • notable exception: visits to version-history are very rare on mobile-web devices

Version-history

CTR: 1/n pageviews (percentage)    all readers           readers (non-editors)
desktop                            1/735 (0.136%)        1/1,604 (0.062%)
mobile web                         1/27,615 (0.004%)     1/38,263 (0.003%)
desktop+mobile_web                 1/1,926 (0.052%)      1/4,101 (0.024%)
  • version-history requests come almost exclusively from desktop readers (requests from mobile web are more than an order of magnitude rarer and thus negligible)
  • considering all readers, roughly 1 out of 700 pageviews results in a request to a version-history
  • not all of the version-history requests come from editors; when only considering readers who have not edited (or attempted to edit), roughly 1 out of 1600 pageviews yields a request for a version-history (roughly half the CTR of all readers)

Talk-pages

CTR: 1/n pageviews (percentage)    all readers           readers (non-editors)
desktop                            1/1,664 (0.060%)      1/2,333 (0.043%)
mobile web                         1/1,085 (0.092%)      1/1,108 (0.090%)
desktop+mobile_web                 1/1,243 (0.080%)      1/1,370 (0.073%)
  • visits to talk-pages are on the same order of magnitude for desktop and mobile web
  • roughly 1 out of 1000 pageviews results in a visit to a talk-page of an article
  • for desktop, the CTR drops to about 1 out of 2000 when only considering readers without edits, but it is essentially unchanged for requests from mobile web (about 1 out of 1100)
  • Next step: investigate which pages yield high and low click-through rates to version-history and talk-pages
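
The "1/n (percentage)" notation used in the tables above can be reproduced with a small helper. This is a minimal sketch, not the code used for the analysis: the function name and the raw counts are hypothetical, chosen only so that the output matches the desktop row for version-history.

```python
def click_through_rate(pageviews: int, clicks: int) -> str:
    """Format a click-through rate as '1/n (p%)', as in the tables above."""
    if pageviews <= 0 or clicks <= 0:
        raise ValueError("pageviews and clicks must be positive")
    n = round(pageviews / clicks)      # one click per n pageviews
    pct = 100 * clicks / pageviews     # the same rate as a percentage
    return f"1/{n:,} ({pct:.3f}%)"


# Hypothetical counts, not taken from the webrequest data.
print(click_through_rate(735_000, 1_000))  # -> 1/735 (0.136%)
```

Reporting both forms is convenient because the percentages are too small to compare at a glance, while the 1/n form makes order-of-magnitude differences obvious.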

Update week 2022-02-14:

  • Some additional comments on the global click-through rates (CTR) from the previous week. For the version-history, the CTR is more than an order of magnitude lower on mobile than on desktop. I realized that a potential cause for this difference is the position of the corresponding button: on desktop (higher CTR) it is at the top, whereas on mobile (lower CTR) it is at the bottom. Interestingly, the CTR for article-talk pages is similar on desktop and mobile; here the button is at the top of the article in both cases. One hypothesis is that readers on mobile would use the version-history more often if it were displayed more prominently at the top.
  • Fixed some bugs in extracting the requests to version-history. I observed that some pages had click-through rates to their version-history larger than 1. This was caused by a substantial number of seemingly automated requests to version-history via feed readers, where the query string contains "feed=rss". Adding a filter that excludes these requests solved the problem. The overall results do not change.
  • I did some analysis to understand for which types of pages readers look up the talk-pages and version-histories. I compared click-through rates with the following article features: popularity (number of pageviews), editing activity (number of edits to the article or the article-talk page, respectively), quality, topic, and reliability (existence of one of the WikiReliability templates). Performing a linear regression and inspecting the corresponding coefficients gave a first rough look at the most predictive features.

Talk-pages: articles with a higher CTR had lower quality or carried the hoax template, the POV template, or the disputed template. While some of these templates introduce an explicit text link to the article's talk page at the top of the main article, it is nevertheless interesting to observe that readers follow these links. More generally, it is encouraging that the CTR to talk-pages is correlated with quality and reliability issues of articles.
Version-history: articles with a higher CTR had higher quality, higher editing activity (in both the main namespace and the article-talk namespace), and the Third-party template. This suggests different dynamics for when readers use the version-history and will require further analysis.
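
The regression step mentioned above (fit a linear model of the click-through rate on article features, then read off the coefficients) can be illustrated as follows. This is only a sketch under stated assumptions: the two features, the data, and the helper name `fit_ols` are all hypothetical, and the targets are generated from known coefficients so that the fit can be checked. It does not reproduce the actual model or its results.

```python
def fit_ols(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y.

    Returns [intercept, coef_1, ..., coef_k].
    """
    X = [[1.0] + [float(v) for v in row] for row in rows]  # intercept column
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; coefficients are not unique")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Synthetic example: (has_template, quality) per article. Both features are
# hypothetical stand-ins for the article features listed above.
rows = [(0, 0.2), (1, 0.2), (0, 0.8), (1, 0.9), (0, 0.5)]
# Targets built from known coefficients, so the fit should recover them.
targets = [0.5 + 2.0 * t - 1.0 * q for t, q in rows]
coefs = fit_ols(rows, targets)  # approximately [0.5, 2.0, -1.0]
```

With real data the features would typically be standardized first, so that coefficient magnitudes are comparable across features with different scales.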

Update week 2022-02-21:

  • kick-off meeting with collaborators. starting the technical onboarding for access to analytics cluster
  • generated datasets of 2M knowledge networks from reading sessions, following the methodology of Lydon-Staley et al. (2020)

Update week 2022-02-28:

Update week 2022-03-07:

  • still working through steps to get access for collaborators
  • revisiting technical steps to adapt methodology used for generating and analysing knowledge networks to our case (e.g. how to take into account that the size of the knowledge networks varies a lot when calculating network properties such as clustering)
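
To make the size problem concrete: the clustering of a small network is hard to interpret on its own, so it is usually compared against a reference with the same number of nodes and edges. The sketch below is an illustration under assumptions, not the project's method: edges are drawn between consecutively visited pages, and the reference is a uniformly random graph of the same size. All function names and the example session are hypothetical.

```python
import random
from itertools import combinations


def session_to_edges(session):
    """Undirected edges between consecutively visited, distinct pages."""
    return {frozenset(p) for p in zip(session, session[1:]) if p[0] != p[1]}


def average_clustering(nodes, edges):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    neighbours = {n: set() for n in nodes}
    for edge in edges:
        a, b = tuple(edge)
        neighbours[a].add(b)
        neighbours[b].add(a)
    total = 0.0
    for n in nodes:
        k = len(neighbours[n])
        if k < 2:
            continue
        links = sum(1 for u, v in combinations(neighbours[n], 2)
                    if v in neighbours[u])
        total += 2 * links / (k * (k - 1))
    return total / len(nodes)


def null_clustering(n_nodes, n_edges, samples=200, seed=0):
    """Mean clustering of random graphs with the same node and edge counts."""
    rng = random.Random(seed)
    nodes = list(range(n_nodes))
    possible = [frozenset(p) for p in combinations(nodes, 2)]
    values = [average_clustering(nodes, rng.sample(possible, n_edges))
              for _ in range(samples)]
    return sum(values) / len(values)


session = ["A", "B", "C", "A", "D"]  # hypothetical reading session
nodes = sorted(set(session))
edges = session_to_edges(session)
observed = average_clustering(nodes, edges)
expected = null_clustering(len(nodes), len(edges))
```

Comparing `observed` with `expected` (for example as a difference or a ratio) gives a quantity that is less sensitive to network size than the raw clustering value.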

Update week 2022-03-14:

  • collaborators have access to stat-machines. We spent some time this week understanding how to work with them: setting up virtual environments, installing the necessary packages (especially those related to network analysis), and starting to analyze sample data
  • we are blocked on ldap-access for collaborators required for using jupyter-notebooks (T303702 , T303703)
  • worked together with collaborators to write down and agree on the details of our analysis plan, so that we know the exact steps for generating and processing knowledge networks

Thanks for the update. One point below:

  • we are blocked on ldap-access for collaborators required for using jupyter-notebooks (T303702 , T303703)

If this is not resolved by Wednesday 8:00 PST, please ping me/Miriam and we will add it to the stand-up material for that day (we need to report what we're blocked on every week).

  • We solved access for collaborators to stat-machines, including use of jupyter notebooks.
  • We discussed the analysis steps to generate and analyze knowledge networks from an existing set of reading sessions, which we will begin next week. We agreed that the following information will likely be relevant to capture when generating the networks: i) existing hyperlinks, ii) transitions by the reader, iii) similarity between all pairs of nodes based on text (weighted, undirected). We agreed on a set of network metrics (such as clustering) to characterize those networks. We will compare the metrics with several null models based on random walks to have a reference point that takes into account variation in, e.g., the size of the networks.
  • I started to extend the analysis of reader interactions with talk-pages and version-history beyond enwiki to other languages. The motivation is that the placement of the corresponding buttons differs across languages (e.g. the mobile-version talk-page button is at the top in enwiki but at the bottom of the article in eswiki), and it will be interesting to see how this affects click-through rates. The challenge was to get the language-specific namespaces and aliases (e.g. mobile version-history calls in enwiki use “Special:History/<page-title>” vs “Especial:History” in eswiki; see code to automatically extract for all languages) and to filter the webrequest logs in each language for version-history and talk-page calls from readers. I managed to extract the data for all languages and plan to analyze a few languages next week to compare the corresponding click-through rates.
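
The filtering described above can be sketched as a small classifier over request URLs. This is an illustration under assumptions, not the extraction code referred to in the update: the alias table is hand-written (it uses the prefixes quoted above plus the "Talk" and "Discusión" namespace names), the function name is hypothetical, and the check for "feed=rss" mirrors the feed-reader filter from the 2022-02-14 update.

```python
from urllib.parse import parse_qs, unquote, urlsplit

# Illustrative alias table; in practice these values would be extracted
# automatically for every language.
ALIASES = {
    "enwiki": {"history": ("Special:History",), "talk": ("Talk:",)},
    "eswiki": {"history": ("Especial:History",), "talk": ("Discusión:",)},
}


def classify_request(wiki, url):
    """Return 'version-history', 'talk-page', or None for a request URL."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    # Drop seemingly automated feed-reader requests.
    if "rss" in query.get("feed", []):
        return None
    # The page title is either in the path (/wiki/<title>) or in ?title=.
    if parts.path.startswith("/wiki/"):
        title = unquote(parts.path[len("/wiki/"):])
    else:
        title = query.get("title", [""])[0]
    title = title.replace(" ", "_")
    if "history" in query.get("action", []):
        return "version-history"
    aliases = ALIASES[wiki]
    if any(title.startswith(prefix + "/") for prefix in aliases["history"]):
        return "version-history"
    if any(title.startswith(prefix) for prefix in aliases["talk"]):
        return "talk-page"
    return None


print(classify_request("enwiki", "https://en.m.wikipedia.org/wiki/Special:History/Earth"))
# -> version-history
```

Keeping the aliases in a per-wiki table means the same function can be applied to any language once its namespace names are known.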

Weekly update

  • I wrote up a summary of my first exploratory analysis on how readers interact with an article's version-history and talk-page, respectively. This is a first step towards better understanding how readers on Wikipedia critically engage with the information they find. The motivation is that many teaching guides on how to assess the quality and trustworthiness of information in the context of Wikipedia recommend checking the version-history and talk-pages to see how the information was created and what the contributors' perspectives are. This analysis provides some answers to how much readers actually use these pages and for what types of articles.
  • In short: There is substantial engagement with version-history and talk-pages by readers, not only by editors but also by readers who do not edit. In some cases (specifically on desktop) engagement is around 1:1000 pageviews, which is almost as high as engagement with citations (Piccardi et al.). The results suggest that the position of the corresponding button has a big influence on engagement. For example, on mobile the talk-page button is always at the top in enwiki, but in other wikis it is only visible to logged-in users: engagement with talk-pages in enwiki is even higher on mobile than on desktop, whereas in other wikis it is almost non-existent. There is some indication that the existence of reliability issues in articles (flagged via templates) correlates with engagement with version-history and talk-pages.
  • Link to the meta-page with the details: https://meta.wikimedia.org/wiki/Research:Understanding_Curious_and_Critical_Readers/Reader_interactions_with_talk-pages_and_version-history