
Reference Quality in English Wikipedia
Closed, Resolved, Public


In this ticket we are going to report our work on "Reference Quality in English Wikipedia".


Other Assignee

Event Timeline


  • Last week we completed the onboarding on using PySpark and the cluster data.
  • @Aitolkyn is starting to explore how to match bad references with pageviews.

@Aitolkyn: please write weekly updates on Fridays

Hi @diego. Could you please associate at least one active project with this task (via the Add Action...Change Project Tags dropdown)? This will allow others to get notified, or see this task when searching via projects. Thanks!



11/04 ~ 15/04:

  • perennial source list references lifespan and pageviews
  • continued exploring PySpark

18/04 ~ 22/04:

  • extract and analyze data from the pageviews table
  • check the extracted results with the PageviewsAPI
  • check PageviewsAPI
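The cross-check against the Pageviews API above can be sketched by building requests for the public Wikimedia Pageviews REST endpoint. This is only a minimal illustration: the article title and date range below are placeholders, not the ones used in the analysis.

```python
# Build a request URL for the public Wikimedia Pageviews REST API
# (per-article daily counts). Article and dates are placeholders.

def pageviews_url(project, article, start, end,
                  access="all-access", agent="all-agents",
                  granularity="daily"):
    """Return the REST v1 per-article pageviews endpoint URL."""
    base = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
    return f"{base}/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}"

url = pageviews_url("en.wikipedia", "Python_(programming_language)",
                    "20220401", "20220430")
print(url)
```

Fetching this URL (e.g. with `requests.get`) returns a JSON body with one `views` count per day, which can then be compared against the counts aggregated from the `pageview_hourly` table.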

25/04 ~ 29/04:

  • extract monthly top-viewed pages and get the pages' revisions at that time
  • collect reference quality scores for the revisions of top-viewed pages
  • aggregate pageviews for the references from perennial source list
  • prepare presentation

02/05 ~ 06/05:

  • re-check the pageviews data for pages in multiple namespaces (namespace_id was added to pageviews_hourly in 2017)
  • analysis of the reference quality of the most viewed pages' revisions
  • get pages and collect revision data for two datasets: random and top-viewed

09/05 ~ 13/05:

  • collect reference risk scores for the 2 datasets: random and top2021
  • start collecting all the missing reference need scores
  • pageviews for 'bad' sources before and after they are classified as 'bad'
  • analyze the data collected so far (significance tests, distributions, plots)
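The significance tests mentioned above are not shown in the ticket; one stdlib-only way to sketch such a comparison between two score samples is a two-sided permutation test on the difference in means. The numbers below are synthetic, purely for illustration.

```python
import random

def perm_test(a, b, n_iter=2000, seed=0):
    """Two-sided permutation test for the difference in means of a and b."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        # Count permutations at least as extreme as the observed difference.
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    return hits / n_iter

# Toy example: two clearly separated samples should yield a small p-value.
p = perm_test([1.0, 1.2, 0.9, 1.1, 1.05], [2.0, 2.1, 1.9, 2.2, 2.05])
print(p)
```

In practice one would use a standard test (e.g. Mann-Whitney U from SciPy) on the real score distributions; the permutation version just keeps the sketch dependency-free.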

FYI in case it's useful, here's some code I was using for extracting cite templates on English Wikipedia and joining it with country data inferred based on URLs/publishers:

It looks very useful, thank you very much!! I'll check this out

16/05 ~ 20/05:

  • growth of references in each category of perennial source list
  • comparison of the random and top10 ref. quality evolution
  • read Nature papers: paper1, paper2 --> good visualizations

23/05 ~ 27/05:

  • transfer the collection of reference need scores to the server (API calls were taking too long)
  • analysis of active contributors who add perennial sources
  • complete the analysis of the evolution of references in the perennial source list (the recent trend is more positive compared to earlier data)

30/05 ~ 03/06:

  • search for external lists of unreliable sources (e.g. Melissa Zimdars' fake news websites list)
  • compute the coverage of Wikipedia by the external lists (1. Zimdars and 2. Daily Dot) and compare with the perennial source list
  • set up the environment for the citation quality score collection
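The coverage computation above reduces to set overlap on normalized domains. A minimal sketch, with made-up example domains standing in for the actual lists:

```python
def normalize(domain):
    """Lowercase a domain and strip a leading 'www.' so variants compare equal."""
    d = domain.strip().lower()
    return d[4:] if d.startswith("www.") else d

def coverage(reference_list, external_list):
    """Fraction of the reference list that also appears in the external list."""
    ref = {normalize(d) for d in reference_list}
    ext = {normalize(d) for d in external_list}
    return len(ref & ext) / len(ref)

# Made-up domains for illustration only (not the real list contents).
perennial = ["example-news.com", "www.unreliable-site.net", "blog.example.org"]
external  = ["unreliable-site.net", "other-site.com"]
print(coverage(perennial, external))  # 1 of 3 domains overlap
```

The same `ref & ext` intersections extend naturally to three lists, which is the basis of the intersection visualization in the following week's update.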

06/06 ~ 10/06:

  • add one more external source list (3. Snopes)
  • visualize the intersection of the 3 lists and coverage of #3
  • start collecting citation quality scores for the top2021 dataset

13/06 ~ 17/06:

  • classify users into exposed and non-exposed in the new random & top datasets
  • PSM (propensity score matching) on collaboration between experts and non-experts on the new random and top datasets
  • topic coverage of unreliable source lists (including perennials)
  • finish collecting citation quality scores for the random dataset
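The propensity score matching step above typically pairs each exposed user with the non-exposed user whose propensity score is closest. A stdlib-only sketch of greedy 1:1 nearest-neighbour matching on precomputed scores; the IDs and scores below are placeholders, not outputs of the study's actual model:

```python
def match_nearest(treated, controls):
    """Greedy 1:1 nearest-neighbour matching on propensity scores.

    treated, controls: lists of (unit_id, propensity_score) tuples.
    Returns a list of (treated_id, control_id) pairs; each control is used once.
    """
    available = dict(controls)  # control id -> score
    pairs = []
    # Match the highest-score treated units first (a common heuristic,
    # since they have the fewest good matches available).
    for t_id, t_score in sorted(treated, key=lambda x: -x[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        pairs.append((t_id, c_id))
        del available[c_id]
    return pairs

# Placeholder users and scores for illustration.
treated  = [("u1", 0.80), ("u2", 0.35)]
controls = [("c1", 0.78), ("c2", 0.40), ("c3", 0.10)]
print(match_nearest(treated, controls))
```

Real analyses usually add a caliper (a maximum allowed score distance) and estimate the scores with a logistic regression on user covariates; both are omitted here to keep the sketch short.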

20/06 ~ 24/06:

  • extract user lifespan & analyze the lifespan of users vs. collaboration with experts
  • manually check pages containing sources from external fake websites lists
  • get dominant sources on wiki from the external lists
  • finish collecting citation quality scores for the top dataset


  • We are currently writing up the results.
diego renamed this task from Reference Quality in English Wikipedia / Internship to Reference Quality in English Wikipedia.Sep 10 2022, 3:49 PM
diego claimed this task.
diego updated Other Assignee, added: Aitolkyn.
diego updated the task description.


  • We have received the reviews from WWW and submitted the rebuttal. Now we need to wait for the final decision.


  • The final decision should be out this week.


  • Our paper was conditionally accepted at TheWebConf'23 (a.k.a. WWW'23).


  • We have submitted the camera-ready version of the paper.
  • We have started working on evaluating references in other languages.


  • The paper was officially accepted at WWW'23. We made some final updates to the text.

@Aitolkyn please upload the final version to arxiv and add the link here.


We are open to any feedback and comments!