In this ticket we report our work on "Reference Quality in English Wikipedia".
Description
Details
- Other Assignee: Aitolkyn
Related Objects
- Mentioned In
- T286923: Source geoprovenance: scope work
Event Timeline
11/04 ~ 15/04:
- analyze the lifespan and pageviews of references from the perennial source list
- continue exploring PySpark (see the sketch below)
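A minimal PySpark sketch of the kind of pageview aggregation explored here, assuming access to the Data Lake table wmf.pageview_hourly; the app name and the year filter are illustrative.

```python
# Sketch: aggregate monthly English Wikipedia pageviews per page with PySpark.
# Assumes a Spark session with access to the Data Lake's wmf.pageview_hourly table;
# the 2021 filter is just an example.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ref-quality-pageviews").getOrCreate()

monthly_views = (
    spark.table("wmf.pageview_hourly")
    .filter((F.col("project") == "en.wikipedia") & (F.col("year") == 2021))
    .groupBy("page_title", "year", "month")
    .agg(F.sum("view_count").alias("views"))
)
monthly_views.show(10)
```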
18/04 ~ 22/04:
- extract and analyze data from the pageviews table
- cross-check the extracted results against the Pageviews API (sketch below)
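For the cross-check, a small sketch against the public Wikimedia Pageviews REST API (per-article endpoint); the article title, date range, and User-Agent string are example values.

```python
# Sketch: fetch daily pageviews for one article from the Pageviews REST API,
# to compare against the counts extracted from the pageviews table.
import requests

HEADERS = {"User-Agent": "reference-quality-research (example contact)"}

def daily_views(title, start, end, project="en.wikipedia"):
    url = (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        f"{project}/all-access/all-agents/{title}/daily/{start}/{end}"
    )
    resp = requests.get(url, headers=HEADERS)
    resp.raise_for_status()
    return {item["timestamp"]: item["views"] for item in resp.json()["items"]}

print(daily_views("COVID-19_pandemic", "2021040100", "2021043000"))
```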
25/04 ~ 29/04
- extract the monthly top-viewed pages and get those pages' revisions at that time (sketch below)
- collect reference quality scores for the revisions of top-viewed pages
- aggregate pageviews for the references from perennial source list
- prepare presentation
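A sketch of how the monthly top-viewed pages can be pulled from the Pageviews REST API "top" endpoint ("all-days" aggregates the whole month); the year, month, and result limit are example values.

```python
# Sketch: get the monthly top-viewed English Wikipedia pages from the Pageviews API.
import requests

HEADERS = {"User-Agent": "reference-quality-research (example contact)"}

def monthly_top(year, month, project="en.wikipedia", limit=50):
    url = (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/"
        f"{project}/all-access/{year}/{month:02d}/all-days"
    )
    resp = requests.get(url, headers=HEADERS)
    resp.raise_for_status()
    articles = resp.json()["items"][0]["articles"]
    return [(a["article"], a["views"]) for a in articles[:limit]]

print(monthly_top(2021, 4)[:10])
```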
02/05 ~ 06/05:
- re-check the pageviews data for pages in multiple namespaces (namespace_id was only added to pageview_hourly in 2017; sketch below)
- analysis of the reference quality of the most viewed pages' revisions
- get pages and collect revision data for two datasets: random and top-viewed
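A sketch of the namespace re-check, assuming the standard wmf.pageview_hourly schema; the year filter is illustrative.

```python
# Sketch: restrict pageview aggregation to the article namespace (namespace_id = 0).
# namespace_id is only populated in wmf.pageview_hourly from 2017 onwards, which is
# why earlier extractions had to be re-checked.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

article_views = (
    spark.table("wmf.pageview_hourly")
    .filter(
        (F.col("project") == "en.wikipedia")
        & (F.col("namespace_id") == 0)
        & (F.col("year") == 2021)
    )
    .groupBy("page_id", "page_title")
    .agg(F.sum("view_count").alias("views"))
)
article_views.show(10)
```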
09/05 ~ 13/05
- collect reference risk scores for the 2 datasets: random and top2021
- start collecting all the missing reference need scores
- compare pageviews for 'bad' sources before and after they were classified as 'bad'
- analyze the data collected so far: significance tests, distributions, plots (sketch below)
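As one example of the significance tests, a sketch comparing score distributions between the two datasets with a Mann-Whitney U test; the CSV paths and the reference_quality column name are placeholders, not the actual files.

```python
# Sketch: non-parametric comparison of reference-quality scores in the
# random vs. top2021 samples.
import pandas as pd
from scipy.stats import mannwhitneyu

random_scores = pd.read_csv("random_sample_scores.csv")["reference_quality"]
top_scores = pd.read_csv("top2021_scores.csv")["reference_quality"]

stat, p_value = mannwhitneyu(random_scores, top_scores, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.4g}")
```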
FYI in case it's useful, here's some code I was using for extracting cite templates on English Wikipedia and joining it with country data inferred based on URLs/publishers: https://gitlab.wikimedia.org/isaacj/miscellaneous-wikimedia/-/blob/master/references/ref_extraction.ipynb
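For readers who just want the core idea, a much-simplified sketch of extracting cite templates from wikitext with mwparserfromhell; the linked notebook is the full version, including the join with URL/publisher country data.

```python
# Much-simplified sketch: pull citation templates (and their url parameter,
# if present) out of one revision's wikitext.
import mwparserfromhell

def extract_cite_templates(wikitext):
    """Return (template name, url) pairs for cite-style templates in one revision."""
    cites = []
    for template in mwparserfromhell.parse(wikitext).filter_templates():
        name = str(template.name).strip().lower()
        if name.startswith("cite") or name == "citation":
            url = str(template.get("url").value).strip() if template.has("url") else None
            cites.append((name, url))
    return cites

print(extract_cite_templates("{{cite web |url=https://example.com |title=Example}}"))
```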
23/05 ~ 27/05
- move the collection of reference need scores to the server (collecting them via the API took too long)
- add an analysis of active users who contribute sources from the perennial source list
- complete the analysis of the evolution of references from the perennial source list (the recent trend is more positive than in earlier data)
30/05 ~ 03/06
- search for external lists of unreliable sources (e.g., Melissa Zimdars' fake news websites list)
- compute the coverage of Wikipedia by the external lists (1. Zimdars, 2. Daily Dot) and compare with the perennial source list (sketch below)
- set up the environment for collecting citation quality scores
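A toy sketch of the coverage computation: intersect the domains cited on enwiki with each external list. The domain sets below are placeholders for the lists loaded elsewhere, and "coverage" is taken here as the fraction of a list's domains that are actually cited; the real analysis may define it differently.

```python
# Sketch: overlap between wiki-cited domains and external unreliable-source lists.
wiki_domains = {"example.com", "fakenews.example", "reliable.example"}  # cited on enwiki
zimdars = {"fakenews.example", "clickbait.example"}                     # Zimdars list
daily_dot = {"fakenews.example"}                                        # Daily Dot list
perennial = {"fakenews.example", "deprecated.example"}                  # perennial source list

def coverage(external, cited=wiki_domains):
    """Fraction of the external list's domains that appear among cited domains."""
    return len(external & cited) / len(external)

for name, domains in [("Zimdars", zimdars), ("Daily Dot", daily_dot), ("Perennial", perennial)]:
    print(f"{name}: {coverage(domains):.0%} of listed domains are cited on enwiki")
```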
06/06 ~ 10/06:
- add one more external source list (3. Snopes)
- visualize the intersection of the three lists and the coverage of #3 (sketch below)
- start collecting citation quality scores for the top2021 dataset
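A possible sketch of the three-list intersection visualization using matplotlib-venn; the sets below are placeholder domains standing in for the loaded lists.

```python
# Sketch: Venn diagram of the three external unreliable-source lists.
import matplotlib.pyplot as plt
from matplotlib_venn import venn3

zimdars = {"fakenews.example", "clickbait.example"}
daily_dot = {"fakenews.example", "hoax.example"}
snopes = {"hoax.example", "satire.example"}

venn3([zimdars, daily_dot, snopes], set_labels=("Zimdars", "Daily Dot", "Snopes"))
plt.title("Overlap of external unreliable-source lists")
plt.show()
```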
13/06 ~ 17/06:
- classify users into exposed and non-exposed in the new random & top datasets
- propensity score matching (PSM) on collaboration between experts and non-experts in the new random and top datasets (sketch below)
- topic coverage of unreliable source lists (including perennials)
- finish collecting citation quality scores for the random dataset
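A rough sketch of the PSM step (logistic-regression propensity scores plus 1-nearest-neighbour matching with scikit-learn); the users.csv file and its columns (exposed, edit_count, account_age_days, outcome) are illustrative placeholders, not the actual feature set.

```python
# Sketch: propensity score matching of exposed vs. non-exposed users.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

users = pd.read_csv("users.csv")
covariates = ["edit_count", "account_age_days"]

# 1. Propensity scores from a logistic regression on the covariates.
model = LogisticRegression(max_iter=1000).fit(users[covariates], users["exposed"])
users["propensity"] = model.predict_proba(users[covariates])[:, 1]

# 2. Match each exposed user to the nearest non-exposed user by propensity.
treated = users[users["exposed"] == 1]
control = users[users["exposed"] == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control[["propensity"]])
_, idx = nn.kneighbors(treated[["propensity"]])
matched_control = control.iloc[idx.ravel()]

# 3. Compare outcomes between the matched groups.
print(treated["outcome"].mean() - matched_control["outcome"].mean())
```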
20/06 ~ 24/06:
- extract user lifespan & analyze the lifespan of users vs. collaboration with experts (sketch below)
- manually check pages containing sources from the external fake-news website lists
- get dominant sources on wiki from the external lists
- finish collecting citation quality scores for the top dataset
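A sketch of the lifespan extraction, assuming a per-edit table with user, timestamp, and a flag for collaboration with experts; the file and column names are placeholders.

```python
# Sketch: lifespan = time between a user's first and last edit, split by whether
# the user ever collaborated with an expert.
import pandas as pd

edits = pd.read_csv("edits.csv", parse_dates=["timestamp"])

lifespan = (
    edits.groupby("user")["timestamp"]
    .agg(first="min", last="max")
    .assign(lifespan_days=lambda d: (d["last"] - d["first"]).dt.days)
)

flags = edits.groupby("user")["collaborated_with_expert"].max()
print(lifespan.join(flags).groupby("collaborated_with_expert")["lifespan_days"].describe())
```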
Updates
- We have received the reviews from WWW and submitted the rebuttal. Now we need to wait for the final decision.
Updates
- We have submitted the camera-ready version of this paper.
- We have started working on evaluating references in other languages.