Investigation: Wikipedia citations research from Natallia Kokash and Giovanni Colavizza
Closed, Declined · Public · Spike

Description

As a PM and Eng, I'd like to investigate the research and codebase from Natallia Kokash and Giovanni Colavizza, so we can evaluate whether they are useful for our reference parsing and quality work.

Info
Background

The abstract of the paper gives a good overview of this project:
"Wikipedia Citations is a project that focuses on extracting and releasing comprehensive datasets of citations from Wikipedia. A total of 29.3 million citations were extracted from English Wikipedia in May 2020. Following this one-off research project, we designed a reproducible pipeline that can process any given Wikipedia dump in the cloud-based settings. To demonstrate its usability, we extracted 40.6 million citations in February 2023 and 44.7 million citations in February 2024. Furthermore, we equipped the pipeline with an adapted Wikipedia citation template translation module to process multilingual Wikipedia articles in 15 European languages so that they are parsed and mapped into a generic structured citation template. This paper presents our open-source software pipeline to retrieve, classify, and disambiguate citations on demand from a given Wikipedia dump."
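
To make the "parsed and mapped into a generic structured citation template" step concrete for our evaluation, here is a minimal sketch of the idea. It is not the authors' implementation: the regex, the `parse_citation` helper, and the `PARAM_MAP` entries (German parameter names mapped to generic ones) are all assumptions chosen for illustration, and it only handles flat templates without nested templates or piped links.

```python
import re

# Hypothetical mapping from language-specific parameter names to generic
# ones. Illustrative only, not taken from the authors' pipeline.
PARAM_MAP = {
    "autor": "author",
    "titel": "title",
    "jahr": "year",
}

# Matches one flat template such as {{Name |key=value |key=value}}.
TEMPLATE_RE = re.compile(r"\{\{\s*([^|{}]+?)\s*\|([^{}]*)\}\}")


def parse_citation(wikitext):
    """Return (template_name, params) for the first flat template, or None."""
    match = TEMPLATE_RE.search(wikitext)
    if match is None:
        return None
    name = match.group(1)
    params = {}
    for part in match.group(2).split("|"):
        if "=" not in part:
            continue  # positional arguments are skipped in this sketch
        key, value = part.split("=", 1)
        key = key.strip().lower()
        # Translate known parameter names; keep unknown ones unchanged.
        params[PARAM_MAP.get(key, key)] = value.strip()
    return name, params


if __name__ == "__main__":
    sample = "{{Literatur |Autor=A. Muster |Titel=Beispiel |Jahr=2020}}"
    print(parse_citation(sample))
```

A real pipeline would need a proper wikitext parser to cope with nested templates, links containing pipes, and comments; the sketch is only meant to show the shape of the mapping we would be evaluating.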

To Do
  • Read the article and look at both the dataset and the codebase to get a good understanding of their work
  • Take notes on interesting findings for our planned reference parsing and quality work
  • Set up a sync with the PMs (Francisco and Stephanie) to discuss the findings