As a PM and Eng, I'd like to investigate the research and codebase from Natallia Kokash and Giovanni Colavizza, so that we can evaluate their use for our reference parsing and quality work.
Info
- Article: "Wikipedia Citations: Reproducible Citation Extraction from Multilingual Wikipedia" https://arxiv.org/abs/2406.19291
- Code: https://github.com/albatros13/wikicite
- Dataset: https://zenodo.org/records/11210434
- The repo includes a list of the top English reference templates (https://github.com/albatros13/wikicite/blob/master/data/top300_templates.csv) as well as templates per language (listed in the `__init__.py` file for each language, e.g. https://github.com/albatros13/wikicite/blob/multilang/libraries/wikiciteparser/wikiciteparser/ca/__init__.py)
- References PRD in Drive (Product > Machine Readability > Epics > References)
Background
The abstract of the paper gives a good overview of the project:
"Wikipedia Citations is a project that focuses on extracting and releasing comprehensive datasets of citations from Wikipedia. A total of 29.3 million citations were extracted from English Wikipedia in May 2020. Following this one-off research project, we designed a reproducible pipeline that can process any given Wikipedia dump in the cloud-based settings. To demonstrate its usability, we extracted 40.6 million citations in February 2023 and 44.7 million citations in February 2024. Furthermore, we equipped the pipeline with an adapted Wikipedia citation template translation module to process multilingual Wikipedia articles in 15 European languages so that they are parsed and mapped into a generic structured citation template. This paper presents our open-source software pipeline to retrieve, classify, and disambiguate citations on demand from a given Wikipedia dump."
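For intuition on what "parsed and mapped into a generic structured citation template" means, here is a minimal stdlib-only sketch of pulling citation templates out of raw wikitext and flattening their named parameters. This is illustrative only and is not the authors' pipeline, which relies on Wikipedia's Lua template definitions and the per-language template lists linked above; the function and regex names are made up for this example.

```python
import re

# Match simple, non-nested citation templates like {{cite web|...}}.
# (The real parser handles nesting, unnamed parameters, and per-language
# template names; this sketch deliberately ignores those cases.)
TEMPLATE_RE = re.compile(r"\{\{\s*(cite \w+)\s*\|([^{}]*)\}\}", re.IGNORECASE)

def extract_citations(wikitext):
    """Return a list of dicts, one per citation template found."""
    citations = []
    for match in TEMPLATE_RE.finditer(wikitext):
        template_name, body = match.groups()
        params = {}
        for part in body.split("|"):
            if "=" in part:
                key, value = part.split("=", 1)
                params[key.strip()] = value.strip()
        citations.append({"template": template_name.lower(), **params})
    return citations

sample = (
    "Text.<ref>{{cite journal|title=Wikipedia Citations|year=2020}}</ref> "
    "More text.<ref>{{cite web|url=https://example.org|title=Example}}</ref>"
)
print(extract_citations(sample))
```

Running this on the sample yields one dict per template, e.g. `{'template': 'cite journal', 'title': 'Wikipedia Citations', 'year': '2020'}`, which is the kind of generic structured record the pipeline produces at scale.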
To Do
- Read the article and look at both the dataset and the codebase to get a good understanding of their work
- Note interesting findings relevant to our planned reference parsing and quality work
- Set up sync with PMs (Francisco and Stephanie) to discuss findings