
Run bulk analysis of readability scores on different Wikipedias
Closed, Resolved · Public

Description

We developed and evaluated a language-agnostic model to assign readability scores to Wikipedia articles in T299091.

Here, we want to run a bulk analysis of readability scores for all articles in the supported Wikipedias. The aim is to understand the variation in readability within and across wikis.

Event Timeline

weekly update:

  • Calculated readability scores for a random subset of 1000 articles in 17 wikis (cawiki, dawiki, dewiki, enwiki/simplewiki, eswiki, fiwiki, frwiki, huwiki, itwiki, nlwiki, nowiki, ptwiki, rowiki, ruwiki, svwiki, trwiki)
    • Ideally, we would like to get readability scores for every article in a project. However, calculating the language-agnostic scores requires calling the DBPedia-spotlight API to get the language-agnostic features, which is time-consuming. Therefore, in a first iteration, we only score a representative sample (see the sketch of such an API call below)
    • Similarly, we focus on these 17 wikis since their languages are the ones supported by the DBPedia-spotlight API
  • We see noticeable differences in the distribution of readability scores (low=easier to read/high=harder to read); specifically, simplewiki has much lower scores than enwiki

Screenshot from 2022-11-03 18-32-11.png (1×448 px, 80 KB)
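For reference, a minimal sketch of such an API call against the public DBPedia-spotlight endpoint (the request shape follows the public API; the example text, language, and confidence threshold are purely illustrative):

```
import requests

def annotate(text, lang="en", confidence=0.5):
    # call the public DBpedia Spotlight annotation endpoint for one text
    resp = requests.get(
        f"https://api.dbpedia-spotlight.org/{lang}/annotate",
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    # entity links live under "Resources" in the JSON response
    return resp.json().get("Resources", [])

entities = annotate("Barcelona is the capital of Catalonia.")
print([e["@URI"] for e in entities])
```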

weekly update:

  • created slightly adapted model which maps the readability score from the language-agnostic model to a grade level (roughly the number of years of education needed to understand a text) since that is more interpretable. with this we can plot the distribution over (a random subset of) articles of a wiki (a sketch of the mapping follows the screenshots below).

Screenshot from 2022-11-10 18-50-27.png (1×457 px, 78 KB)

  • with this we can compare the grade levels of i) articles: the number of years of education needed to understand the text; and ii) readers: the self-reported number of years of education from respondents to a demographics survey. as an example: enwiki

Screenshot from 2022-11-10 18-54-33.png (435×580 px, 39 KB)
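A minimal sketch of the grade-level mapping, assuming it is a simple linear regression fit on texts with known grade levels (the training data and variable names here are hypothetical, not the actual calibration data):

```
import numpy as np
from sklearn.linear_model import LinearRegression

# raw readability scores from the language-agnostic model (hypothetical values)
raw_scores = np.array([[-1.2], [0.1], [0.8], [1.9], [2.7]])
# known grade levels (years of education) for the same texts (hypothetical)
grade_levels = np.array([4, 7, 9, 12, 16])

# fit a linear map from raw score to grade level
mapper = LinearRegression().fit(raw_scores, grade_levels)
print(mapper.predict([[1.0]]))  # estimated grade level for a new raw score
```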

weekly update:

  • the current bottleneck in the pipeline is the call to the public API of dbpedia-spotlight. this is not scalable for scoring all articles of a dump.
  • instead, tested running a local instance of dbpedia-spotlight. this yields up to a 100-fold speedup, suggesting it is feasible to use this approach to score readability for all articles of a dump (see the sketch after this list)
  • next: pre-processing one dump and setting up a pipeline with the local instance of dbpedia-spotlight
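A minimal sketch of querying the local instance instead of the public API; the /rest/annotate path is the standard Spotlight REST endpoint, while the port (2222) is an assumption of this example:

```
import requests

def annotate_local(text, confidence=0.5, base="http://localhost:2222"):
    # same request shape as the public API, but against a local instance,
    # avoiding the public API's rate limits and network latency
    resp = requests.get(
        f"{base}/rest/annotate",
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("Resources", [])
```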

weekly update:

  • talked to folks from data engineering and got good suggestions on how to run the model on the stat machines
  • next: will try to implement those suggestions and/or get additional support

weekly update:

  • figured out how to run a local instance of dbpedia-spotlight on the stat machines
  • next step: build a pipeline to run the full model to get readability scores of all articles of a dump

weekly update:

  • refactoring code and re-training the model using the local instance of dbpedia-spotlight

weekly update:

  • re-training and evaluating the model using local instances of dbpedia-spotlight
  • as a next step we can run the model on all articles in the corresponding wikis using the local instance

weekly update:

  • set up a pipeline to run the language-agnostic model to get readability scores for all articles in a dump
  • however, I am putting the language-agnostic model on hold for now. with Mykola, we have finished the evaluation of an alternative language-dependent model (based on mBERT), adapting the methodology from the existing revert-risk model. the advantages of this model are that i) it significantly outperforms the language-agnostic model in all languages but one, ii) it supports many more languages than the language-agnostic model (which currently depends on dbpedia-spotlight, which supports only around 20 languages), and iii) it is a single multilingual model (in contrast, the language-agnostic model requires a separate dbpedia-spotlight model for entity linking in each language).
  • given the unexpectedly strong performance of the multilingual model, I am planning to replace the language-agnostic model (a sketch of the multilingual setup follows below)
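A minimal sketch of the multilingual setup, assuming an mBERT-style transformer with a single-output regression head via Hugging Face transformers; the checkpoint below is the generic mBERT base, not the actual trained readability model:

```
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
# num_labels=1 gives a single-output head usable as a regression score
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=1
)
model.eval()

text = "Barcelona is the capital of Catalonia."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    score = model(**inputs).logits.item()  # readability score for one article
print(score)
```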

weekly update:

  • Adapting new multilingual model to run scores on dumps; Mykola and I are working on a [[ https://gitlab.wikimedia.org/repos/research/readability/-/merge_requests/1 | merge request ]] with the inference script
  • Adding regression model to map the model's output to a more interpretable score in terms of expected number of years of formal education (similar to the Flesch-Kincaid score)
  • Mykola started working on an additional function to do batch prediction to facilitate running the model on all articles (see the sketch after this list)
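A minimal sketch of what such a batch-prediction function could look like, reusing the tokenizer/model from the sketch above; the batch size and max length are illustrative:

```
import torch

def predict_batch(texts, tokenizer, model, batch_size=32, device="cuda"):
    # score texts in batches to keep the GPU saturated instead of
    # running inference one article at a time
    model.to(device).eval()
    scores = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(
            texts[i : i + batch_size],
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512,
        ).to(device)
        with torch.no_grad():
            scores.extend(model(**batch).logits.squeeze(-1).tolist())
    return scores
```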

weekly update:

  • merged notebook with the basic inference script to predict readability scores for individual articles (MR)

weekly update:

  • merged script with batch prediction (MR)
  • conducted performance test for batch prediction using GPUs (around 4 hours for 1M articles, i.e. roughly 70 articles per second)
  • next: planning pipeline to run batch prediction for a dump of a single snapshot

weekly update:

  • started working on the pipeline for running the scripts for batch processing.
  • spent some time setting up the environment to make sure that scripts use GPUs
  • ran pipeline successfully on a subset of around 20K articles (with GPUs this takes only a few minutes)
  • next step is to work on a script to pre-process the dump-files so that the batch-inference script can easily be run for all articles of the respective dump

weekly update:

  • put together full pipeline that runs batch inference for all articles in the supported wikis
    • pre-processing dump files of relevant wikis (extract plain text, split into sentences; see the sketch after this list)
    • prepare data format required for batch inference on GPUs
    • collect results in a single file
  • running this for all 104 wikis takes a few days but should be ready to publish next week
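A minimal sketch of the pre-processing step, assuming wikitext input and mwparserfromhell for markup stripping; the regex sentence splitter is a naive stand-in for whatever splitter the real pipeline uses:

```
import re
import mwparserfromhell

def preprocess(wikitext):
    # strip templates, links, and other markup down to plain text
    plain = mwparserfromhell.parse(wikitext).strip_code()
    # naive sentence split on end-of-sentence punctuation;
    # a language-aware splitter is preferable in practice
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", plain) if s.strip()]
    return sentences
```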

weekly update: