
Run bulk analysis of readability scores on different Wikipedias
Closed, Resolved · Public

Description

We developed and evaluated a language-agnostic model to assign readability scores to Wikipedia articles in T299091.

Here, we want to run a bulk analysis of readability scores for all articles in the supported Wikipedias. The aim is to understand the variation in readability within and across wikis.

Event Timeline

weekly update:

  • Calculated readability scores for a random subset of 1000 articles in 17 wikis (cawiki, dawiki, dewiki, enwiki/simplewiki, eswiki, fiwiki, frwiki, huwiki, itwiki, nlwiki, nowiki, ptwiki, rowiki, ruwiki, svwiki, trwiki)
    • Ideally, we would like to get readability scores for every article in a project. However, calculating the language-agnostic scores requires calling the DBPedia-spotlight API to get the language-agnostic features, which is time-consuming. Therefore, in a first iteration, we only score a representative sample (see the sketch of such an API call below)
    • Similarly, we focus on these 17 wikis since their languages are the ones supported by the DBPedia-spotlight API
  • We see noticeable differences in the distribution of readability scores (low=easier to read/high=harder to read); specifically, simplewiki has much lower scores than enwiki

Screenshot from 2022-11-03 18-32-11.png (1×448 px, 80 KB)
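For reference, a minimal sketch of such an API call against the public DBPedia-spotlight endpoint (the request shape follows the public API; the example text, language, and confidence threshold are purely illustrative):

```
import requests

def annotate(text, lang="en", confidence=0.5):
    # call the public DBpedia Spotlight annotation endpoint for one text
    resp = requests.get(
        f"https://api.dbpedia-spotlight.org/{lang}/annotate",
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    # entity links live under "Resources" in the JSON response
    return resp.json().get("Resources", [])

entities = annotate("Barcelona is the capital of Catalonia.")
print([e["@URI"] for e in entities])
```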

weekly update:

  • created slightly adapted model which maps the readability score from the language-agnostic model to a grade level (roughly the number of years of education needed to understand a text) since that is more interpretable. with this we can plot the distribution over (a random subset of) articles of a wiki (a sketch of the mapping follows the screenshots below).

Screenshot from 2022-11-10 18-50-27.png (1×457 px, 78 KB)

  • with this we can compare the grade levels of i) articles: the number of years of education needed to understand the text; and ii) readers: the self-reported number of years of education from respondents to a demographics survey. as an example: enwiki

Screenshot from 2022-11-10 18-54-33.png (435×580 px, 39 KB)
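A minimal sketch of the grade-level mapping, assuming it is a simple linear regression fit on texts with known grade levels (the training data and variable names here are hypothetical, not the actual calibration data):

```
import numpy as np
from sklearn.linear_model import LinearRegression

# raw readability scores from the language-agnostic model (hypothetical values)
raw_scores = np.array([[-1.2], [0.1], [0.8], [1.9], [2.7]])
# known grade levels (years of education) for the same texts (hypothetical)
grade_levels = np.array([4, 7, 9, 12, 16])

# fit a linear map from raw score to grade level
mapper = LinearRegression().fit(raw_scores, grade_levels)
print(mapper.predict([[1.0]]))  # estimated grade level for a new raw score
```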

weekly update:

  • the current bottleneck in the pipeline is the call to the public API of dbpedia-spotlight. this is not scalable for scoring all articles of a dump.
  • instead, tested running a local instance of dbpedia-spotlight. this yields up to a 100-fold speedup, suggesting it is feasible to use this approach to score readability for all articles of a dump (see the sketch after this list)
  • next: pre-processing one dump and setting up a pipeline with the local instance of dbpedia-spotlight
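A minimal sketch of querying the local instance instead of the public API; the /rest/annotate path is the standard Spotlight REST endpoint, while the port (2222) is an assumption of this example:

```
import requests

def annotate_local(text, confidence=0.5, base="http://localhost:2222"):
    # same request shape as the public API, but against a local instance,
    # avoiding the public API's rate limits and network latency
    resp = requests.get(
        f"{base}/rest/annotate",
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("Resources", [])
```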

weekly update:

  • talked to folks from data engineering and got good suggestions on how to run the model on the stat machines
  • next: will try to implement those suggestions and/or get additional support

weekly update:

  • figured out how to run a local instance of dbpedia-spotlight on the stat machines
  • next step: build a pipeline to run the full model to get readability scores of all articles of a dump

weekly update:

  • refactoring code and re-training the model using the local instance of dbpedia-spotlight

weekly update:

  • re-training and evaluating the model using local instances of dbpedia-spotlight
  • as a next step we can run the model on all articles in the corresponding wikis using the local instance

weekly update:

  • set up a pipeline to run the language-agnostic model to get readability scores for all articles in a dump
  • however, I am putting the language-agnostic model on hold for now. with Mykola, we have finished the evaluation of an alternative language-dependent model (based on mBERT), adapting the methodology from the existing revert-risk model. the advantages of this model are that i) it significantly outperforms the language-agnostic model in all languages but one, ii) it supports many more languages than the language-agnostic model (which currently depends on dbpedia-spotlight, which supports only around 20 languages), and iii) it is a single multilingual model (in contrast, the language-agnostic model requires a separate dbpedia-spotlight model for entity linking in each language).
  • given the unexpectedly strong performance of the multilingual model, I am planning to replace the language-agnostic model (a sketch of the multilingual setup follows below)
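A minimal sketch of the multilingual setup, assuming an mBERT-style transformer with a single-output regression head via Hugging Face transformers; the checkpoint below is the generic mBERT base, not the actual trained readability model:

```
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
# num_labels=1 gives a single-output head usable as a regression score
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=1
)
model.eval()

text = "Barcelona is the capital of Catalonia."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    score = model(**inputs).logits.item()  # readability score for one article
print(score)
```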

weekly update:

  • Adapting new multilingual model to run scores on dumps; Mykola and I are working on a [[ https://gitlab.wikimedia.org/repos/research/readability/-/merge_requests/1 | merge request ]] with the inference script
  • Adding regression model to map the model's output to a more interpretable score in terms of expected number of years of formal education (similar to the Flesch-Kincaid score)
  • Mykola started working on an additional function to do batch prediction to facilitate running the model on all articles (see the sketch after this list)
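A minimal sketch of what such a batch-prediction function could look like, reusing the tokenizer/model from the sketch above; the batch size and max length are illustrative:

```
import torch

def predict_batch(texts, tokenizer, model, batch_size=32, device="cuda"):
    # score texts in batches to keep the GPU saturated instead of
    # running inference one article at a time
    model.to(device).eval()
    scores = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(
            texts[i : i + batch_size],
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512,
        ).to(device)
        with torch.no_grad():
            scores.extend(model(**batch).logits.squeeze(-1).tolist())
    return scores
```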

weekly update:

  • merged notebook with the basic inference script to predict readability scores for individual articles (MR)

weekly update:

  • merged script with batch prediction (MR)
  • conducted performance test for batch prediction using GPUs (around 4 hours for 1M articles, i.e. roughly 70 articles per second)
  • next: planning pipeline to run batch prediction for a dump of a single snapshot

weekly update:

  • started working on the pipeline for running the scripts for batch processing.
  • spent some time setting up the environment to make sure that scripts use GPUs
  • ran pipeline successfully on a subset of around 20K articles (with GPUs this takes only a few minutes)
  • next step is to work on a script to pre-process the dump-files so that the batch-inference script can easily be run for all articles of the respective dump

weekly update:

  • put together full pipeline that runs batch inference for all articles in the supported wikis
    • pre-processing dump files of relevant wikis (extract plain text, split into sentences; see the sketch after this list)
    • prepare data format required for batch inference on GPUs
    • collect results in a single file
  • running this for all 104 wikis takes a few days but should be ready to publish next week
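A minimal sketch of the pre-processing step, assuming wikitext input and mwparserfromhell for markup stripping; the regex sentence splitter is a naive stand-in for whatever splitter the real pipeline uses:

```
import re
import mwparserfromhell

def preprocess(wikitext):
    # strip templates, links, and other markup down to plain text
    plain = mwparserfromhell.parse(wikitext).strip_code()
    # naive sentence split on end-of-sentence punctuation;
    # a language-aware splitter is preferable in practice
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", plain) if s.strip()]
    return sentences
```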

weekly update: