Initiate Multilingual Readability Research (Q1)
Closed, Resolved · Public

Description

Readability is one facet of the taxonomy of knowledge gaps. Currently, we do not know how to measure readability in Wikimedia Projects. Therefore, we want to initiate research on how to measure readability across languages.

As a first step in Q1, we will focus on:

  • Conduct a literature review on available methods, tools, and datasets for measuring readability
  • Identify potential collaborators who could contribute to this work

Event Timeline

Update week 2021-08-16:

Update week 2021-08-23:

  • continuing to read literature and organizing different approaches

Update week 2021-08-31:

Mainly organized my thoughts from reading 10-20 papers and summarized my findings around readability:

  • What is readability?
  • How can it be measured?
  • What are the different datasets?
  • Defined recommendations for next steps
  • Identified 3 candidate approaches to measure readability, in line with the approaches outlined in our discussion on multilingual language modeling
    • Language-specific: Traditional readability formulas are very common (such as the Flesch-Kincaid test) and seemingly straightforward to implement (Collins-Thompson 2014). However, this becomes extremely challenging when trying to scale across languages, because the formulas use syllable- or word-level features that require very language-specific parsing. In addition, the formulas often have to be customized to each language. I do not expect this to scale across languages, but merely to define a (weak) baseline (a minimal sketch follows this list).
    • Language-dependent: Multilingual language models (such as mBERT or XLM-R) represent sentences or paragraphs as a stream of subword units rather than individual words, which would require language-specific parsing. These models support 100+ languages. The features derived from these representations have been shown to be useful for capturing readability in individual languages (Martinc et al. 2021); a sketch of this feature-extraction pipeline also follows the list.
    • Language-agnostic: Represent text (sentences) as a sequence of entities (e.g. Wikidata items) rather than words, from which we can derive readability features. This has been shown to work well for English (Stajner & Hulpuș, 2020), and given the language-agnostic representation of the text it is promising to scale to other languages, provided that one has a good entity linker (this seems feasible given that DBpedia Spotlight exists in several languages and provides tooling to extend to other languages); a sketch of the entity-linking step also follows the list.
  • Frame the task as supervised classification where texts are labeled with different readability levels (e.g. Simple/English Wikipedia, Vikidia/Wikipedia, or elementary/intermediate/advanced in OneStopEnglish).
    • The main challenge of the task is that labeled data only exists for a few languages. Ideally, the language-agnostic/-dependent models will allow us to make accurate predictions for languages that were not explicitly part of the training data, similar to the approach used for language-agnostic topic prediction.
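
For concreteness, a minimal sketch of the language-specific baseline, assuming English text and a crude vowel-group syllable heuristic (real implementations rely on dictionaries or language-specific rules, which is exactly why this approach does not scale):

```python
import re

def count_syllables(word):
    # Crude vowel-group heuristic; real implementations use dictionaries or
    # language-specific rules, which is exactly why this approach does not scale.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    # Flesch reading ease (English): higher scores mean easier text.
    # 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return None
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * len(words) / len(sentences) - 84.6 * syllables / len(words)

print(flesch_reading_ease("The cat sat on the mat. It was warm."))
print(flesch_reading_ease("Quantification of textual comprehensibility necessitates operationalization."))
```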
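
A minimal sketch of the language-dependent approach, assuming the publicly available xlm-roberta-base checkpoint from Hugging Face transformers; mean-pooled encoder representations are used as features for a supervised readability classifier, and the toy texts/labels are made up for illustration:

```python
# Requires: pip install torch transformers scikit-learn
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")

def embed(texts):
    # One vector per text: mean-pool the encoder's last hidden state over
    # non-padding tokens. No language-specific parsing is involved.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state   # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Toy labels: 1 = "easy" (e.g. Simple English Wikipedia), 0 = "standard".
texts = ["The cat is a small animal that people keep at home.",
         "Felis catus is a domesticated felid exhibiting crepuscular activity patterns."]
labels = [1, 0]
clf = LogisticRegression(max_iter=1000).fit(embed(texts), labels)

# The encoder is multilingual, so the same pipeline applies to other languages;
# whether the classifier transfers across languages is the open question.
print(clf.predict(embed(["Le chat est un petit animal."])))
```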
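
A minimal sketch of the entity-linking step for the language-agnostic approach, assuming the public DBpedia Spotlight demo endpoint; the derived features (entity count and density) are only placeholders for whatever readability features we end up defining over the entity sequence:

```python
# Requires: pip install requests
import requests

SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"  # public demo endpoint

def link_entities(text, confidence=0.5):
    # Return the DBpedia URIs that Spotlight links in the text.
    resp = requests.get(SPOTLIGHT_URL,
                        params={"text": text, "confidence": confidence},
                        headers={"Accept": "application/json"},
                        timeout=30)
    resp.raise_for_status()
    return [r["@URI"] for r in resp.json().get("Resources", [])]

def entity_features(text):
    # Placeholder features derived from the entity sequence: number of linked
    # entities and entities per whitespace-delimited token.
    entities = link_entities(text)
    n_words = max(1, len(text.split()))
    return {"n_entities": len(entities), "entity_density": len(entities) / n_words}

print(entity_features("Barcelona is the capital of Catalonia in Spain."))
```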

Next steps:

  • document the project on a meta page
  • consider whether it makes sense to reach out to potential collaborators as the possible approaches become clearer
  • start exploratory analysis on i) the datasets and ii) the different representations outlined above.

Update week 2021-09-06:

Update week 2021-09-20:

  • I spent some time this week narrowing the scope of the work in this project. I wrote a research plan for the next 1-2 quarters outlining the specific steps for the analysis: https://docs.google.com/document/d/1mkaIo6cPG90-FvSu2FqLXRHaeu1VQI2ixsLTZOMSSQY/edit#
    • How to generate the datasets needed for training and evaluating the different approaches. Specifically, the same article should be available at two different levels of readability (e.g. Simple English Wikipedia and English Wikipedia); a sketch for collecting such pairs follows this update.
    • How to implement the two main approaches for deriving language-agnostic or multilingual features that capture readability.
    • An unsupervised and a supervised task to evaluate how well these features capture readability. We define different variations to assess performance in different languages, specifically for languages in which no training data is available. The latter is the most important aspect, since for most languages we will not be able to obtain texts with annotated readability labels; a sketch of a pairwise evaluation also follows this update.
  • Reached out to Mike Raish from the Design Research Team to discuss this project in the Design Research Clinic and to get additional feedback from them, especially given Mike's expertise in languages. They have already suggested an additional evaluation through human-derived readability judgements to validate the models.
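
A minimal sketch of how such article pairs could be collected, assuming the standard MediaWiki action API and the interlanguage links between Simple English Wikipedia and English Wikipedia; the research plan linked above may end up using a different pipeline:

```python
# Requires: pip install requests
import requests

def mw_query(lang, **params):
    # Minimal helper around the MediaWiki action API (action=query, JSON output).
    params.update({"action": "query", "format": "json"})
    resp = requests.get(f"https://{lang}.wikipedia.org/w/api.php", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["query"]["pages"]

def plain_text(lang, title):
    # Plain-text extract of an article (TextExtracts extension).
    pages = mw_query(lang, prop="extracts", explaintext=1, titles=title)
    return next(iter(pages.values())).get("extract", "")

def readability_pair(simple_title):
    # Pair a Simple English Wikipedia article with its English Wikipedia
    # counterpart, found via the interlanguage links.
    pages = mw_query("simple", prop="langlinks", lllang="en", titles=simple_title)
    links = next(iter(pages.values())).get("langlinks", [])
    if not links:
        return None
    return {"easy": plain_text("simple", simple_title),
            "standard": plain_text("en", links[0]["*"])}

pair = readability_pair("Cat")
```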
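
And a minimal sketch of the pairwise evaluation mentioned above: score both versions of each pair with any candidate approach and check how often the easier version receives the higher readability score:

```python
def pairwise_accuracy(score, pairs):
    # Share of (easy, standard) pairs for which the candidate score ranks the
    # easy version as more readable; `score` is any callable where a higher
    # value means "more readable".
    correct = sum(1 for easy, standard in pairs if score(easy) > score(standard))
    return correct / len(pairs)

# e.g. with the Flesch baseline sketched earlier and pairs built as shown above:
# print(pairwise_accuracy(flesch_reading_ease, [(pair["easy"], pair["standard"])]))
```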

Update week 2021-09-27:

  • Met with Michael Raish from Design Research to discuss the project idea in more detail:
    • we identified some important caveats when conducting the experiments; most importantly, we would have to account for variability in genre/topic (easy), technical language (maybe), and cultural differences (not sure how to account for this)
    • Mike convinced me that, if we were to develop a metric/score for readability that works across projects, we would also want to validate whether it matches readers' perception of readability. One possibility would be through some form of user testing. While this is not planned as part of the first steps, it would be an important external validation. Mike indicated support in the form of consulting on how to plan/conduct such an experiment in the future.
    • Mike recommended potential collaborators.

Thanks for your work and the update. A couple of comments/questions below.

Update week 2021-09-27:

  • we identified some important caveats when conducting the experiments; most importantly, we would have to account for variability in genre/topic (easy), technical language (maybe), and cultural differences (not sure how to account for this)

When you say cultural differences, do you mean the same thing as cultural backgrounds (as defined in Section 3.7.1)? If yes, sounds good. Stepping back, for every gap that we measure, at some point we need to answer how it will interact with other gaps (to be able to build a composite index). If no, can you expand on what you mean by cultural differences?

  • Mike convinced me that, if we were to develop a metric/score for readability that works across projects, we would also want to validate whether it matches readers' perception of readability. One possibility would be through some form of user testing. While this is not planned as part of the first steps, it would be an important external validation. Mike indicated support in the form of consulting on how to plan/conduct such an experiment in the future.

Is this because the literature does not have a metric that we can confidently pick up in the context of WP? If yes, @Miriam, please keep an eye on whether we should take a similar approach for all other readership gaps (especially those that we will collect data for through surveys).
Martin, when/if you design a set-up for gathering the feedback from readers, I'd like to take a look and provide feedback early on, in light of my involvement in the early research on Why We Read Wikipedia and some of the methods used in the first study to gather reader feedback.

Thanks for pointing out connections to the other gaps in the knowledge gaps taxonomy.
I do think this is in line with Cultural Background as defined in Section 3.7.1 (for example, "Studies on readers from emerging communities found some evidence that local socio-political constructs might impact the way in which Wikipedia’s credibility is perceived"). It is possible that readers' perception of "readability" depends on the context, and that different language communities may have different expectations of readability. For example, a common proxy for readability is the use of common words (and avoiding rare words). However, some language communities may prefer low-frequency vocabulary in formal texts and may thus find articles more readable precisely because they use rare vocabulary. Some previous work by Design Research found indications that Indonesian and Arabic readers had different collective perceptions of the quality and usefulness of comparable articles.
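
As an illustration of the word-frequency proxy mentioned above (not a metric we have committed to), a small sketch that computes the share of rare words in a text, assuming the third-party wordfreq package and an arbitrary Zipf-frequency cut-off:

```python
# Requires: pip install wordfreq
from wordfreq import zipf_frequency

def rare_word_ratio(text, lang, zipf_cutoff=4.0):
    # Share of tokens whose Zipf frequency is below the cut-off, i.e. a crude
    # "uses rare vocabulary" proxy. Zipf 4.0 is roughly one occurrence per
    # 100,000 words; the cut-off is an arbitrary choice for illustration.
    tokens = [t for t in text.lower().split() if t.isalpha()]
    if not tokens:
        return 0.0
    return sum(zipf_frequency(t, lang) < zipf_cutoff for t in tokens) / len(tokens)

# The same proxy can be computed for many languages, but whether a high
# rare-word ratio actually makes a text "harder to read" may itself differ
# across language communities, which is exactly the caveat discussed above.
print(rare_word_ratio("the cat sat on the mat", "en"))
print(rare_word_ratio("felis catus exhibits crepuscular behaviour", "en"))
```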

Currently, there is no agreed-upon score for multilingual readability in the context of Wikipedia. The suggested user testing would serve as an additional validation, assuming such a score is, in principle, feasible to construct, because:

  • readability is more subjective than other constructs (than, say, demographics)
  • while we would be able to evaluate the metric automatically in a handful of languages, this is not possible for most languages due to the lack of annotated datasets.

Getting your feedback on such a set-up would be extremely valuable, and I will keep you in the loop once we start to explore those options. At the moment, the task is centered around the question of whether it is, in principle, possible to construct such a metric, i.e. the focus is on technological challenges (which features can be extracted across different language versions) and automatic evaluation on annotated datasets (to gain initial confidence in the scores). Only if those steps are successful will it make sense to consider getting feedback from readers. I do not expect this to happen before Q4.

Thanks for your detailed response. Sounds good.