
Initiate Multilingual Readability Research (Q2)
Closed, Resolved; Public


In T288239 we identified two different multilingual approaches to measuring the readability of Wikipedia articles. For details see:

During Q2 the aims are:

  • generate multilingual datasets with annotated labels for training and evaluation
  • exploratory analysis of datasets and implementation of baseline measures
  • explore options to add collaborators or scope out subtasks as internship projects

Event Timeline

MGerlach moved this task from Staged to FY2021-22-Research-Oct-Dec on the Research board.
MGerlach removed a project: Epic.

Update week 2021-10-11:

  • wrote specifications for working on some subtasks as part of an internship (see this Google Doc) and shared them with Emily

Update week 2021-10-18:

  • discussed several potential internship candidates with the team.

Update week 2021-10-25:

  • screening for internship candidates and starting discussions

Update week 2021-11-04:

  • reaching out and discussing with potential interns

Update week 2021-11-15:

  • discussed with a potential intern. starting to formalize the contract.

Update 2021-12-13:

  • there are some administrative blockers to setting up the contract. trying to figure out with the intern and HR what can be done.

Update 2021-12-20:

  • completed pipeline to generate datasets (code available in )
    • simple-en-wiki: large but English-only. all articles common to simplewiki and enwiki, with their extracted text. the final dataset contains 184k articles.
    • viki-wiki: small but multilingual. a few thousand articles common to vikidia and wikipedia. the English dataset contains 3200 articles.
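
The pairing step behind both datasets can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline code: it assumes each source has already been reduced to a mapping from a shared article identifier to extracted text, and the names `ArticlePair`, `pair_common_articles`, `to_labeled_rows`, and `min_chars` are hypothetical. How the real pipeline matches articles across wikis is not stated in this task.

```python
from typing import Dict, List, NamedTuple, Tuple


class ArticlePair(NamedTuple):
    item_id: str        # shared identifier (assumption: some cross-wiki ID)
    simple_text: str    # text from the easier source (e.g. simplewiki, vikidia)
    standard_text: str  # text from the standard Wikipedia edition


def pair_common_articles(
    simple: Dict[str, str],
    standard: Dict[str, str],
    min_chars: int = 1,
) -> List[ArticlePair]:
    """Keep only articles present in both sources with non-empty text."""
    pairs = []
    # dict key views support set intersection; sort for a stable order
    for item_id in sorted(simple.keys() & standard.keys()):
        s = simple[item_id].strip()
        t = standard[item_id].strip()
        if len(s) >= min_chars and len(t) >= min_chars:
            pairs.append(ArticlePair(item_id, s, t))
    return pairs


def to_labeled_rows(pairs: List[ArticlePair]) -> List[Tuple[str, str, int]]:
    """Flatten pairs into (id, text, label) rows: 0 = simple, 1 = standard."""
    rows = []
    for p in pairs:
        rows.append((p.item_id, p.simple_text, 0))
        rows.append((p.item_id, p.standard_text, 1))
    return rows
```

Labeling by source (simple versus standard) is one plausible way to obtain annotated labels for training and evaluation without manual annotation; whether the datasets above use exactly this labeling is an assumption.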
MGerlach updated the task description.

Finalized generating the multilingual dataset (12 languages):

Moving exploratory analysis to next quarter's work.