Text complexity scoring
Open, Low, Public

Description

Design an AI that flags complicated articles or complicated sections of articles -- perhaps using reading grade level or some such.

Wiki things it helps with:

  • Makes our information more accessible to readers, who often don't understand what they're reading
  • Helps editors identify articles that could use a clearer intro/summary section, or that should perhaps be split into two articles (one simplified, one with all the complex science/math/etc.)

Things that might help us get this AI built:

  • Readability measures (LIX, Flesch–Kincaid, etc.; see the sketch below this list)
  • Lists of specialized terms not used by the general public
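
A minimal sketch of the two measures named above, assuming the raw counts are already available (in practice a library like textstat would compute them; the function names here are mine):

def flesch_kincaid_grade(words, sentences, syllables):
    # Flesch-Kincaid grade level: a weighted mix of average sentence
    # length and average syllables per word.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def lix(words, sentences, long_words):
    # LIX needs no syllable counting: average sentence length plus the
    # percentage of "long" words (more than six letters).
    return (words / sentences) + 100.0 * (long_words / words)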
Halfak created this task. Jan 20 2017, 6:55 PM

So I'm thinking that we could have a scorer that, given a rev_id, can return a complex JSON document of scores using a library like textstat.

Here's what I'd expect to get back from a query for a recent revision of the article :en:Waffle: https://ores.wikimedia.org/v2/scores/enwiki/fleschkincade/759009699

{
  "grade": 8.1,
  "words": 3073,
  "sections": [
    {"level": 0, "words": 203, "grade": 6.8},
    {"level": 1, "words": 231, "grade": 9.5},
    {"level": 1, "words": 150, "grade": 9.0},
    {"level": 2, "words": 350, "grade": 5.2},
    {"level": 2, "words": 275, "grade": 7.3},
    {"level": 2, "words": 150, "grade": 8.1},
    {"level": 2, "words": 83, "grade": 9.5},
    {"level": 1, "words": 102, "grade": 9.5},
    {"level": 2, "words": 299, "grade": 9.5},
    {"level": 2, "words": 324, "grade": 9.5},
    {"level": 1, "words": 97, "grade": 9.5},
    ...
  ]
}

This format uses an array for "sections", where the first item is section 0 (the lead) and subsequent items follow in document order.

This would be pretty easy to put together, I think.
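
A rough sketch of what that could look like, assuming mwparserfromhell for section splitting and textstat for the measures (the heading-level bookkeeping and the exact JSON shape are illustrative, not a settled design):

import mwparserfromhell
import textstat

def score_text(wikitext):
    parsed = mwparserfromhell.parse(wikitext)
    sections = []
    for section in parsed.get_sections(include_lead=True, flat=True):
        headings = section.filter_headings()
        # Treat the lead as level 0; "== Foo ==" headings become level 1.
        level = (headings[0].level - 1) if headings else 0
        text = section.strip_code()
        sections.append({
            "level": level,
            "words": textstat.lexicon_count(text),
            "grade": textstat.flesch_kincaid_grade(text),
        })
    full_text = parsed.strip_code()
    return {
        "grade": textstat.flesch_kincaid_grade(full_text),
        "words": textstat.lexicon_count(full_text),
        "sections": sections,
    }

The rev_id-to-wikitext fetch would come from the MediaWiki API; it's omitted here.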

Halfak triaged this task as Low priority. Jan 26 2017, 3:54 PM
DatGuy added a subscriber: DatGuy. Jan 26 2017, 11:42 PM

Is this really 'easy'?

Yeah, I think so. There's a well-defined library for doing it. We might set our thresholds differently for "Easy" on the Scoring-platform-team. If you'd like to pick it up, I'd be really happy to talk to you about it. FWIW, this would be immensely easier than implementing a stochastic prediction model.

Basvb added a subscriber: Basvb. Apr 19 2017, 3:05 PM
Basvb added a comment. Apr 19 2017, 3:08 PM

The textstat package looks like a good idea for English. For other languages it might be more difficult to use. Maybe overall word frequency within a language (its Wikipedia version) could be used to determine how complex the terms used in an article are (what percentage of the article's words are in the top 1,000 most frequent, the top 10,000, the top 100,000, and outside of that).
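
A language-agnostic sketch of that idea, assuming a frequency-ranked word list (e.g. derived from a Wikipedia dump) is available; the tokenization here is deliberately naive:

import re

def frequency_bands(text, ranked_words, bands=(1000, 10000, 100000)):
    # Map each word to its frequency rank (0 = most common).
    rank = {word: i for i, word in enumerate(ranked_words)}
    tokens = re.findall(r"\w+", text.lower())
    total = len(tokens) or 1
    shares = {}
    for band in bands:
        in_band = sum(1 for t in tokens if rank.get(t, float("inf")) < band)
        shares[f"top_{band}"] = in_band / total
    shares["outside"] = sum(
        1 for t in tokens if rank.get(t, float("inf")) >= max(bands)
    ) / total
    return shares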

Halfak added a subscriber: Tdcan. May 29 2017, 4:04 PM

I worked with @Tdcan at the Wikimedia Hackathon to build https://github.com/wiki-ai/flesch_complexity

We'll probably want to extend this to include other text readability scores before deploying it in production, but for now, you can test out the model: https://ores.wmflabs.org/v2/scores/enwiki/flesch
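
For anyone who wants to poke at it, something like this should work; it just prints the raw JSON rather than assuming a particular response shape:

import json
import requests

rev_id = 759009699  # the :en:Waffle revision from the example above
url = "https://ores.wmflabs.org/v2/scores/enwiki/flesch/%d" % rev_id
response = requests.get(url, timeout=30)
response.raise_for_status()
print(json.dumps(response.json(), indent=2))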

We should add more scores before we call this done.