
Text complexity scoring
Closed, Declined · Public


Design an AI that flags complicated articles or complicated sections of articles -- perhaps using reading grade level or some such.

Wiki thing it helps with:

  • Makes our information accessible to readers, who often don't understand what they're reading
  • Helps editors identify articles that need either clearer intro/summary sections, or perhaps two articles (one simplified, one with all the complex science/math/etc.)

Things that might help us get this AI built:

  • Readability measures (LIX, Flesch–Kincaid, etc.)
  • Lists of specialized terms not used by the general public

Event Timeline

So I'm thinking that we could have a scorer that, given a rev_id, can return a complex JSON document of scores using a library like textstat.
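To make concrete what such a scorer would compute, here is a minimal self-contained sketch using the published Flesch–Kincaid grade formula with a naive vowel-group syllable heuristic. In practice a library like textstat would handle syllable counting and sentence segmentation far more carefully; this is an illustration of the idea, not the proposed implementation.

```python
import re

def syllable_count(word: str) -> int:
    """Very rough syllable estimate: count groups of vowels (y included)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def score_text(text: str) -> dict:
    """Return a small score document for one revision's plain text.

    Flesch-Kincaid grade = 0.39*(words/sentences)
                           + 11.8*(syllables/words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(syllable_count(w) for w in words)
    grade = (0.39 * (len(words) / max(1, len(sentences)))
             + 11.8 * (syllables / max(1, len(words)))
             - 15.59)
    return {"grade": round(grade, 1), "words": len(words)}

print(score_text("The cat sat on the mat. It was happy."))
```

A real scorer would fetch the revision's wikitext by rev_id, strip markup, and run this per section as well as for the whole article.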

Here's what I'd expect to get back from a query for a recent revision of the article :en:Waffle.

  "grade": 8.1,
  "words": 3073,
  "sections": [
    {"level": 0, "words": 203, "grade": 6.8},
    {"level": 1, "words": 231, "grade": 9.5},
    {"level": 1, "words": 150, "grade": 9.0},
    {"level": 2, "words": 350, "grade": 5.2},
    {"level": 2, "words": 275, "grade": 7.3},
    {"level": 2, "words": 150, "grade": 8.1},
    {"level": 2, "words": 83, "grade": 9.5},
    {"level": 1, "words": 102, "grade": 9.5},
    {"level": 2, "words": 299, "grade": 9.5},
    {"level": 2, "words": 324, "grade": 9.5},
    {"level": 1, "words": 97, "grade": 9.5},

This format uses an array for "sections", where the first item is section 0 (the lead) and subsequent items follow document order.

This would be pretty easy to put together, I think.

Yeah. I think so. There's a well-defined library for doing it. We might set our thresholds differently for "Easy" on the #revision-scoring-as-a-service team. If you'd like to pick it up, I'd be really happy to talk to you about it. FWIW, this would be immensely easier than implementing a stochastic prediction model.

The textstat package looks like a good idea for English. For other languages it might be a bit more difficult to use. Maybe overall word frequency within a language's Wikipedia could be used to determine how complex an article's terms are: what percentage of the article falls in the top-1000 words, the top-10000, the top-100000, and outside of that.
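The frequency-tier idea above could be sketched as follows. The `rank` table here is a toy stand-in; a real implementation would load a frequency ranking derived from the target-language Wikipedia dump. Function and bucket names are illustrative, not an existing API.

```python
def frequency_tiers(tokens, rank):
    """Bucket an article's tokens by corpus frequency rank.

    `rank` maps a lowercase word to its frequency rank in the
    target-language corpus (1 = most common); unknown words are "rare".
    Returns the percentage of tokens in each bucket.
    """
    buckets = {"top_1k": 0, "top_10k": 0, "top_100k": 0, "rare": 0}
    for word in tokens:
        r = rank.get(word.lower())
        if r is None or r > 100_000:
            buckets["rare"] += 1
        elif r <= 1_000:
            buckets["top_1k"] += 1
        elif r <= 10_000:
            buckets["top_10k"] += 1
        else:
            buckets["top_100k"] += 1
    total = len(tokens) or 1
    return {k: round(100 * v / total, 1) for k, v in buckets.items()}

# Toy rank table standing in for real Wikipedia word-frequency data.
toy_rank = {"the": 1, "waffle": 9_500, "iron": 800, "viennoiserie": 250_000}
print(frequency_tiers(["The", "waffle", "iron", "viennoiserie"], toy_rank))
```

This sidesteps language-specific syllable counting entirely, which is what makes it attractive beyond English.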

I worked with @Tdcan at the Wikimedia Hackathon to build

We'll probably want to extend this to include other text readability scores before deploying in production, but for now, you can test out the model.

We should add more scores before we call this done.

Hello @Halfak. I am a new contributor to Wikimedia and would love to help out with this issue. What are the restrictions on the model to be used to detect complexity, and do you wish to have a regression-based model or a classifier?


Hello @Chtnnh! We've had a group of newcomers take this task on once before. We were able to get their work deployed for a period of time, but I think we need to do some more work to make text complexity scoring useful on its own.

But I think we can do a lot with adding text complexity to our article quality models. I created a task for you with a bunch of detail about how I'd approach the problem. See T246438: Add text complexity scoring to article quality models. Let me know if that would interest you.

@Halfak What happened to the solutions deployed before? Were they ineffective in solving the issue or were there other constraints?

The AI proposed in the original question could be replaced by a script that parses the JSON output of other NLP models we deploy, abstracting the flagging of complicated articles away from the NLP models themselves.

The benefits of this are twofold. First, it makes NLP models easier to code and replace. Second, it separates the two tasks and clearly demarcates responsibilities for new contributors.
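Such a flagging script could be quite small, since it only consumes the scorer's JSON document. The threshold values and function name below are hypothetical, chosen just to show the separation of concerns; the section data here is made up for illustration.

```python
import json

def flag_complex_sections(score_doc: str, grade_threshold: float = 12.0,
                          min_words: int = 100):
    """Given the scorer's JSON output, return the indexes of sections
    whose grade exceeds the threshold, ignoring very short sections."""
    doc = json.loads(score_doc)
    return [i for i, sec in enumerate(doc.get("sections", []))
            if sec["words"] >= min_words and sec["grade"] > grade_threshold]

# Hypothetical scorer output: one long complex section, one short one.
example = json.dumps({
    "grade": 8.1,
    "sections": [
        {"level": 0, "words": 203, "grade": 6.8},
        {"level": 1, "words": 231, "grade": 13.5},
        {"level": 2, "words": 40, "grade": 15.0},
    ],
})
print(flag_complex_sections(example))
```

Swapping in a different underlying NLP model then only requires that the new model emit the same JSON shape.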

What do you think?

As far as T246438 is concerned, I would love to help out with that and see through its deployment.

The biggest limitation of the solution deployed previously is that it didn't have a clear use-case. E.g., let's say that an article scores an 8.3 Flesch reading ease. What does that mean for the article? Is it too high? Too low? Is it merely an artifact of the article's general topic space? The nice thing about incorporating these signal sources into our article quality models is that the model can work out these details for us and give us actionable feedback that could direct work.

Right, so what we are doing in T246438 is essentially incorporating complexity into article quality, leaving this task redundant.

@Halfak Should we mark this task as duplicate?

I think we should decline this as it doesn't look like we want to deploy this. But we would like to do something different with T246438.