
Add text complexity scoring to article quality models
Open, Low, Public

Description

Text complexity is associated with quality; i.e., a Wikipedia article should have the right amount of complexity to maximize quality, and too much or too little can signal a poorly written section.

In this task, let's experiment with adding features to the article quality model that would allow us to score the complexity of text in an article. Then let's rebuild the article quality models and see if we get a fitness boost.

A primer on feature engineering in ORES/revscoring is here: https://github.com/wikimedia/revscoring/blob/master/ipython/feature_engineering.ipynb

We define the features for articlequality here: https://github.com/wikimedia/articlequality/blob/master/articlequality/feature_lists/enwiki.py

We might want to add something for breaking an article into sections here: https://github.com/wikimedia/revscoring/blob/master/revscoring/features/wikitext/datasources/parsed.py

It looks like we can use the get_sections() method of mwparserfromhell: https://mwparserfromhell.readthedocs.io/en/latest/api/mwparserfromhell.html#mwparserfromhell.wikicode.Wikicode.get_sections
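
For a quick sense of what that returns, here's a standalone example (illustrative only; the arguments shown are just one reasonable choice):

import mwparserfromhell

text = "Intro paragraph.\n\n== History ==\nSome history text.\n\n== Reception ==\nSome reception text."
wikicode = mwparserfromhell.parse(text)

# get_sections() returns a list of Wikicode objects, one per section;
# include_lead=True also includes the text before the first heading.
for section in wikicode.get_sections(include_lead=True):
    # strip_code() removes markup/templates, leaving roughly plain text.
    print(repr(str(section.strip_code())))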

We probably want to use a library like https://pypi.org/project/textstat/

I think we'll want something like this:

import textstat

from revscoring.datasources import revision_oriented as ro
from revscoring.datasources.meta import filters, mappers
from revscoring.features import Feature, wikitext
from revscoring.features.meta import aggregators


def process_flesch(text):
    # Only score reasonably long strings; very short sections produce
    # meaningless readability values.
    if text is not None and len(text) >= 100:
        return textstat.flesch_reading_ease(text)
    else:
        return None


def clean_section(section):
    # Convert a section's Wikicode into plain text.
    return str(section.strip_code())


# Plain-text strings, one per section.
section_strs = mappers.map(clean_section, wikitext.revision.datasources.sections)
# Flesch reading ease per section, dropping sections too short to score.
section_flesches = filters.not_none(mappers.map(process_flesch, section_strs))

# Flesch reading ease of the full revision text.
text_flesch = Feature("wikitext.revision.text.flesch", process_flesch,
                      depends_on=[ro.revision.text])
min_section_flesch = aggregators.min(
    section_flesches, name="wikitext.revision.sections.min_flesch")
max_section_flesch = aggregators.max(
    section_flesches, name="wikitext.revision.sections.max_flesch")
mean_section_flesch = aggregators.mean(
    section_flesches, name="wikitext.revision.sections.mean_flesch")

text_complexity = [
    text_flesch,
    min_section_flesch,
    max_section_flesch,
    mean_section_flesch,
    min_section_flesch - text_flesch,
    max_section_flesch - text_flesch,
    mean_section_flesch - text_flesch
]
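
If these work out, rebuilding the models should mostly be a matter of adding the new list to the feature list that the enwiki model trains on. A minimal sketch, assuming the existing list in enwiki.py is the wp10 list referenced by the model configuration (check the actual variable name in the file):

# articlequality/feature_lists/enwiki.py (sketch)
# "wp10" is assumed to be the existing feature list; "text_complexity" is the
# list defined above.
wp10 = wp10 + text_complexity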

Event Timeline

Restricted Application added a subscriber: Aklapper.
Chtnnh subscribed.

@Halfak It might take me some time to get familiar with the Wikimedia code already used here, but I am on this. I will reach out to you here if I hit a roadblock.

Thanks,
Chaitanya

Sounds great! If you do IRC, please join us in #wikimedia-ai on Freenode. Our team and the volunteers who work with us generally hang out there.

Here's an example of how you can manually run a "parsed wikitext" datasource from the revscoring repository:

$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26) 
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from revscoring.features import wikitext
>>> from revscoring.dependencies import solve
>>> from revscoring.datasources import revision_oriented as ro
>>> solve(wikitext.revision.datasources.templates, cache={ro.revision.text: "Foo bar {{baz|key=value}}."})
['{{baz|key=value}}']
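
Once the sections datasource and the Flesch features sketched in the description exist, they should be solvable the same way. Hedged: section_flesches and text_flesch are the names from that sketch and are not part of revscoring yet.

>>> sample = "Intro text.\n\n== History ==\nSome readable prose about the topic, long enough to score."
>>> solve(text_flesch, cache={ro.revision.text: sample})        # one Flesch score for the whole text
>>> solve(section_flesches, cache={ro.revision.text: sample})   # one score per section long enough to score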

Screenshot from 2020-02-29 00-29-50.png (1×1 px, 265 KB)

I have double checked the path, but the error persists.

The get_sections() method is specified in the Wikicode class in wikicode.py in the mwparserfromhell repo.

Please let me know where I am going wrong here.


get_sections is defined like strip_code in the parser library. If you try something similar to how strip_code is called, it should work:

self.sections = execute_method(
    "get_sections", self.wikicode,
    name=self._name + ".content"
)
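
For comparison, the strip_code datasource this mirrors is created in parsed.py along roughly these lines (shown from memory, so verify against the file). Note that the ".content" suffix above was carried over from that call; the sections datasource would presumably want its own name, such as ".sections".

self.content = execute_method(
    "strip_code", self.wikicode,
    name=self._name + ".content"
)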
This comment was removed by Chtnnh.

@Halfak I think the code you suggested might be better placed in a file other than enwiki.py, since that file contains only feature definitions and no method definitions. Maybe I could add the methods to revscoring/features and then call them from enwiki.py.

@Sumit Hi! I was just thinking of something along the lines of what you suggested. I have coded it out. Thanks so much for your help.

Have a look

Screenshot from 2020-02-29 12-58-59.png (1×1 px, 145 KB)

Please test-run your solutions locally. If it runs and gives the expected results, submit a PR so it can be reviewed; if it doesn't, ask for help with the error. A screenshot of code doesn't give much context to comment on.

This comment was removed by Chtnnh.

@Sumit Thanks for the suggestion! No errors thrown; the PR has been submitted.

@Halfak, what's left on this task? I have submitted a PR for your review.

https://github.com/wikimedia/articlequality/pull/106

Please do have a look and let me know.

We talked about this in IRC, but it looks like performance for this feature is pretty bad. I don't think that is due to textstat. I think it's mwparserfromhell pulling out the sections. See https://gist.github.com/halfak/d3a9635791f98e0105302b7dfd2ca117 for my tests.
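
A quick way to confirm that is to time the parsing/section-extraction step separately from the textstat step on the same text. A rough sketch (the sample file path is hypothetical; any article's wikitext will do):

import time
import mwparserfromhell
import textstat

def time_step(label, fn):
    # Run fn once and report the wall-clock time spent in it.
    start = time.time()
    result = fn()
    print("%s took %.4f seconds" % (label, time.time() - start))
    return result

text = open("sample_article.wikitext").read()  # hypothetical local copy of an article

wikicode = time_step("Parsing", lambda: mwparserfromhell.parse(text))
sections = time_step("get_sections", lambda: wikicode.get_sections(include_lead=True))
plain = time_step("strip_code", lambda: [str(s.strip_code()) for s in sections])
scores = time_step("textstat", lambda: [textstat.flesch_reading_ease(s) for s in plain if len(s) >= 100])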

To calculate the Flesch readability index of text, we tried three different modules: textstat, textatistic, and py-readability-measures. Among the three, textstat gives the best performance, as indicated by the results below. Results for textatistic are not included because it was throwing errors.

@Halfak, are we going forward with textstat?

Wikitext parsing took 0.21138763427734375 seconds
Parsing 28 sections took 0.1638188362121582 seconds
[-31.94, -124.75, 58.96, 32.71285714285714, -92.81, 90.9, 64.65285714285714]
Processing took 1.327301025390625 seconds
Extracting Sections took 0.21630120277404785 seconds
Mapping flesch reading ease to sections took 0.02088451385498047 seconds
Not None function took 0.00035452842712402344 seconds
Feature took 8.58306884765625e-06 seconds
Textstat processing took 9.298324584960938e-06 seconds
py-r took 0.6828279495239258 seconds
feature.wikitext.revision.text.flesch -31.94
Processing took 6.175041198730469e-05 seconds
feature.wikitext.revisions.sections.min_flesch -124.75
Processing took 0.1640794277191162 seconds
feature.wikitext.revisions.sections.max_flesch 58.96
Processing took 0.11082124710083008 seconds
feature.wikitext.revisions.sections.mean_flesch 32.71285714285714
Processing took 0.18544840812683105 seconds
feature.(wikitext.revisions.sections.min_flesch - wikitext.revision.text.flesch) -92.81
Processing took 0.18747711181640625 seconds
feature.(wikitext.revisions.sections.max_flesch - wikitext.revision.text.flesch) 90.9
Processing took 0.2514066696166992 seconds
feature.(wikitext.revisions.sections.mean_flesch - wikitext.revision.text.flesch) 64.65285714285714
Processing took 0.18683099746704102 seconds

@Halfak Apart from writing the report on this task, is there anything left for us to do? Can we mark this task as resolved?

Halfak triaged this task as Low priority.May 4 2020, 5:03 PM
Halfak moved this task from Unsorted to New development on the Machine-Learning-Team board.

Removing task assignee due to inactivity as this open task has been assigned for more than two years. See the email sent to the task assignee on August 22nd, 2022.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome!
If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!