Text complexity is associated with quality: a Wikipedia article should have the right amount of complexity to maximize quality. Too much or too little can signal a poorly written section.
In this task, let's experiment with adding features to the article quality model that would allow us to score the complexity of text in an article. Then let's rebuild the article quality models and see if we get a fitness boost.
A primer on feature engineering in ORES/revscoring is here: https://github.com/wikimedia/revscoring/blob/master/ipython/feature_engineering.ipynb
We define the features for articlequality here: https://github.com/wikimedia/articlequality/blob/master/articlequality/feature_lists/enwiki.py
We might want to add something for breaking an article into sections here: https://github.com/wikimedia/revscoring/blob/master/revscoring/features/wikitext/datasources/parsed.py
It looks like we can use the get_sections() method of mwparserfromhell: https://mwparserfromhell.readthedocs.io/en/latest/api/mwparserfromhell.html#mwparserfromhell.wikicode.Wikicode.get_sections
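To make the section-splitting idea concrete, here is a rough standard-library sketch. It is not mwparserfromhell: it uses a regex over heading lines as a stand-in, and the names `HEADING_RE` and `split_sections` are made up for illustration. It returns flat, non-nested sections, which may differ from how get_sections() groups subsections.

```python
import re

# Matches a wikitext heading line such as "== History ==" (levels 2-6).
# Hypothetical helper; a real implementation would rely on the parser.
HEADING_RE = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def split_sections(wikitext):
    """Return (heading, body) pairs; the lead section has heading None."""
    sections = []
    heading = None
    start = 0
    for match in HEADING_RE.finditer(wikitext):
        # Close the section that ends where this heading begins.
        sections.append((heading, wikitext[start:match.start()].strip()))
        heading = match.group(2)
        start = match.end()
    # Whatever follows the last heading is the final section.
    sections.append((heading, wikitext[start:].strip()))
    return sections


sample = "Lead text.\n== History ==\nOld events.\n=== Early ===\nVery old.\n"
sections = split_sections(sample)
```

Each body could then be scored separately, which is what the per-section features below need.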
We probably want to use a library like https://pypi.org/project/textstat/
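For reference, the Flesch reading ease score is 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words), where higher means easier to read. The sketch below computes it with the standard library only. It is not textstat's implementation: the syllable counter is a naive vowel-group heuristic, and `count_syllables` and `naive_flesch_reading_ease` are hypothetical names, so expect its numbers to differ from textstat's.

```python
import re


def count_syllables(word):
    """Naive heuristic: count runs of consecutive vowels, at least 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def naive_flesch_reading_ease(text):
    """Return the Flesch reading ease score, or None for empty text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return None
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


simple = naive_flesch_reading_ease("The cat sat. The dog ran.")
dense = naive_flesch_reading_ease(
    "Organizational heterogeneity necessitates comprehensive "
    "institutional reconsideration.")
```

Short, plain sentences score high and long, polysyllabic ones score low, which is the signal we want the model to pick up.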
I think we'll want something like this:
import textstat
from revscoring import Feature
from revscoring.datasources import revision_oriented as ro
from revscoring.datasources.meta import filters, mappers
from revscoring.features import wikitext
from revscoring.features.meta import aggregators


def process_flesch(text):
    # Skip missing or very short text, where the score is not meaningful.
    if text is not None and len(text) >= 100:
        return textstat.flesch_reading_ease(text)
    else:
        return None


def clean_section(section):
    # Strip wiki markup so that only the readable text gets scored.
    return str(section.strip_code())


# NOTE: wikitext.revision.datasources.sections does not exist yet. It is the
# section-splitting datasource proposed above.
section_strs = mappers.map(clean_section, wikitext.revision.datasources.sections)
section_flesches = filters.not_none(mappers.map(process_flesch, section_strs))

# NOTE: process_flesch returns None for short text, which a float-valued
# feature will probably reject, so we may need a default value here.
text_flesch = Feature("wikitext.revision.text.flesch", process_flesch,
                      returns=float, depends_on=[ro.revision.text])
min_section_flesch = aggregators.min(
    section_flesches, name="wikitext.revision.sections.min_flesch")
max_section_flesch = aggregators.max(
    section_flesches, name="wikitext.revision.sections.max_flesch")
mean_section_flesch = aggregators.mean(
    section_flesches, name="wikitext.revision.sections.mean_flesch")

text_complexity = [
    text_flesch,
    min_section_flesch,
    max_section_flesch,
    mean_section_flesch,
    # How far the sections deviate from the article as a whole.
    min_section_flesch - text_flesch,
    max_section_flesch - text_flesch,
    mean_section_flesch - text_flesch
]
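To sanity-check what these features would produce without a revscoring setup, here is a plain Python sketch of the same aggregation. `summarize_scores` and its dictionary keys are hypothetical, and falling back to 0.0 when no section has a score is my assumption, not necessarily what the revscoring aggregators do.

```python
from statistics import mean


def summarize_scores(text_score, section_scores):
    """Mirror the proposed features: min/max/mean and their offsets."""
    # Drop sections that were too short to score (None).
    scored = [s for s in section_scores if s is not None]
    if scored:
        lo, hi, avg = min(scored), max(scored), mean(scored)
    else:
        # Assumption: use 0.0 when no section could be scored.
        lo = hi = avg = 0.0
    return {
        "text": text_score,
        "min": lo,
        "max": hi,
        "mean": avg,
        "min_diff": lo - text_score,
        "max_diff": hi - text_score,
        "mean_diff": avg - text_score,
    }


summary = summarize_scores(60.0, [50.0, None, 70.0, 90.0])
```

The difference values show whether an article has one section that is much harder or easier to read than the rest.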