Identify a set of articles for which a simplified version might be most useful
Closed, Resolved (Public)

Description

We would like to put together a list of articles where automatically generated simplified versions would be most useful (say 100-1000 articles).

Criteria for most useful are:

  • difficult to read, e.g. articles that receive a high readability score from the multilingual readability model
  • a simple version does not exist in Simple Wikipedia
  • long text (specifically the lead section)
  • many pageviews (high demand)
  • few/no edits (not controversial, not about recent events, etc.)

Notes:

  • It makes sense to start with English Wikipedia first. But the same analysis can be easily expanded to other languages as well (Simple Wikipedia only exists for English).

Event Timeline

Weekly update:

  • started data collection for articles from English Wikipedia
  • thinking through other potential metrics to consider; consulting with Design Research on previous insights

Weekly update:

  • Identified 3 potential approaches for prioritization. I started to explore those with articles in enwiki.

Topics:
Some topics are generally more difficult to read/understand than others. I calculated the Flesch-Kincaid grade level (FKGL) for all articles and averaged it within each of the 64 topics from the article-topic model, and found that readability varies substantially across topics. This suggests prioritizing articles from topics that are more difficult to read.

  • average readability score for all articles (FKGL): 11.7
  • highest readability scores (difficult to read): mostly STEM (e.g. Physics, Mathematics, Medicine & Health) (FKGL>14)
  • lowest readability scores (easy to read): Culture and Geography (e.g. Sports, Music, Geographical) (FKGL<11)
  • see below for full list
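As a rough illustration of the per-topic averaging described above, the following sketch computes FKGL with the standard formula (0.39 × words per sentence + 11.8 × syllables per word − 15.59) and averages it by topic. The tokenization and the vowel-group syllable counter are simplifying assumptions, and the function names are hypothetical; this is not the code used for the numbers in this task.

```python
import re
from collections import defaultdict


def count_syllables(word):
    """Crude heuristic (assumption): one syllable per group of vowels, at least 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fkgl(text):
    """Flesch-Kincaid grade level: 0.39*W/S + 11.8*Syl/W - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def mean_fkgl_by_topic(articles):
    """articles: iterable of (topic, text) pairs; returns {topic: mean FKGL}."""
    scores = defaultdict(list)
    for topic, text in articles:
        scores[topic].append(fkgl(text))
    return {topic: sum(vals) / len(vals) for topic, vals in scores.items()}
```

A real pipeline would likely need a proper sentence splitter and would have to decide how to handle articles assigned to more than one topic; the sketch assumes one topic label per article.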

difficult to read / high pageviews / few edits:
On an article level, we could anticipate a need for a simpler version if an article i) is difficult to read, ii) has many pageviews, and iii) has little or no editing activity (i.e. does not seem controversial). Specifically, I consider the following metrics for prioritization:

  • readability score (FKGL)
  • page-length (# characters of the lead section)
  • whether a corresponding article exists in Simple English Wikipedia
  • number of pageviews in a given month (e.g. January 2025)
  • number of edits in a given month (e.g. January 2025)

I fix the following criteria:

  • in the top 5% of articles by pageviews
  • no edits
  • a reasonable length (i.e. between 1,000 and 100,000 characters)
  • no existing article in Simple English Wikipedia
  • no list or disambiguation pages

This yields 25,478 articles overall. From these, one can select subsets of articles at different levels of reading difficulty:

  • at least slightly difficult to read (FKGL>12): 15,861 articles
  • at least difficult to read (FKGL>15): 5,087 articles
  • at least very difficult to read (FKGL>18): 1,085 articles
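The selection above can be sketched as a simple filter. The `Article` record, its field names, and the helper functions are hypothetical and only mirror the metrics listed in this comment; the thresholds are the ones stated above, and the top-5% cutoff is computed over whatever set of articles is passed in.

```python
from dataclasses import dataclass


@dataclass
class Article:
    # Hypothetical record; one field per metric listed above.
    title: str
    fkgl: float
    length_chars: int
    pageviews: int
    edits: int
    in_simplewiki: bool
    is_list_or_disambig: bool


def pageview_cutoff(articles, top_fraction=0.05):
    """Smallest pageview count that still falls in the top `top_fraction`."""
    ranked = sorted((a.pageviews for a in articles), reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return ranked[k - 1]


def select_candidates(articles, min_fkgl=12.0):
    """Apply the fixed criteria, then keep articles with FKGL above `min_fkgl`."""
    cutoff = pageview_cutoff(articles)
    return [
        a for a in articles
        if a.pageviews >= cutoff              # top 5% by pageviews
        and a.edits == 0                      # no edits in the month
        and 1_000 <= a.length_chars <= 100_000
        and not a.in_simplewiki
        and not a.is_list_or_disambig
        and a.fkgl > min_fkgl
    ]
```

Calling `select_candidates(articles, min_fkgl=15.0)` or `min_fkgl=18.0` would give the stricter subsets. Note that ties at the pageview cutoff are all kept, so the result can slightly exceed 5% of the input.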

Maintenance templates:
Another option is to consider articles which are marked with relevant templates. Specifically, in enwiki I found two templates which could indicate a need for simplification:

  • {{Confusing}}: "This article may be confusing or unclear to readers."
  • {{Technical}}: "This article may be too technical for most readers to understand."

Next step: Identify the set of articles marked with these templates

Appendix:

  • List of average readability scores for different topics (articles from enwiki)
Topic                                         FKGL
Culture.Linguistics                           15.177529
STEM.Physics                                  14.566889
STEM.Mathematics                              14.334808
STEM.Medicine_&_Health                        14.246074
STEM.Libraries_&_Information                  13.900703
STEM.Computing                                13.766753
STEM.Chemistry                                13.483539
History_and_Society.Society                   13.444819
STEM.Technology                               13.304082
History_and_Society.Politics_and_government   13.246235
History_and_Society.Business_and_economics    13.168493
Culture.Media.Software                        13.143267
History_and_Society.Education                 12.589726
History_and_Society.Military_and_warfare      12.531219
STEM.Earth_and_environment                    12.513622
Culture.Philosophy_and_religion               12.350245
Geography.Regions.Europe.Southern_Europe      12.231398
Geography.Regions.Americas.South_America      12.180362
STEM.Engineering                              12.124312
STEM.STEM*                                    12.035751
Geography.Regions.Asia.Southeast_Asia         12.015204
Culture.Visual_arts.Fashion                   12.012407
Culture.Performing_arts                       11.99525
Culture.Internet_culture                      11.994836
Culture.Media.Entertainment                   11.975434
Geography.Regions.Africa.Eastern_Africa       11.951983
Culture.Media.Television                      11.951682
Geography.Regions.Europe.Eastern_Europe       11.839796
Culture.Literature                            11.838232
Geography.Regions.Americas.Central_America    11.81081
Geography.Regions.Americas.North_America      11.783516
Culture.Media.Video_games                     11.778382
Geography.Regions.Africa.Africa*              11.765621
Culture.Biography.Biography*                  11.736714
Culture.Visual_arts.Comics_and_Anime          11.72916
Culture.Biography.Women                       11.716425
Geography.Regions.Africa.Southern_Africa      11.714956
History_and_Society.History                   11.657674
Geography.Regions.Africa.Western_Africa       11.628558
Culture.Visual_arts.Visual_arts*              11.622781
STEM.Space                                    11.561805
Geography.Regions.Africa.Northern_Africa      11.55681
Culture.Food_and_drink                        11.471398
Geography.Regions.Oceania                     11.447479
Geography.Regions.Asia.Asia*                  11.446472
Geography.Regions.Asia.North_Asia             11.436199
Geography.Regions.Europe.Europe*              11.432898
Geography.Regions.Africa.Central_Africa       11.408523
Culture.Visual_arts.Architecture              11.394815
Geography.Regions.Asia.South_Asia             11.368315
Geography.Regions.Asia.West_Asia              11.366012
History_and_Society.Transportation            11.299748
Culture.Media.Media*                          11.241043
Culture.Media.Books                           11.231328
Geography.Regions.Asia.East_Asia              11.173024
Culture.Media.Radio                           11.080318
Geography.Regions.Asia.Central_Asia           11.062099
Geography.Regions.Europe.Northern_Europe      11.006349
Culture.Media.Films                           10.963442
Geography.Regions.Europe.Western_Europe       10.877386
STEM.Biology                                  10.807183
Culture.Media.Music                           10.778768
Culture.Sports                                10.710971
Geography.Geographical                        10.533587

Weekly update:

  • Performed a detailed analysis of the third option, using maintenance templates
  • Identified all articles in English Wikipedia that use the template {{Confusing}} or {{Technical}} (or any of its redirects) in the lead section of the article (some articles contain these templates only in specific sections; those were discarded in this case).
  • this yields 3,708 articles
  • the average readability (FKGL) of those articles is 14.0. This is substantially higher than the average readability of all articles (11.7)
  • Thus, the two selected maintenance templates seem promising options to identify articles that could benefit from simplification.
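A minimal sketch of the lead-section check described above, assuming the raw wikitext of each article is available as a string. The regular expressions are simplifications (they do not handle every wikitext edge case), the function names are hypothetical, and the `aliases` set contains only the two template names from this task; redirect names would have to be looked up and added by the caller.

```python
import re


def lead_section(wikitext):
    """Return the wikitext before the first section heading (== ... ==)."""
    match = re.search(r"^==+.*?==+[ \t]*$", wikitext, flags=re.MULTILINE)
    return wikitext[: match.start()] if match else wikitext


def template_names(wikitext):
    """Names of templates used in the text, with the first letter upper-cased."""
    names = set()
    for raw in re.findall(r"\{\{\s*([^{}|]+?)\s*(?:\||\}\})", wikitext):
        name = raw.replace("_", " ").strip()
        if name:
            names.add(name[0].upper() + name[1:])
    return names


def tagged_in_lead(wikitext, aliases=frozenset({"Confusing", "Technical"})):
    """True if any template in `aliases` appears in the lead section."""
    return bool(template_names(lead_section(wikitext)) & aliases)
```

Articles where the template appears only below a section heading return `False`, which matches the choice above to discard section-level uses.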

Next step: writing up findings about different options for prioritization and adding code to the repo.

Weekly update:

  • updated the analysis to remove disambiguation and list pages; these are not relevant for simplification since they only contain lists of links to other articles. There are many of them (e.g. 300K disambiguation pages in enwiki alone), and they generally tend to have high FKGL scores, which skews the overall stats.
  • generated list of 1000 example articles for each of the three approaches (spreadsheet)
  • shared results with @ovasileva: the lists of articles are a starting point for potential discussions with communities about the problem of difficult-to-read articles

Closing the task as resolved as the initial goal of this analysis is completed. If there are follow-up questions, these will be tracked in other tasks.