Page MenuHomePhabricator

Modify current quality metrics
Open, Needs TriagePublic

Description

Improve the current quality score metric by:

  • Aggregating revision quality per month by just considering the quality of the last revision of the month
  • Adding another metric view with the % of articles in a group that are above a certain threshold:
> 0.36 (Stub+);
> 0.54 (C+)
> 0.65 (B+)
> 0.78 (GA+)
> 0.88 (FA)

Thresholds obtained as follows: the upper limit of each quality class is the median predicted quality score of revisions corresponding to such class as labeled by editors.

Event Timeline

FYI, I computed the monthly median quality of the last revision available of each article (I mean "available" because many articles might not have been updated in a given month). Data is stored at riskobservatory.predicted_quality_last_revisions and exposed in the v2 of the Risk Observatory https://superset.wikimedia.org/superset/dashboard/p/GLzB1ZorMdy (I also computed the yearly quartiles to expose this data with boxplots https://superset.wikimedia.org/superset/dashboard/p/3j5vK0DBYzl)

@Fabian let's put this on hold as I learnt that scores are comparable across languages. We should probably think of a way of getting thresholds relative to each language.

Updates after a chat with @Isaac:

  • Scores are in a way comparable across languages: for example, a score of 0.5 for an article in itwiki means that the article looks like the average article in itwiki in terms of quality. So articles in different languages with quality=0.5 might have different quality grades in absolute scale, but they all have the "average quality" for their own wiki. Another example, if in a wiki, the majority of articles is stubs, articles that are “good stubs” will get a score higher than 0.5. This will allow a flexible metric and avoid mapping all wikis to English Wiki standards, while still allowing us to say things like: this month, articles of higher quality than average increased by x%. This means that we need to choose thresholds according to what levels of quality we want to see, relative to each wiki.
  • If we want to really score all articles with the same system, it is still possible. We could use the english model to predict article quality in all languages. Then, using the thresholds above, we could measure how many of the articles in all wikis look like Stub/Start/C/B/GA/FA according to enwiki standards.

From @Isaac too: Finally we could do something like above but instead of trying to choose a quality class and explain that, just use the model to set basic standards around the features we care about -- e.g., # of articles with at least 5 references, 2 sections, an infobox, an image, and 3 categories, or something like that. that's way more transparent.

@Fabian, with the help of @Pablo and @Isaac, we have defined a new set of rules to create a new "global quality score" that reflects a minimum acceptable quality across Wikipedias.

We look at the association between page properties (length, number of images, references, etc.) and community-assigned quality labels from 5 wikis – hungarian, turkish, english, french and arabic. Our initial recommendation is that an article is of acceptable quality if it meets at least 5 of the 6 following criteria:

  • It should be at least 8kB long in size
  • It should have at least 1 category
  • It should have at least 7 sections
  • It should be illustrated with 1 or more images
  • Its references should be at least 4
  • It should have 2 or more intra wiki links.

In practice, part of the code to calculate this is in here, in the section "Relaxing Constraints": https://hub-paws.wmcloud.org/user/Miriam_%28WMF%29/lab/tree/analyze_standard_quality.ipynb (allsample is the dataframe with quality features from all wikis).
The final metric would be the % of articles that meet at least 5 or the 6 criteria above.

How long do you think would it take to compute this new metric based on existing features? If this is generally feasible, I will change the description of the task with this new metric details.