
Modify current quality metrics
Closed, Resolved (Public)

Description

Improve the current quality score metric by:

  • Aggregating revision quality per month by considering only the quality of the last revision of the month
  • Adding another metric view with the % of articles in a group that are above a certain threshold:
      > 0.36 (Stub+)
      > 0.54 (C+)
      > 0.65 (B+)
      > 0.78 (GA+)
      > 0.88 (FA)

Thresholds obtained as follows: the upper limit of each quality class is the median predicted quality score of revisions corresponding to such class as labeled by editors.
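The two aggregation steps described above can be sketched as follows. This is a minimal illustration rather than the pipeline's actual code: the input shape (tuples of article id, ISO timestamp, and score) and the function names are assumptions made for the example; only the threshold values come from this task.

```python
from collections import defaultdict

# Thresholds from the task description (predicted quality score).
THRESHOLDS = {"Stub+": 0.36, "C+": 0.54, "B+": 0.65, "GA+": 0.78, "FA": 0.88}


def last_revision_per_month(revisions):
    """Keep, for each (article, month), only the score of the latest revision.

    `revisions` is an iterable of (article_id, iso_timestamp, score) tuples.
    This input shape is hypothetical, chosen for the example.
    """
    latest = {}
    for article_id, timestamp, score in revisions:
        key = (article_id, timestamp[:7])  # month as "YYYY-MM"
        # ISO timestamps in the same format compare correctly as strings.
        if key not in latest or timestamp > latest[key][0]:
            latest[key] = (timestamp, score)
    return {key: score for key, (_, score) in latest.items()}


def share_above_thresholds(monthly_scores):
    """Return {month: {label: % of articles whose score exceeds the threshold}}."""
    by_month = defaultdict(list)
    for (_, month), score in monthly_scores.items():
        by_month[month].append(score)
    return {
        month: {
            label: 100.0 * sum(s > t for s in scores) / len(scores)
            for label, t in THRESHOLDS.items()
        }
        for month, scores in by_month.items()
    }


# Made-up example data: article "a" has two revisions in January.
revisions = [
    ("a", "2023-01-03T10:00:00", 0.30),
    ("a", "2023-01-20T09:00:00", 0.60),
    ("b", "2023-01-11T12:00:00", 0.40),
]
monthly = last_revision_per_month(revisions)
shares = share_above_thresholds(monthly)
```

Only the later revision of article "a" contributes to January, so an article edited many times in a month does not weigh more than one edited once.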

Event Timeline

FYI, I computed the monthly median quality of the last revision available of each article (I mean "available" because many articles might not have been updated in a given month). Data is stored at riskobservatory.predicted_quality_last_revisions and exposed in the v2 of the Risk Observatory https://superset.wikimedia.org/superset/dashboard/p/GLzB1ZorMdy (I also computed the yearly quartiles to expose this data with boxplots https://superset.wikimedia.org/superset/dashboard/p/3j5vK0DBYzl)

@Fabian let's put this on hold, as I learnt that scores are not directly comparable across languages. We should probably think of a way of getting thresholds relative to each language.

Updates after a chat with @Isaac:

  • Scores are in a way comparable across languages: for example, a score of 0.5 for an article in itwiki means that the article looks like the average article in itwiki in terms of quality. So articles in different languages with quality=0.5 might have different quality grades on an absolute scale, but they all have the "average quality" for their own wiki. As another example, if the majority of articles in a wiki are stubs, articles that are “good stubs” will get a score higher than 0.5. This allows for a flexible metric and avoids mapping all wikis to English Wikipedia standards, while still allowing us to say things like: this month, the number of articles of higher-than-average quality increased by x%. It also means that we need to choose thresholds according to the levels of quality we want to see, relative to each wiki.
  • If we really want to score all articles with the same system, it is still possible. We could use the English model to predict article quality in all languages. Then, using the thresholds above, we could measure how many of the articles in all wikis look like Stub/Start/C/B/GA/FA according to enwiki standards.

From @Isaac too: Finally, we could do something like the above, but instead of trying to choose a quality class and explain that, just use the model to set basic standards around the features we care about, e.g., # of articles with at least 5 references, 2 sections, an infobox, an image, and 3 categories, or something like that. That's way more transparent.

@Fabian, with the help of @Pablo and @Isaac, we have defined a new set of rules to create a new "global quality score" that reflects a minimum acceptable quality across Wikipedias.

We looked at the association between page properties (length, number of images, references, etc.) and community-assigned quality labels from 5 wikis: Hungarian, Turkish, English, French and Arabic. Our initial recommendation is that an article is of acceptable quality if it meets at least 5 of the following 6 criteria:

  • It should be at least 8 kB in size
  • It should have at least 1 category
  • It should have at least 7 sections
  • It should be illustrated with 1 or more images
  • It should have at least 4 references
  • It should have 2 or more intra-wiki links

In practice, part of the code to calculate this is in here, in the section "Relaxing Constraints": https://public-paws.wmcloud.org/User:Miriam_(WMF)/analyze_standard_quality.ipynb (allsample is the dataframe with quality features from all wikis).
The final metric would be the % of articles that meet at least 5 of the 6 criteria above.
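As a rough sketch of the rule above (not the notebook's code), the check could look like the following. The feature names in `CRITERIA` are hypothetical, and treating 8 kB as 8000 bytes is an assumption, since the comment does not say whether kB means 1000 or 1024 bytes.

```python
# Minimum value per feature. Feature names are hypothetical; the minimums
# are the ones listed in the comment above.
CRITERIA = {
    "page_length_bytes": 8000,  # assumption: 8 kB taken as 8000 bytes
    "categories": 1,
    "sections": 7,
    "images": 1,
    "references": 4,
    "wikilinks": 2,
}
MIN_CRITERIA_MET = 5  # "at least 5 of the 6 criteria"


def criteria_met(article):
    """Count how many criteria an article satisfies; missing features count as 0."""
    return sum(article.get(name, 0) >= minimum for name, minimum in CRITERIA.items())


def is_standard_quality(article):
    """True if the article meets at least MIN_CRITERIA_MET criteria."""
    return criteria_met(article) >= MIN_CRITERIA_MET


def percent_standard_quality(articles):
    """Percentage of articles flagged as standard quality."""
    if not articles:
        return 0.0
    return 100.0 * sum(is_standard_quality(a) for a in articles) / len(articles)


# Made-up example articles.
good = {"page_length_bytes": 9000, "categories": 3, "sections": 4,
        "images": 1, "references": 10, "wikilinks": 50}
poor = {"page_length_bytes": 2000, "categories": 1, "sections": 2,
        "images": 0, "references": 1, "wikilinks": 5}
result = percent_standard_quality([good, poor])
```

Here `good` fails only the sections criterion and still passes, which is the effect of relaxing the rule from 6 of 6 to 5 of 6.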

How long do you think it would take to compute this new metric based on existing features? If this is generally feasible, I will update the task description with the details of the new metric.

leila triaged this task as Medium priority. Apr 4 2023, 7:56 PM
leila raised the priority of this task from Medium to High.

Created an initial implementation of 'standard quality' as a feature metric of the knowledge gap pipelines (MR)

The standard quality flag is available for all historical revisions, e.g. for the article on the Arc de triomphe on fr wiki:

image.png (424×1 px, 41 KB)

The MR also takes a first stab at forward filling missing data for quality scores, but this is more subtle than expected. For distributed compute we prefer to operate on sparse data: if there is no revision to a given article in a given month, instead of counting a 0 we simply leave that datapoint out. For the reader/contributor based metrics that makes sense, since they measure activity. For content based metrics like quality scores, however, we want to know the state of an article in a given month even if there have been no events that would "cause" that article to appear in the data.

The forward fill implemented in the MR is not sufficient, as we can't forward fill a missing article score (a NaN value) if the row itself is not present in the dataset. Due to the nature of the nested aggregation done for the content gaps pipeline (the metric features are aggregated, then re-aggregated for each content gap), this means we need to make the metric features non-sparse as a whole. I will implement/test this next week.
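To illustrate the problem on a small scale, here is a plain-Python sketch of making sparse monthly scores dense before forward filling. It is not the MR's implementation (which runs as a distributed job); the data layout and the function name are assumptions for the example, as is the choice to emit nothing for months before an article's first score.

```python
def forward_fill_dense(sparse_scores, months):
    """Carry each article's last known score forward over every later month.

    `sparse_scores` maps (article_id, month) to a score and only contains
    months in which the article was revised. `months` is the full, ordered
    list of "YYYY-MM" strings to cover. Both shapes are hypothetical.
    """
    articles = sorted({article for article, _ in sparse_scores})
    dense = {}
    for article in articles:
        last_seen = None
        for month in months:
            if (article, month) in sparse_scores:
                last_seen = sparse_scores[(article, month)]
            # Assumption: months before the first score stay absent,
            # since there is nothing to carry forward yet.
            if last_seen is not None:
                dense[(article, month)] = last_seen
    return dense


# Made-up example: article "a" is not revised in February,
# article "b" first appears in February.
sparse = {
    ("a", "2023-01"): 0.40,
    ("a", "2023-03"): 0.55,
    ("b", "2023-02"): 0.70,
}
dense = forward_fill_dense(sparse, ["2023-01", "2023-02", "2023-03"])
```

The key point is that the loop iterates over every (article, month) pair rather than only the rows present in the input, which is what creates the rows a plain forward fill would have nothing to fill.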

Resolving this task as the metric has been released and internally shared