
Prototype misalignment API
Open, Needs Triage, Public

Description

Build API for article misalignment. This will have a few underlying components:

  • Language-agnostic quality model: map any Wikipedia article to a score in [0, 1], where 0 = no content and 1 ~ featured article.
  • Language-agnostic importance model: per last quarter's work (T272175#6894768), use pageviews as basic proxy for importance while other more flexible filters (e.g., topic, country, occupation) are developed. Normalize to [0-1] range.
  • Language-agnostic misalignment model: difference between quality and importance.
  • Language-agnostic misalignment metric: summarization of intensity of misalignment in a project.
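
A minimal sketch of how the last three components might fit together. Everything here is an assumption for illustration rather than the implemented model: the log scaling used to normalize pageviews, the sign convention (quality minus importance, so negative values mean an article's quality lags its importance), the choice of mean absolute misalignment as the project-level metric, and all function names.

```python
import math


def normalize_pageviews(pageviews):
    """Map raw pageview counts to [0, 1].

    Assumption: log-scale the counts and divide by the largest value in the
    wiki. The task only says "normalize to [0-1] range".
    """
    logs = [math.log1p(v) for v in pageviews]
    top = max(logs, default=0.0)
    if top == 0.0:
        return [0.0 for _ in logs]
    return [x / top for x in logs]


def misalignment(quality, importance):
    """Difference between quality and importance, both in [0, 1].

    Assumption: quality - importance, so the result lies in [-1, 1].
    """
    return quality - importance


def misalignment_metric(scores):
    """Summarize the intensity of misalignment in a project.

    Assumption: mean absolute misalignment over all articles.
    """
    if not scores:
        return 0.0
    return sum(abs(s) for s in scores) / len(scores)
```

For example, an article with quality 0.8 and importance 0.3 gets a misalignment of about 0.5 under this convention; a different summary statistic (e.g., share of articles beyond some cutoff) could be swapped into `misalignment_metric` without changing the rest.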

The API then has two pieces:

  • Precomputed scores for all wikis
  • Function for computing article misalignment for single articles on the fly
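
One way the two pieces could be combined, sketched under assumptions: the lookup table keyed by `(wiki, title)`, the function name `get_misalignment`, and the idea of falling back to on-the-fly computation when no precomputed score exists are all hypothetical, not a description of the actual API.

```python
def get_misalignment(wiki, title, precomputed, compute_on_the_fly):
    """Return a misalignment score for one article.

    precomputed: dict mapping (wiki, title) -> score (hypothetical store
        standing in for the precomputed scores for all wikis).
    compute_on_the_fly: callable (wiki, title) -> score, used when the
        article has no precomputed score.
    """
    key = (wiki, title)
    if key in precomputed:
        return precomputed[key]
    return compute_on_the_fly(wiki, title)
```

Keeping the on-the-fly function as a parameter means the same entry point works whether or not precomputed scores are available for a given wiki.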

Event Timeline

Weekly updates:

  • wrote up quality model description: https://meta.wikimedia.org/wiki/Research:Prioritization_of_Wikipedia_Articles/Language-Agnostic_Quality
  • tested end-to-end pipeline for generating misalignment scores for articles, but I am going to split it into its individual components so it's easier to extract the data from intermediate steps for other purposes
  • so far, the entire approach to quality and demand has been wiki-specific. As a result, a top-quality article in Simple Wikipedia is far smaller than a top-quality article in English Wikipedia, and a top-demand article in Simple Wikipedia gets far fewer pageviews than a top-demand article in English Wikipedia. This makes plenty of sense but reduces the comparability of the resulting quality/demand/misalignment scores across languages. I'm considering whether there should be some language-agnostic thresholds: e.g., a top-quality article will have at least 10 sections regardless of language, and maybe this number can go higher in certain languages. Same for images and references. This is harder for page length, because whether you need e.g. 100 bytes or 1000 bytes depends heavily on the language, but perhaps some basic minimum threshold could still be set.
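
The threshold idea could be sketched roughly as follows. This is only an illustration: the 10-section floor comes from the example above, while the function names, the use of a per-wiki "top" value, and the capped-ratio normalization are assumptions. Floors for images, references, or page length are deliberately left out because no values have been chosen.

```python
# Language-agnostic minimums; only the sections value comes from the update
# above, and other features would need their own floors.
GLOBAL_FLOORS = {"sections": 10}


def feature_threshold(feature, wiki_top_value, floors=GLOBAL_FLOORS):
    """Value a feature must reach to count as top quality in a wiki.

    The wiki-specific value can raise the bar but never lower it below
    the language-agnostic floor.
    """
    return max(wiki_top_value, floors.get(feature, 0))


def normalized_feature(value, threshold):
    """Scale a raw feature value to [0, 1], capping at the threshold."""
    if threshold <= 0:
        return 0.0
    return min(1.0, value / threshold)
```

Under this sketch, a wiki whose top articles typically have 4 sections would still require 10 for a full score, while a wiki whose top articles have 15 sections would keep 15 as its threshold.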