
Develop Metrics for the Language Gap: Explore vital article coverage across Wikipedia language editions
Closed, ResolvedPublic

Description

Context:

As part of T376728, the following three language gap metrics were proposed for the Knowledge Gap index:

  1. Language representation across projects: which languages have which Wikimedia projects, and their level of representation. Similar to canonical wikis.csv, but would also include test projects in the Incubator, Multilingual Wikisource, and Wikiversity Beta; additionally, it would include linguistic, population, and geographic information for each language.
  2. Vital article coverage: Wikipedia language versions' coverage of vital articles, articles every Wikipedia should have, and/or topics for impact
  3. Language article coverage: Wikipedia language versions' coverage of articles about own language, related languages, and other relevant languages.

Purpose:

This task focuses on exploring and further developing #2, "Vital article coverage".

This potential dataset will provide an intersection of the language and the topic content gaps.

Analysis:

The analysis will try to answer the following question to start with; further exploration may be conducted based on the data gathered.

  • How does coverage of articles every Wikipedia should have vary by Wikipedia language edition (per article section), in terms of:
    • Article Quantity
    • Article Quality
    • Monthly Pageviews
    • Monthly Revisions

Q3 tasks:

  • Solicit feedback from Research team
  • Exploratory analysis
  • Share exploratory analyses with Community Growth and LPL teams
  • Finalize schema
  • Discuss and determine productionization possibilities

Event Timeline

Methodological Comparison (see notebook)

What is the best way to query the list of 1000 articles every Wikipedia should have?

Conclusion:

Method 1 is more up-to-date than Method 2 for querying the "1000 articles all Wikipedias should have" via the Wikidata Query Service. The list of tagged Wikidata items had 27 extra articles and was missing 1 article. After discussing with @Isaac, we determined that Method 1 (pagepile) is probably more desirable as a stable source of articles that we can version when updating (as opposed to the underlying data slowly shifting month-to-month).
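
For reference, "extra" and "missing" counts like the ones above come down to a set difference between two lists of QIDs. A minimal sketch with toy QIDs; the function and variable names are hypothetical and this is not the notebook's code:

```python
def compare_article_lists(reference_qids, candidate_qids):
    """Return (extra, missing) QIDs of the candidate list relative to the reference list."""
    reference = set(reference_qids)
    candidate = set(candidate_qids)
    extra = sorted(candidate - reference)    # in the candidate list only
    missing = sorted(reference - candidate)  # in the reference list only
    return extra, missing


# Toy data only: these are not the real lists.
reference_list = ["Q1", "Q2", "Q3"]
candidate_list = ["Q2", "Q3", "Q4", "Q5"]
extra, missing = compare_article_lists(reference_list, candidate_list)
print(len(extra), "extra;", len(missing), "missing")
```

Which list plays the role of the reference is a choice; swapping the arguments swaps the two results.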

Caveats for pagepile use:

  • May not reflect changes after 2024-10-20. See recent changes to the list.
  • We will still need to modify the SPARQL query to include all language versions, not just English.
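
One possible shape for a query covering all language versions is sketched below. It is an assumption about how the query could look, not the query used in the notebook: it builds a Wikidata Query Service query string that lists every Wikipedia sitelink for a set of QIDs, relying on the prefixes WDQS predefines (wd:, schema:, wikibase:). The helper name is hypothetical, and a list of 1000 QIDs may need to be sent in batches.

```python
def build_sitelink_query(qids):
    """Build a SPARQL query string listing all Wikipedia sitelinks for the given QIDs."""
    values = " ".join(f"wd:{qid}" for qid in qids)
    return f"""
SELECT ?item ?article ?wiki WHERE {{
  VALUES ?item {{ {values} }}
  ?article schema:about ?item ;
           schema:isPartOf ?wiki .
  ?wiki wikibase:wikiGroup "wikipedia" .
}}
""".strip()


# Toy QIDs only; the real input would be the versioned list of vital articles.
query = build_sitelink_query(["Q1", "Q2"])
print(query)
```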

Exploratory Analysis (see notebook)

Of the 1000 articles every Wikipedia should have (henceforth "vital articles"), how many exist on each language version of Wikipedia?

  • Max: all 1000 vital articles
  • Median: 393 vital articles
  • Min: 2 vital articles
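
A summary like this can be derived from a mapping of wiki to the vital-article QIDs it has. A minimal sketch with toy data (the input structure and names are assumptions, not the notebook's code):

```python
from statistics import median


def coverage_summary(qids_by_wiki):
    """Summarize how many vital articles exist per wiki (max, median, min)."""
    counts = [len(set(qids)) for qids in qids_by_wiki.values()]
    return {"max": max(counts), "median": median(counts), "min": min(counts)}


# Toy data only: three wikis and a three-item "vital" list.
toy = {
    "enwiki": {"Q1", "Q2", "Q3"},
    "eswiki": {"Q1", "Q2"},
    "frwiki": {"Q1"},
}
summary = coverage_summary(toy)
print(summary)
```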

Screenshot 2025-03-03 at 10.55.09 AM.png (1×1 px, 276 KB)

CMyrick-WMF changed the task status from Open to In Progress. Mar 14 2025, 5:32 PM
CMyrick-WMF updated the task description.

Schemas updated based on the Research team's content gap metric datasets schema (for Knowledge Gaps).

The updated schemas are available here and the relevant schemas are copied below:

Schema 2a: Coverage of 1000 articles every Wikipedia should have

  • wiki_db: Wikimedia database name (e.g., eswiki)
  • time_bucket: Time bucket, with monthly granularity (e.g. “2020-02”)
  • content_gap: Content gap this dataset pertains to (e.g., “topic-gap”)
  • category: The underlying categories for the gap. There will only be one category in this schema: it will be called something like “vital-articles” or “1000-vital-articles”
  • articles_created: Number of articles (from the list of 1000) which have been created, at the time of the time bucket
  • pageviews_sum: Total number of pageviews for the vital articles that Wikipedia has
  • pageviews_mean: Mean number of pageviews for the vital articles that Wikipedia has, at the time of the time bucket
  • revision_count: Total number of edits for the vital articles that Wikipedia has, at the time of the time bucket
  • quality_score: Average article quality score for the vital articles that Wikipedia has, at the time of the time bucket
  • standard_quality: Percentage of vital articles that Wikipedia has (at the time of the time bucket) that satisfy the Standard Quality Criteria
  • standard_quality_count: Number of vital articles that Wikipedia has (at the time of the time bucket) that satisfy the Standard Quality Criteria
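
To make the aggregation concrete, here is a rough sketch of how one Schema 2a row could be computed from per-article records for a single wiki and month. The ArticleStats record, the helper name, the underscore in articles_created, and the 0-100 scale for standard_quality are all assumptions for illustration, not the production pipeline:

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ArticleStats:
    """Hypothetical per-article record for one wiki and one month."""
    qid: str
    pageviews: int
    revisions: int
    quality_score: float
    meets_standard_quality: bool


def aggregate_row(wiki_db, time_bucket, articles):
    """Roll the vital articles that exist in one wiki up into a single Schema 2a row."""
    n = len(articles)
    standard_count = sum(1 for a in articles if a.meets_standard_quality)
    return {
        "wiki_db": wiki_db,
        "time_bucket": time_bucket,
        "content_gap": "topic-gap",
        "category": "vital-articles",
        "articles_created": n,
        "pageviews_sum": sum(a.pageviews for a in articles),
        "pageviews_mean": mean([a.pageviews for a in articles]) if n else None,
        "revision_count": sum(a.revisions for a in articles),
        "quality_score": mean([a.quality_score for a in articles]) if n else None,
        # Assumption: percentage expressed on a 0-100 scale.
        "standard_quality": 100 * standard_count / n if n else None,
        "standard_quality_count": standard_count,
    }


# Toy data only.
articles = [
    ArticleStats("Q1", 100, 4, 0.8, True),
    ArticleStats("Q2", 300, 6, 0.4, False),
]
row = aggregate_row("eswiki", "2020-02", articles)
print(row)
```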

Schema 2b: Coverage and status of each of the 1000 articles every Wikipedia should have

Same underlying data as in the schema above, but at the level of the Wikidata item associated with each article (i.e., the category field will have the 1000 QIDs for the 1000 items associated with each of the 1000 articles every Wikipedia should have)

  • wiki_db: Wikimedia database name (e.g., eswiki)
  • time_bucket: Time bucket, with monthly granularity (e.g. “2020-02”)
  • content_gap: Content gap this dataset pertains to (e.g., “topic-gap”)
  • category: The underlying categories for the gap; there will be 1000 categories (one QID for each item associated with an article every Wikipedia should have)
  • articles_created: Number of articles from the category which have been created, at the time of the time bucket. Because each category only contains individual QIDs,
    • “1” will indicate that the associated article has been created in that wiki, at the time of the time bucket
    • “0” will indicate that the associated article has not been created in that wiki, at the time of the time bucket
  • pageviews_sum: total number of pageviews for each category (i.e. total number of pageviews for each article associated with each QID) at the time of the time bucket
  • revision_count: Total number of edits for each category (i.e. total number of edits for each article associated with each QID) at the time of the time bucket
  • quality_score: Article quality score for each category (i.e. article quality score for each article associated with each QID) at the time of the time bucket
  • standard_quality_count: Number of vital articles that Wikipedia has (at the time of the time bucket) that satisfy the Standard Quality Criteria. Because each category only contains individual QIDs,
    • “1” will indicate that the associated article satisfies the Standard Quality Criteria.
    • “0” will indicate that the associated article does not satisfy the Standard Quality Criteria.
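
The per-QID rows described above could be built along these lines. This is a sketch under assumptions: the input structure and helper name are hypothetical, articles_created is written with an underscore, and the values used for articles that do not exist (0 for counts, None for the quality score) are a guess rather than a settled decision:

```python
def per_article_rows(wiki_db, time_bucket, vital_qids, stats_by_qid):
    """Emit one Schema 2b row per vital-article QID, including articles not yet created."""
    rows = []
    for qid in vital_qids:
        stats = stats_by_qid.get(qid)  # None when the wiki has no article for this QID
        exists = stats is not None
        rows.append({
            "wiki_db": wiki_db,
            "time_bucket": time_bucket,
            "content_gap": "topic-gap",
            "category": qid,
            "articles_created": 1 if exists else 0,
            "pageviews_sum": stats["pageviews"] if exists else 0,
            "revision_count": stats["revisions"] if exists else 0,
            "quality_score": stats["quality_score"] if exists else None,
            "standard_quality_count": 1 if exists and stats["meets_standard_quality"] else 0,
        })
    return rows


# Toy data only: Q1 exists in the wiki, Q2 does not.
toy_stats = {
    "Q1": {"pageviews": 100, "revisions": 4, "quality_score": 0.8, "meets_standard_quality": True},
}
rows = per_article_rows("eswiki", "2020-02", ["Q1", "Q2"], toy_stats)
for r in rows:
    print(r["category"], r["articles_created"], r["standard_quality_count"])
```

Emitting a row for every QID, whether or not the article exists, keeps the "0" cases explicit instead of leaving them to be inferred from missing rows.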
CMyrick-WMF updated the task description.

Regarding productionization, we now have an engineering request: T390104