
Develop Metrics for the Language Gap: Propose language metric(s) to be included in knowledge gap index
Closed, Resolved (Public)

Description

Q2 Goal: Based on current language metrics compiled thus far, make recommendations for metric(s) to include in the Knowledge Gaps Index metrics (i.e., per wiki; distribution across buckets)

Context:

Event Timeline

CMyrick-WMF changed the task status from Open to In Progress. (Oct 8 2024, 4:14 PM)
CMyrick-WMF renamed this task from "Develop Metrics for the Language Gap: Prioritize language metrics to be included in knowledge gap index" to "Develop Metrics for the Language Gap: Propose language metric(s) to be included in knowledge gap index". (Oct 8 2024, 4:32 PM)
CMyrick-WMF updated the task description.

Background

From A Taxonomy of Knowledge Gaps for Wikimedia Projects (Second Draft), within the READERS section:

3.1.4 Language. The language gap reflects the different levels of readership depending on readers’ ability to read one or more languages. What languages an individual can read greatly impacts what content is available to them and can introduce greater barriers if they are forced to read content in a language that is less familiar to them. Surveys have been conducted to estimate readers’ literacy [141, 85, 52, 21] suggesting that certain languages have highly-literate readers. For example, languages that are specific to one country show high levels of literacy amongst readers. In contrast, other languages such as English or French, which are more strongly associated with colonialism, have many readers for whom English / French is not their native language [141]. In order to address this issue, Simple English Wikipedia was introduced, using a simpler grammar and a limited vocabulary. While improving readability in comparison to English Wikipedia, research has shown that its level is still not ideal for readers with limited language literacy [36]. Other initiatives attempting to bridge this gap aim at making access to content in one’s local language by growing under-represented languages such as Scribe [237], the GapFinder tool [221], Content Translation [218], or the Growing Local Language Content on Wikipedia initiative [223].

(p.11)

From A Taxonomy of Knowledge Gaps for Wikimedia Projects (Second Draft), within the CONTRIBUTORS section:

4.1.4 Language. The language gap is the difference between an individual’s fluency in a language and how likely they are to contribute to Wikimedia sites. Surveys have been conducted to estimate contributors’ literacy or language skills [21, 52, 79, 131, 69, 119] and the Babel system [255] is widespread on user talk pages and offers an alternative to understanding the fluency of contributors. Though it may feel intuitive that fluency would be required to contribute, lowering the barrier to contribution by lower-fluency individuals can be important for effective patrolling in small wikis [155], increase the diversity of contributors, and allow for the cross-pollination of content that might otherwise remain locked up in other languages [55]. Many editors are multilingual and contribute to Wikipedia in a variety of languages, with small wikis heavily depending on multilingual editors and English the most common second-language outside of one’s native language [69, 55, 52]. While reducing language barriers is important, it also brings risks of larger communities overshadowing the contributions of more local contributors as happened recently with Scots Wikipedia [171]. Tools like Scribe [237] have sought to address the language gap by making it easier to contribute in one’s own language even when there are not easier approaches to writing articles like Content Translation [218] available.

(p.17)

From A Taxonomy of Knowledge Gaps for Wikimedia Projects (Second Draft), within the CONTENT section:

5.1.4 Language. The language gap refers to the difference in content coverage across different languages. While each Wikipedia language edition is a stand-alone project, with different size and coverage of relevant topics [88, 30], other projects such as Wikidata and Wikimedia Commons are multilingual by design. However, while Wikimedia Commons is used across many languages [122], its captions and descriptions are available mainly in English. Wikidata’s labels are also nonuniformly distributed across languages, with only 11 languages holding almost 50% of all language knowledge in Wikidata, English being one of the most prominent ones [103]. Projects such as Structured Data on Commons [238] and Suggested Edits [230] aimed at rehauling the projects’ interface to make the translation efforts on Commons easier and more effective.

(p.23)

Weekly update:

  • Scheduled time on Monday to discuss with Knowledge Gaps team.

Miriam triaged this task as Medium priority. (Nov 20 2024, 1:50 PM)

Weekly update:

  • Scheduled additional times to discuss with Knowledge Gaps team members

Weekly update:

  • Had 4 meetings with members of Knowledge Gaps team to solicit feedback and rubberduck, which was extremely helpful!
  • Detailed notes to come, but summary for now:
    • Discussed multiple options for what the dataset could entail, including:
      • dataset which follows current content gap dataset schema
      • dataset which will use third-party data we don't currently have, and/or
      • dataset which includes various gaps related to topics for impact and the coverage of those per language edition.
    • For each of these three options above, drafted multiple variables and multiple use cases.

Weekly update:

Final update:

Finished proposals for new metrics:

  1. Language representation dataset, with Ethnologue, UNESCO, and Glottolog data incorporated
  2. Coverage of vital articles and/or topics for impact
    1. Using standard schema (i.e., one row per wiki_db)
    2. Using new schema (i.e., one row per article, per wiki_db)
  3. Coverage of articles about specific languages
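
To make the difference between the two schemas in proposal 2 concrete, here is a minimal sketch of how per-article rows (one row per article, per wiki_db) could be rolled up into the standard shape (one row per wiki_db). Apart from `wiki_db`, every column name (`vital_article`, `exists`, `vital_covered`, `vital_total`, `coverage`) and every value is a hypothetical placeholder, not the actual dataset schema or real measurements.

```python
from collections import defaultdict

# Hypothetical "new schema": one row per article, per wiki_db.
# All values below are made-up placeholders, not real measurements.
article_rows = [
    {"wiki_db": "enwiki", "vital_article": "Water", "exists": True},
    {"wiki_db": "enwiki", "vital_article": "Earth", "exists": True},
    {"wiki_db": "scowiki", "vital_article": "Water", "exists": True},
    {"wiki_db": "scowiki", "vital_article": "Earth", "exists": False},
]


def rollup_per_wiki(rows):
    """Aggregate per-article rows into one row per wiki_db."""
    totals = defaultdict(lambda: {"covered": 0, "total": 0})
    for row in rows:
        counts = totals[row["wiki_db"]]
        counts["total"] += 1
        counts["covered"] += int(row["exists"])
    return [
        {
            "wiki_db": wiki_db,
            "vital_covered": counts["covered"],
            "vital_total": counts["total"],
            "coverage": counts["covered"] / counts["total"],
        }
        for wiki_db, counts in sorted(totals.items())
    ]


# Hypothetical "standard schema": one row per wiki_db.
wiki_rows = rollup_per_wiki(article_rows)
for row in wiki_rows:
    print(row)
```

The per-article shape can always be aggregated into the per-wiki shape, but not the other way around, so the choice between 2.1 and 2.2 is largely a trade-off between dataset size and the ability to ask which specific articles are missing in a given language edition.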