
Develop Metrics for the Language Gap: Continue compiling data and begin calculations
Open, Needs Triage, Public

Assigned To
Authored By
CMyrick-WMF
Oct 5 2023, 12:29 PM
Referenced Files
F42475911: Screenshot 2024-03-08 at 5.01.11 PM.png
Mar 8 2024, 10:02 PM
F42475908: Screenshot 2024-03-08 at 5.01.02 PM.png
Mar 8 2024, 10:02 PM
F42472094: Screenshot 2024-03-08 at 1.43.55 PM.png
Mar 8 2024, 10:02 PM

Description

We have successfully begun a cross-team project examining Incubator and language representation across Wikimedia projects.

The project includes the following goals:

  • Develop metrics for the state of languages at Wikimedia
  • Develop metrics for better understanding Incubator
  • Develop knowledge gaps metrics for measuring language gaps

This task addresses the goal of developing knowledge gaps metrics for measuring language gaps.

For Q2/Q3, I will

  • Finish integrating primary data via wrangling scripts in GitLab repo
  • Acquire the needed secondary data
  • Integrate secondary data via wrangling scripts in GitLab repo
  • Determine home for the integrated dataset(s) (Hive? keep in GitLab?)
  • Begin calculations for metrics
  • QA

This task depends on task T348249, which must be resolved before the needed secondary data can be acquired.

January 2024 update: Due to the unresolved dependency, this task will carry over across multiple quarters until the blocker is resolved.

Event Timeline

Weekly Update:

  • Created Q2 phab task
  • No update this week

Weekly Update:

  • Brought blocker (see T348249) to the attention of the Language Inclusion working group
  • No other major updates this week.

Weekly Update:

  • No major updates this week.

Weekly Update:

  • No major updates this week.

Weekly update:

  • Incorporated Wikiversity Beta data into wrangling files and output files
  • Added Multilingual Wikisource data file to source data folder
  • Incorporated Multilingual Wikisource data into wrangling files and output files
  • Reran stats in state_of_languages.ipynb

Next week:

  • QA/QC

Weekly update:

QA/QC ongoing, which has resulted in the following updates/corrections/revisions:

  • With Wikiversity Beta and Multilingual Wikisource incorporated into language.ipynb, the output "current_incubator_projects_list.tsv" was changed to "projects.tsv"
  • Incorporated graduation dates, as well as hosting status ("open", "closed") into the projects.tsv file
  • Added distinction between test projects that are first-time test projects in the Incubator, Wikiversity Beta, or Multilingual Wikisource (i.e. "test") vs. test projects that were previously hosted but closed (i.e. "test (closed)").
  • Dealt with multiple issues of duplicate languages and/or language name changes/inconsistencies
  • The state_of_languages.ipynb notebook now reflects these updates; as do all outputs, dependent notebooks, etc.
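
The "test" vs. "test (closed)" distinction above could be sketched roughly as follows. This is only an illustration: the helper name, the `previously_hosted` flag, and the example rows are hypothetical and are not taken from the repo's wrangling code.

```python
def label_test_project(previously_hosted: bool) -> str:
    """Return the status label for a project currently hosted in a test wiki.

    Hypothetical helper: `previously_hosted` is an assumed flag meaning the
    project once had its own hosted wiki that was later closed.
    """
    return "test (closed)" if previously_hosted else "test"


# Made-up example rows, not real project data.
examples = [
    {"project": "Wp/xx", "previously_hosted": False},
    {"project": "Wn/yy", "previously_hosted": True},
]
labels = {row["project"]: label_test_project(row["previously_hosted"]) for row in examples}
print(labels)
```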

Highlights:

  • state_of_languages.csv: allows the user to view which projects (hosted, test, closed, or none) each language has
  • state_of_languages_sortable.ipynb: allows user to sort languages by number of projects (all), number of test projects, number of hosted projects
  • state_of_languages.ipynb: (scroll down to "How many languages....") provides user with stats about "How many languages have X projects?", "How many languages have a Wikipedia?", etc.

Note: QA/QC still underway for everything linked to above! Please do not use these data yet.
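
For readers unfamiliar with the notebooks, the kind of stats listed above ("How many languages have X projects?", "How many languages have a Wikipedia?") could be computed along these lines. This is a minimal sketch using only the Python standard library; the row layout and values are made up for illustration and are not the actual schema of state_of_languages.csv.

```python
from collections import Counter

# Made-up rows (language, project, status); the layout is an assumption.
rows = [
    ("lang_a", "wikipedia", "hosted"),
    ("lang_a", "wiktionary", "test"),
    ("lang_b", "wikipedia", "test"),
    ("lang_c", "wikipedia", "hosted"),
    ("lang_c", "wikisource", "hosted"),
    ("lang_c", "wikiquote", "test (closed)"),
]

# Number of projects (any status) per language.
projects_per_language = Counter(lang for lang, _project, _status in rows)

# "How many languages have X projects?": maps X to a number of languages.
languages_by_project_count = Counter(projects_per_language.values())

# "How many languages have a Wikipedia?" (counting hosted wikis only here).
languages_with_hosted_wikipedia = {
    lang for lang, project, status in rows
    if project == "wikipedia" and status == "hosted"
}

print(dict(languages_by_project_count))
print(len(languages_with_hosted_wikipedia))
```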

Weekly Update:

No major updates this week.

Weekly Update:

No major updates this week.

Weekly updates:

Continued discussion of blocker (T348249) with Language Inclusion working group, and plans for temporary place-holding data

Shared my deliverables (so far) with Language Inclusion working group, mentioned above in T348241#9342085:

  • state_of_languages.csv: allows the user to view which projects (hosted, test, closed, or none) each language has
  • state_of_languages_sortable.ipynb: allows user to sort languages by number of projects (all), number of test projects, number of hosted projects
  • state_of_languages.ipynb: (scroll down to "How many languages....") provides user with stats about "How many languages have X projects?", "How many languages have a Wikipedia?", etc.

Note: QA/QC still underway for everything linked to above! Please do not use these data yet.

Weekly updates:

  • Continued QA
  • Began analysis of top language stats

Weekly update:

Met with Language Inclusion working group; discussed current status of work in relation to the current dependencies.

No additional updates.

Weekly update:

Continued baseline stats for top 20 languages: state_of_languages.ipynb

  • See "Top Languages breakdowns" section
  • Contains overview of which of the top 20 languages have Wp, Wt, Wb, Ws, Wn, Wq, Wv, and Wy

Next week I will add

  • For the top languages, stats about the size of the projects (e.g. number of Wp articles for top 20 languages)
  • For the top languages, plot number of speakers vs number of articles

Weekly update:

New state_of_languages-top20.ipynb file, dedicated to analysis and plotting stats related to top 20 most spoken languages

  • Added stats related to speaker numbers
  • Added stats related to number of Wp articles
  • Plotted number of speakers vs number of Wp articles

Weekly update:

{F41754807}

Weekly update:

  • Worked on updating Incubator analysis notebooks to better work with Incubator data I've compiled and wrangled within the repo
  • Next week will finish the updates and rerun the analyses

Weekly update:

  1. Continued analysis of Incubator data, looking at length of time in Incubator
  2. Continued analysis of top 20 languages, incorporating more product data
    • Wikipedia, Wiktionary, and Wikisource "size" (i.e., number of mainspace pages):
      • Page counts query: wiki_page_counts_query.ipynb (2024-03-08 update: removed this separate query file, and incorporated query within the state_of_languages-top20.ipynb notebook itself)
      • Visualizations: state_of_languages-top20.ipynb (see "Top 20 Languages: Wikipedia Size", "Top 20 Languages: Wiktionary Size", and "Top 20 Languages: Wikisource Size")
    • Unique devices:
      • Unique devices query: wiki_unique_devices_query.ipynb (2024-03-08 update: removed this separate query file, and incorporated query within the state_of_languages-top20.ipynb notebook itself)
      • Visualizations: state_of_languages-top20.ipynb (see "Top 20 Languages: Wikipedia desktop readers" and "Top 20 Languages: Wikipedia mobile readers")

Weekly update:

  1. Now able to query data lake within notebooks running R kernels! 😄
  2. Created new exploratory visualizations for looking at time-spent-in-Incubator:
    • New visualizations (thanks to brainstorming with @KCVelaga_WMF!) in both_current_and_grad_incubator_history_visualizations.ipynb notebook showing distributions of years-taken-to-graduate, based on year project started.
      • See ## PLOT 4 and ## PLOT 5 in notebook
      • Darker greens = shorter amount of time spent in Incubator
      • Viz excludes projects started after 2018 as they would not have all had the chance to graduate in 5 years
      • Please do not use these visualizations outside the current project yet, as they are exploratory and still need another round of QA

Screenshot 2024-03-08 at 1.43.46 PM.png
Screenshot 2024-03-08 at 1.43.55 PM.png

  3. Created new exploratory visualizations for looking at graduation rates for projects with an immediately preceding graduated project of the same language vs. for projects without an immediately preceding graduated project in the same language:

{F42476395} {F42476401}
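
The cohort cut-off used in the years-taken-to-graduate visualizations (excluding projects started after 2018) could be expressed roughly like this. The sketch below assumes Python and uses made-up (start year, graduation year) pairs; the variable names are hypothetical, and the actual notebook runs an R kernel and may be structured differently.

```python
from collections import defaultdict

CUTOFF_START_YEAR = 2018  # projects started after this year are excluded

# Made-up (start_year, graduation_year or None) pairs, not real Incubator data.
projects = [
    (2010, 2012),
    (2010, 2017),
    (2015, None),   # still in the Incubator, so no years-to-graduate value
    (2018, 2019),
    (2021, 2022),   # excluded: started after the cut-off
]

years_to_graduate_by_start_year = defaultdict(list)
for start_year, grad_year in projects:
    if start_year > CUTOFF_START_YEAR:
        continue  # too recent to have had the full window to graduate
    if grad_year is not None:
        years_to_graduate_by_start_year[start_year].append(grad_year - start_year)

print(dict(years_to_graduate_by_start_year))
```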

Weekly update:

  • Began writing and testing queries to incorporate product data (edits, editors, unique devices, page views, etc) into state-of-language analyses and Incubator analyses
  • Shared status updates with multiple working groups

Weekly update:

  • Improved Spark queries for product data needed for language metrics and Incubator analyses
  • Continued Incubator data exploration; the current_incubator_history_visualizations.ipynb file now has four sections:
    • Part 1: Visualizing project size (i.e. how many pages do the current Incubator projects have?)
    • Part 2: Visualizing first and last edits (i.e., how long have current Incubator projects been in the Incubator?)
    • Part 3: Visualizing project size x start date (i.e., looking at projects' number of pages x how long they've been in the Incubator)
    • Part 4: Visualizing monthly edits (i.e., for each project currently in the Incubator, how many edits were made each month?)
  • Lots of new visualizations including the following (Please do not use these visualizations outside the current project yet, as they are exploratory and still need another round of QA)

(Days in Incubator: visualizations show a wide range of days spent in Incubator, similar for each project)

{F43074160}
{F43074383}

(Monthly edits in 2023: visualizations indicate that the majority of projects in Incubator were not actively edited last year)

{F43074548}

Weekly update:

  • Added Metrics section to meta-wiki project page, where quantitative metrics related to the state of languages across Wikimedia projects are drafted, specifically:
    • Key term definitions
    • Table: "Metrics: state of languages"
    • Table: "Metrics: state of top languages"
    • Table: "Metrics: state of threatened languages"
    • Table: "Metrics: language-specific"
  • Tables include the location of notebook, dataset, and/or visualization where each metric is currently calculated or displayed
CMyrick-WMF updated the task description.

Weekly update:

  • Continued QA (focused on wrangling code in the load_dataframe_for_analysis.ipynb notebook); moved code from current_incubator_history.ipynb and incubator_graduate_history.ipynb, and then deleted those notebooks after migrating the code
  • Created a "/tsv" folder; outputs from load_dataframe_for_analysis.ipynb notebook will need to be saved as TSVs to be loaded into analysis/viz notebooks running R kernels
  • After finishing reorganization of the "/incubator" folder, updated the README and added a README in the "/tsv" folder
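
As a rough illustration of the TSV hand-off described above, the snippet below writes rows as tab-separated values. It assumes the wrangling notebook runs Python, which the updates above do not state; the column names, values, and the `write_tsv` helper are hypothetical.

```python
import csv
import io

# Made-up rows; the column names are assumptions for illustration.
rows = [
    {"project": "Wp/xx", "status": "test", "pages": 120},
    {"project": "Wt/yy", "status": "test (closed)", "pages": 15},
]


def write_tsv(rows, handle):
    """Write dict rows as tab-separated values, with a header line."""
    writer = csv.DictWriter(
        handle, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()  # stand-in for a file opened under the "/tsv" folder
write_tsv(rows, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

On the R side, a file written this way can be read with `read.delim`, whose default separator is a tab.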

Weekly update:

  • Continued QA
  • Provided overview of new Metrics tables in the meta-wiki project page at the Connecting the Dots on Language Inclusion working group meeting

Weekly updates:

  • No major updates related to the integration of secondary data sources
  • Met with Movement Insights to brainstorm ideas about productionizing these data
  • For additional updates, see T361640

Weekly updates: