
Develop Metrics for the Language Gap: Assess data needs and begin compiling data
Closed, Resolved (Public)

Description

We have successfully begun a cross-team project examining Incubator and language representation across Wikimedia projects.

The project includes the following goals:

  • Develop metrics for the state of languages at Wikimedia
  • Develop metrics for better understanding Incubator
  • Develop knowledge gaps metrics for measuring language gaps

This task addresses the goal of developing knowledge gaps metrics for measuring language gaps.

As a first step, I need to complete a data needs assessment and begin compiling all primary and secondary data into usable Hive table(s):

  • Determine and document data that we currently have (via MariaDB, Hive, Meta page(s), etc.)
  • Determine and document data that we do not currently have, but will need in order to develop metrics for the language gap
  • Develop a plan for acquiring the needed secondary data
  • Acquire the needed secondary data
  • Add all data-related files (source files and wrangling scripts) to the GitLab repo
  • Combine data into usable Hive table(s)

Event Timeline

CMyrick-WMF updated the task description.

Weekly Update:

Last week and this week have primarily been about planning and coordinating between teams.

  • Meeting between Language team, Research, and @KCVelaga_WMF, where we discussed updates
  • Scheduled a follow-up meeting for July 31

Weekly Update:

  • Meeting between Language team and Research where we discussed updates
    • Language team will look over my proposed RQs and variables and provide feedback
    • Language team will identify variables & associated analysis needed for their work related to Incubator
  • @YLiou_WMF, @Miriam, and I met with UNESCO to discuss WAL data and institutional access

Weekly Update:

  • Biweekly meeting with @KCVelaga_WMF
  • Began outlining the data needed to begin analyses: an inventory of the data we need and the data we have, and how those data will be used in the calculations for the proposed metrics
  • Began restructuring the GitLab repo to be more user-friendly and to better reflect the steps/components of the project; updated the README

Weekly Update:

  • Met with @Hghani to discuss repo reorganization and splitting of scripts into wrangling scripts and analysis scripts
  • Wrote script for joining country names with WMF regions, plus latitude and longitude
  • Wrote script for manually adding language data to the list of closed projects where it was missing because the join with canonical-data.wikis data found no match
  • Added script outputs to 03_wrangled_data folder
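
The country join described above can be sketched as a left join keyed on country name. This is a minimal illustration, not the project's script: the field names (`country`, `wmf_region`, `lat`, `lon`) and the sample rows are hypothetical placeholders, and the real inputs may be keyed differently.

```python
# Hypothetical inputs: names and values are placeholders, not project data.
countries = ["Country A", "Country B", "Country C"]
regions = {"Country A": "Region 1", "Country B": "Region 2"}
coordinates = {"Country A": (10.0, 20.0), "Country C": (-5.5, 30.25)}


def join_country_metadata(countries, regions, coordinates):
    """Left-join region and coordinates onto each country name.

    Missing values are kept as None so gaps stay visible for manual follow-up.
    """
    rows = []
    for country in countries:
        lat, lon = coordinates.get(country, (None, None))
        rows.append({
            "country": country,
            "wmf_region": regions.get(country),
            "lat": lat,
            "lon": lon,
        })
    return rows


joined = join_country_metadata(countries, regions, coordinates)

# Countries missing a region or coordinates, to be reviewed by hand.
unmatched = [
    row["country"]
    for row in joined
    if row["wmf_region"] is None or row["lat"] is None
]
```

Keeping unmatched rows in the output, rather than dropping them as an inner join would, makes it easier to see which entries still need manual additions.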

Weekly Update:

  • Continued work with @Hghani revising notebooks
    • Finalized data sources for matching prefix codes ("ig" like in "Wp/ig" or "igwiki") with language names ("Igbo")
    • Finalized scripting steps/logic for matching codes and names
    • Continued naming convention work
    • Determined the three core TSVs we want to be able to provide to the public through our data folders in the repo:
      1. incubator legitimate project prefix list (with languages)
      2. incubator graduates list (with dates), and
      3. closed projects list (with dates)
  • Synced with Fabian about migrating the repo to the Research team's GitLab
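
The code-to-name matching mentioned above ("ig" in "Wp/ig" or "igwiki" mapping to "Igbo") might look roughly like the sketch below. The two regular expressions encode assumptions about identifier shapes, the function names are hypothetical, and the lookup table holds only the one pair given in this update; the actual notebooks may work differently.

```python
import re

# Assumption: Incubator prefixes look like "Wp/ig"
# ("W", one project letter, "/", then the language code).
INCUBATOR_PREFIX = re.compile(r"^W[a-z]/([a-z][a-z-]*)$")

# Assumption: database names look like "igwiki"
# (language code followed by a project suffix, with "_" standing in for "-").
DBNAME = re.compile(
    r"^([a-z][a-z_]*?)"
    r"(?:wiki|wiktionary|wikibooks|wikinews|wikiquote"
    r"|wikisource|wikiversity|wikivoyage)$"
)

# Illustrative lookup table; a real one would come from the language name data.
LANGUAGE_NAMES = {"ig": "Igbo"}


def extract_language_code(identifier):
    """Return the language code embedded in a prefix or database name, else None."""
    match = INCUBATOR_PREFIX.match(identifier)
    if match:
        return match.group(1)
    match = DBNAME.match(identifier)
    if match:
        return match.group(1).replace("_", "-")
    return None


def language_name(identifier, names=LANGUAGE_NAMES):
    """Look up the language name for an identifier; None if no match."""
    code = extract_language_code(identifier)
    return names.get(code) if code is not None else None
```

Here `language_name("Wp/ig")` and `language_name("igwiki")` both return "Igbo". A `None` result is a signal for manual review rather than an error, since some identifiers (for example, non-language wikis) will parse to a code that has no language name.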

Weekly update:

Continued work with @Hghani revising notebooks and organizing/inventorying repo

  • Made and reviewed revisions to language.ipynb
  • Determined how to run prior notebooks as part of analysis notebooks
  • Determined what additional filtering we need in place to generate the three core files we need (reminder: those are legitimate incubator prefixes; incubator graduates; and closed projects)
  • Hamid wrote a scraper for scraping Site_creation_log data (used to generate incubator_graduation.tsv)
CMyrick-WMF updated the task description.
CMyrick-WMF removed a project: Epic.

Weekly update:

Continued work with @Hghani revising notebooks and organizing/inventorying repo.

  • Continued QA
  • Researched mismatches and duplicates occurring when joining Incubator data with language name data.
    • Determined reasons
    • Added code to the language.ipynb notebook to account for and/or correct, and document the reasons
    • Provided a Notes section at the end of the language.ipynb notebook with documentation
  • Continued streamlining of code with @Hghani
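
A check of the kind described above, for finding mismatches and duplicates before joining Incubator data with language name data, could be sketched as follows. The function name `audit_join` and the sample codes and names are hypothetical; this is not the code in language.ipynb.

```python
from collections import defaultdict


def audit_join(project_codes, language_rows):
    """Report join problems between project codes and (code, name) rows.

    Returns a tuple (unmatched, duplicated):
      unmatched  - sorted project codes with no language name
      duplicated - project codes mapped to more than one distinct name
    """
    names_by_code = defaultdict(set)
    for code, name in language_rows:
        names_by_code[code].add(name)

    wanted = set(project_codes)
    unmatched = sorted(code for code in wanted if code not in names_by_code)
    duplicated = {
        code: sorted(names)
        for code, names in names_by_code.items()
        if code in wanted and len(names) > 1
    }
    return unmatched, duplicated


# Placeholder data: "xx" has no name, "yy" has two conflicting names.
project_codes = ["ig", "xx", "yy"]
language_rows = [("ig", "Igbo"), ("yy", "Name One"), ("yy", "Name Two")]

unmatched, duplicated = audit_join(project_codes, language_rows)
```

Running the audit before the join keeps row counts predictable: unmatched codes would otherwise produce empty names, and duplicated codes would multiply rows.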

Per https://phabricator.wikimedia.org/T341818, @fkaelin migrated this project's repo to the Research group.

Weekly update:

Continued work with @Hghani revising notebooks and organizing/inventorying repo and QA'ing code.

  • Continued QA and data sleuthing
    • Solved anomalies related to "che"/"ce", "mwr"/"rwr", and "nrf"/"nrm", and documented in the Notes section of language.ipynb
    • Solved anomalies related to "sh"/"hbs", and documented in the Notes section of language.ipynb
    • Solved anomalies related to "bh"/"bih"/"bho", and documented in the Notes section of language.ipynb
  • Revised language.ipynb code to produce the following output files:
    • "current_incubator_project_list.tsv": contains current legitimate incubator projects that have at least 1 page in addition to the landing page
    • "project_languages.tsv": contains all Wikimedia content projects that are organized by language edition, both open and closed, as well as current and past Incubator test projects; variables include language code and language name
  • Began revising the analysis notebooks that use the output files listed above. (The new output files make some of the analysis notebook code redundant.)
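
The rule for "current_incubator_project_list.tsv" (legitimate projects with at least 1 page in addition to the landing page) can be sketched as a simple filter. This assumes a per-project `page_count` that includes the landing page, so the threshold is 2; the field names, function names, and sample prefixes are hypothetical placeholders.

```python
import csv
import io


def current_projects(projects, legitimate_prefixes):
    """Keep legitimate projects that have a landing page plus at least one more page."""
    return [
        project
        for project in projects
        if project["prefix"] in legitimate_prefixes
        and project["page_count"] >= 2  # assumption: count includes the landing page
    ]


def to_tsv(rows, fieldnames):
    """Serialize rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder data, not real Incubator projects.
projects = [
    {"prefix": "Wp/xx", "page_count": 12},  # kept
    {"prefix": "Wp/yy", "page_count": 1},   # landing page only
    {"prefix": "Wp/zz", "page_count": 40},  # not in the legitimate prefix list
]
legitimate_prefixes = {"Wp/xx", "Wp/yy"}

kept = current_projects(projects, legitimate_prefixes)
tsv_text = to_tsv(kept, ["prefix", "page_count"])
```

If the page count excludes the landing page instead, the threshold would be 1 rather than 2.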

Weekly update:

Began working on State of Languages dataset prep and analysis prep:

  • state_of_languages_prep.ipynb: notebook for creating the dataset
    • Incomplete. Still need to factor in coding for Wikiversity Beta projects and Wikisource test projects.
    • Currently written in R. Rewrote it in Python, but the rewrite needs QC before a merge request.
  • state_of_languages.csv: dataset
    • Incomplete. Still need to factor in coding for Wikiversity Beta projects and Wikisource test projects.
  • state_of_languages_sortable.ipynb: sortable and searchable dataset
    • Incomplete. Still need to factor in coding for Wikiversity Beta projects and Wikisource test projects.
    • Option for JupyterLab table or GitLab table.
    • Currently written in R.

Weekly update:

This task was not fully completed due to a blocker on external data availability (T348249). The remaining subtasks have been moved to the Q2 task on language gap metrics: T348241