
Request for Q1 Core Metrics Breakdowns (Content and Relevance)
Closed, ResolvedPublic

Description

Name for main point of contact and contact preference
Caroline Myrick (@CMyrick-WMF)

What teams or departments is this for?
Hypothesis 2.3 team

What are your goals? How will you use this data or analysis?
To use Content and Relevance core metrics breakdowns in internal and external reporting (SDS 2.3)

Our goals include

  • providing Wikipedia-edition level Content metric breakdowns to parallel the type of metrics the public may be familiar with, e.g. via https://stats.wikimedia.org/
  • providing Wikipedia-edition level Content metric breakdowns so that users, user groups, and other affiliates who are engaged with specific language versions of Wikipedia can engage with the Content metric more granularly
  • providing project-level Relevance breakdowns so that users and user groups who are engaged with specific projects can engage with the Relevance metric more granularly

What are the details of your request? Include relevant timelines or deadlines
Is this request urgent or time sensitive?

For Content metric,

  • % new quality biography articles about women & gender-diverse people (T346262) for Q1 <-- Requesting a breakdown of this measurement at the level of individual Wikipedia (language versions)
  • % new quality articles about regions that are underrepresented (T346262) for Q1 <-- Requesting a breakdown of this measurement at the level of individual Wikipedia (language versions)

For Relevance metric,

  • UPDATED: Global monthly unique devices on Wikipedia for Q1 <-- Requesting breakdown of this measurement at the level of individual Wikipedia (language versions)
  • UPDATED: Regional monthly unique devices on Wikipedia for Q1 <-- Requesting breakdown of this measurement at the level of individual Wikipedia (language versions)

Date for requested data: If possible, we would like these data the first week of October (same deadline as when you'll be providing the topline Content and Relevance metrics to stakeholders)

Details

Due Date
Oct 12 2023, 12:00 PM

Event Timeline

CMyrick-WMF updated the task description.

@CMyrick-WMF clarified her request related to Relevance metrics. The original request has been updated.


Decision record for the "gender diverse" classification:

Summary
For Core Annual Plan Metrics Content metric, a classification schema was needed for the gender categories required to measure % new quality articles about women and gender-diverse people. A classification schema has been finalized and is being documented here.

Status
Classification finalized. (See "Decision" below).

Decision-making process

  • Re. the name of the category as "Gender Diverse":
    • Discussions among metrics working group: decision made to name category "Gender Diverse"
  • Re. the schema for classifying "Gender Diverse"
    • Discussions between Miriam and Queering Wikipedia community re. classification schema.
    • Discussions among Research & Decision Science staff (Kate, Miriam, Omari, Maya, Hamid, Caroline) and Community Growth staff (Becky), regarding the classification schema.
    • Determined alignment between Queering Wikipedia feedback, Community Growth methodology, Research methodology (especially Community Insights methodology).

Context and problem statement

For the 2023-2024 Core Annual Plan Metrics, the Content metric includes "% new quality articles about women and gender-diverse people." Because gender identities can be classified in different ways, and because various gender categorizations have been used in Foundation work, a schema was needed to determine how to classify "men", "women", and "gender-diverse" people for Core Annual Plan Metrics calculations.

Data for calculating the "% new quality articles about women and gender-diverse people" come from the Knowledge Gaps (Gender Gap) dataset. In this dataset, gender class is determined by the Wikidata property P21 (sex or gender). The class can take ~40 different values, as listed in this file. As such, a schema is needed for bucketing those classes into "men", "women", and "gender-diverse" for the Core Annual Plan Metrics Content metric.

Risks/Caveats
We may need to revisit this schema as new gender classifications are entered into Wikidata.

Decision
Classification schema for "gender diverse":

--- Note: Under this schema, "men" refers to non-gender-diverse-identifying males and/or men, and "women" refers to non-gender-diverse-identifying females and/or women. As such, "men" excludes gender-diverse-identifying men, e.g. transgender men, and "women" excludes gender-diverse-identifying women, e.g. transgender women.

CASE
    WHEN (category IN ('male', 'cisgender male')) THEN 'Men'
    WHEN (category IN ('female', 'cisgender female')) THEN 'Women'
    ELSE 'Gender_Diverse'
END as gender

Classification schema for "gender diverse" (R version):

# Note: Under this schema, "men" refers to non-gender-diverse-identifying males and/or men, and "women" refers to non-gender-diverse-identifying females and/or women. As such, "men" excludes gender-diverse-identifying men, e.g. transgender men, and "women" excludes gender-diverse-identifying women, e.g. transgender women.

gendata$gender3category[gendata$category!="male" & 
                        gendata$category!="cisgender male" & 
                        gendata$category!="female" & 
                        gendata$category!="cisgender female"] <- "Gender_Diverse"

gendata$gender3category[gendata$category=="male" | 
                        gendata$category=="cisgender male"] <- "Men"

gendata$gender3category[gendata$category=="female" | 
                        gendata$category=="cisgender female"] <- "Women"

gendata$gender3category <- factor(gendata$gender3category, 
                                   levels = c("Men", "Women", "Gender_Diverse") )
Hghani subscribed.

Data requests were changed or are asterisked due to the following caveats:

Relevance metric (change):

Global monthly unique devices by Wikipedia domain was changed to global unique devices for the Wikipedia project family by WMF region. This request was modified because a global per-Wikipedia-domain breakdown contains duplicate data when users visit both the mobile domain and the desktop domain. Duplicates also arise at the aggregate level: a user who visits two different Wikipedia language editions is counted twice. The duplication is a result of the domain-tracking methodology used by WMF: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Unique_Devices/Last_access_solution
Datasets were provided.
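
To illustrate the duplication caveat above, here is a minimal Python sketch. It is only an arithmetic illustration: the device identifiers and domain sets are hypothetical, and the actual unique-devices methodology is cookie-based and does not store device IDs.

```python
# Hypothetical device identifiers per domain, for illustration only.
# The real unique-devices pipeline does not keep per-device IDs.
visits = {
    "en.wikipedia.org": {"d1", "d2", "d3"},
    "en.m.wikipedia.org": {"d2", "d3", "d4"},  # d2, d3 also use desktop
    "fr.wikipedia.org": {"d3", "d5"},          # d3 also reads English
}

# Per-domain unique counts, as a per-domain breakdown would report them.
per_domain_counts = {domain: len(devices) for domain, devices in visits.items()}

# Summing per-domain counts counts shared devices more than once.
naive_total = sum(per_domain_counts.values())

# Deduplicating across domains gives the true number of distinct devices.
deduplicated_total = len(set().union(*visits.values()))

print(per_domain_counts)
print("sum of per-domain counts:", naive_total)
print("distinct devices overall:", deduplicated_total)
```

In this toy data the sum of per-domain counts exceeds the number of distinct devices, which is why the per-domain figures should not be added together.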

Content metric*:

As mentioned in this ticket, newer snapshots of the content gap metrics can change values in previous time buckets. Movement Insights continues to use the original snapshot's data (i.e., a fixed past) as the source of truth and will append new rows for additional monthly metrics, taken from the latest snapshots, to the metric spreadsheet. As a result, the per-domain Wikipedia data uses values from the newer snapshot, because the original snapshot did not contain per-domain data. If archived data for the older snapshot is available, the data will be regenerated from it; if not, the data from the newer snapshot will be provided, given that the changes are small at the aggregate level. Datasets were provided.
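
The "fixed past" approach described above can be sketched roughly as follows. This is an assumption-laden illustration, not the Movement Insights implementation: the function name `append_new_months` and all metric values are hypothetical.

```python
def append_new_months(reported, latest_snapshot):
    """Keep already-reported months unchanged; add only months not yet reported.

    Both arguments map a month string (e.g. "2023-07") to a metric value.
    """
    merged = dict(reported)  # copy so the reported history is never mutated
    for month, value in latest_snapshot.items():
        if month not in merged:
            merged[month] = value
    return merged


# Hypothetical values. Note that the latest snapshot revises 2023-05,
# but the previously reported figure is kept as the source of truth.
reported = {"2023-05": 4.1, "2023-06": 4.3}
latest_snapshot = {"2023-05": 4.2, "2023-06": 4.3, "2023-07": 4.6}

result = append_new_months(reported, latest_snapshot)
print(result)
```

Only the new month is taken from the latest snapshot; earlier months retain their originally reported values even where the newer snapshot disagrees.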

Q1 22-23 baseline data was provided. (Thank you!)

Assigning back to @Hghani for Q1 23-24 data. (Note: July and Aug only, since Sep not yet available)

@CMyrick-WMF do you mind opening a new task for the Q1 data, since the original request was completed within the due date?

I'll be sure to create a new ticket for Q2 data, but I'll keep this one open until the Q1 request here is complete.

Updated ticket description:

  • Added link to corresponding ticket (T346262)
  • Added "Q1" to the data being requested, for clarity and to differentiate from future tickets where I'll be requesting breakdowns for subsequent quarters.
CMyrick-WMF changed Due Date from Oct 2 2023, 4:00 AM to Oct 11 2023, 4:00 AM. (Oct 4 2023, 1:19 PM)

Knowledge gaps snapshots to use for Content metric: https://analytics.wikimedia.org/published/datasets/knowledge_gaps/content_gaps/

  • Use the "2023-07" snapshot for July data
  • Use the "2023-08" snapshot for August data
CMyrick-WMF renamed this task from Request for Core Metrics Breakdowns (Content and Relevance) to Request for Q1 Core Metrics Breakdowns (Content and Relevance). (Oct 4 2023, 1:56 PM)
CMyrick-WMF changed Due Date from Oct 11 2023, 4:00 AM to Oct 12 2023, 12:00 PM.

Movement Insights decided on the following: 1) Backfill the July and August 2023 underrepresented-regions data with the October 6 snapshot, as that data has not yet been reported. Going forward, each new month will be appended to this snapshot to maintain historical consistency. 2) Since the gender classification needs to be recalculated using the schema provided by @CMyrick-WMF, the October 6 snapshot was used with the up-to-date schema to generate all historical and current data. Going forward, each new month will be appended to these snapshots to maintain historical consistency.

Data provided. Thank you!

domain_wiki_devices_2022_2023.tsv

  • Contains unique devices (Relevance Metric) calculations for Q1 2022-23 and Q1 2023-24, at the level of domain (e.g., en.wikipedia.org, en.m.wikipedia.org) and WMF region
  • Q1 2022-23 data includes July, August, and September
  • Q1 2023-24 data includes July, August, and September
  • Cannot add domain numbers together because duplicates exist between mobile and desktop domains

project_family_region_unique_2022_2023.tsv

  • Contains unique devices (Relevance Metric) calculations for Q1 2022-23 and Q1 2023-24, at the level of WMF region, aggregated across all Wikipedias per region
  • Q1 2022-23 data includes July, August, and September
  • Q1 2023-24 data includes July, August, and September

geo_quality_articles_wiki_2022.csv

  • Contains article quality %s (Content Metric) calculations for new quality articles for Q1 2022-23, per geographic category
  • Geographic categories represent the region that the content of the article is associated with, per Wikidata (e.g., Napoleon article = “Northern & Western Europe” geography)
  • Q1 2022-23 data shows average of July, August, and September
    • Average = SUM(new quality articles about the region in July, August, September) / SUM(new quality articles about all regions, including unclassed, in July, August, September) * 100
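
As a worked illustration of the "Average" formula above: it is a pooled ratio (sum of numerators over sum of denominators), which generally differs from the mean of the three monthly percentages. The sketch below uses made-up monthly counts, and the function name `quarterly_percent` is hypothetical.

```python
def quarterly_percent(region_counts, total_counts):
    """Pooled quarterly share: SUM(region) / SUM(all regions) * 100."""
    total = sum(total_counts)
    if total == 0:
        raise ValueError("no new quality articles in the period")
    return 100 * sum(region_counts) / total


# Made-up monthly counts for July, August, September (illustration only).
region_counts = [10, 30, 20]    # new quality articles about one region
total_counts = [100, 400, 500]  # new quality articles, all regions incl. unclassed

pooled = quarterly_percent(region_counts, total_counts)

# For comparison: the unweighted mean of the monthly percentages.
monthly = [100 * r / t for r, t in zip(region_counts, total_counts)]
mean_of_monthly = sum(monthly) / len(monthly)

print("pooled quarterly %:", pooled)
print("mean of monthly %s:", mean_of_monthly)
```

The pooled figure weights each month by its article volume, so a low-volume month with an unusually high share does not dominate the quarterly value.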

geo_quality_articles_wiki_2023.csv

  • Contains article quality %s (Content Metric) calculations for new quality articles for Q1 2023-24, per geographic category
  • Geographic categories represent the region that the content of the article is associated with, per Wikidata (e.g., Napoleon article = “Northern & Western Europe” geography)
  • Q1 2023-24 data shows average of July and August only
    • Average = SUM(new quality articles about the region in July and August) / SUM(new quality articles about all regions, including unclassed, in July and August) * 100

quality_gender_updated_2022.csv

  • Contains article quality %s (Content Metric) calculations for new quality biographies for Q1 2022-23, per gender category
  • Gender categories represent the gender of the person the biography is about (e.g., Napoleon article = “men” gender category)
  • Q1 2022-23 data shows average of July, August, and September
    • Average = SUM(new quality biographies about women and gender-diverse people in July, August, September) / SUM(new quality biographies about all genders, in July, August, September) * 100

quality_gender_updated_2023.csv

  • Contains article quality %s (Content Metric) calculations for new quality biographies for Q1 2023-24, per gender category
  • Gender categories represent the gender of the person the biography is about (e.g., Napoleon article = “men” gender category)
  • Q1 2023-24 data shows average of July and August only
    • Average = SUM(new quality biographies about women and gender-diverse people in July and August) / SUM(new quality biographies about all genders, in July and August) * 100
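
The gender figure combines the three-bucket classification schema from the decision record with the pooled formula above. The following Python sketch shows one way the two steps could fit together; the input rows and counts are invented for illustration, and the function names are hypothetical. As with the ELSE branch of the schema, any value outside the four listed categories (including a missing value) lands in "Gender_Diverse" here, so rows without a gender value would presumably need to be filtered out beforehand.

```python
MEN = {"male", "cisgender male"}
WOMEN = {"female", "cisgender female"}


def classify(category):
    """Bucket a Wikidata P21 label into Men / Women / Gender_Diverse."""
    if category in MEN:
        return "Men"
    if category in WOMEN:
        return "Women"
    return "Gender_Diverse"  # everything else, mirroring the schema's ELSE


def percent_women_and_gender_diverse(rows):
    """rows: iterable of (P21 label, count of new quality biographies)."""
    totals = {"Men": 0, "Women": 0, "Gender_Diverse": 0}
    for category, count in rows:
        totals[classify(category)] += count
    all_genders = sum(totals.values())
    if all_genders == 0:
        raise ValueError("no new quality biographies in the period")
    return 100 * (totals["Women"] + totals["Gender_Diverse"]) / all_genders


# Invented counts pooled over the months in the period (illustration only).
rows = [
    ("male", 70),
    ("female", 25),
    ("non-binary", 3),
    ("transgender female", 2),
]

share = percent_women_and_gender_diverse(rows)
print("% women and gender-diverse:", share)
```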