Calculate Emerging Markets new editor retention within the limits of geolocation data purging
Closed, Invalid · Public

Description

We can only rely on the editors_daily table to have data for the month of the latest mediawiki_history snapshot and the previous one, but with our current definition, new editor retention requires the last 3 months of data.

Approach: We are currently validating this hypothesis for privacy-friendly retention measurement, to explore ways of aggregating and calculating new editor retention at the geo level.

In progress

  • Discussing with Legal whether editor data can be retained for 6 months, to allow analyses of editors who move countries.
  • Using the currently available editors_daily data to determine the proportion of editors who move locations.
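
A minimal sketch of how the proportion of editors who move locations could be estimated from the data that is still available. It assumes the rows have been reduced to (user, country) pairs; these field names and the sample values are illustrative and are not the actual editors_daily schema.

```python
from collections import defaultdict

def proportion_of_movers(rows):
    """Share of editors seen in more than one country.

    rows: iterable of (user, country) pairs (hypothetical shape,
    not the real editors_daily columns).
    """
    countries_by_user = defaultdict(set)
    for user, country in rows:
        countries_by_user[user].add(country)
    if not countries_by_user:
        return 0.0
    movers = sum(1 for seen in countries_by_user.values() if len(seen) > 1)
    return movers / len(countries_by_user)

# Made-up sample: B and D each appear in two countries.
sample_rows = [
    ("A", "IN"), ("A", "IN"),
    ("B", "BR"), ("B", "PT"),
    ("C", "NG"),
    ("D", "US"), ("D", "CA"),
]
print(proportion_of_movers(sample_rows))  # 0.5
```

With only two months of data available at a time, a figure like this would describe movement within that window only.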

Questions:

  • Can data be processed on a rolling basis/more frequently to allow calculation of new editor retention by market without retaining data past the 90 day retention period?
    • Answer: we (Kate, Maya, Irene) discussed what this would take on the product analytics side. Redoing and QAing the queries and calculations would be a significant investment of work, even if the Data Engineering team could update their code to process the data more than once a month and adjust the data scrubbing process. We are rejecting this option for now, as it is not a quick fix.
  • Could we retain a split by emerging vs. existing markets when we scrub country data from an editor?

Event Timeline

Restricted Application changed the subtype of this task from "Deadline" to "Task". · Nov 17 2018, 7:33 PM
nshahquinn-wmf renamed this task from Make calcuation of Global South new editor retention tolerant of the geolocation data purging to Calculate Global South new editor retention within the limits of geolocation data purging. · Nov 30 2018, 7:24 PM
nshahquinn-wmf claimed this task.
nshahquinn-wmf updated the task description.
nshahquinn-wmf moved this task from Triage to Next Up on the Product-Analytics board.
nshahquinn-wmf added a subscriber: nettrom_WMF.
nshahquinn-wmf added a subscriber: kzimmerman.

@kzimmerman also pulling into Next Up as a blocker for November's movement metrics.

This has not blocked metrics in a while; we'll have to address it at some point, but it's not urgent.

nshahquinn-wmf added a subscriber: cchen.

@cchen ran into this again trying to do the September metrics

This is being addressed in T238793

It's important to note that while monthly aggregation will work for most of the metrics on that page (content interactions, new content, monthly edits, and active editors), I'm pretty sure it will not work for new editor retention. In order to effectively calculate new editor retention given the limitations outlined here, it's not sufficient to save counts of new editors from emerging and established markets; we'd have to make lists of those editors and save them beyond the current data retention limits. Based on what's written in T238793, it sounds like we haven't considered that in detail, so I'm reopening this.

Please close this again if I'm missing something.
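
To illustrate the point about counts versus lists, here is a small sketch with made-up data. The editor names, market labels, and function name are hypothetical. Retention per market can only be computed by checking which members of the month-1 cohort edited again, which requires the per-editor list rather than just the totals per market.

```python
# Month 1 cohort: new editors and their market at the time (made-up data).
new_editors_month1 = {"A": "emerging", "B": "emerging", "C": "established"}
# Editors who were active in the later retention window (made-up data).
active_later = {"A", "C", "Z"}

def retention_by_market(cohort, returning):
    """Retention rate per market, given the saved cohort list."""
    totals, retained = {}, {}
    for user, market in cohort.items():
        totals[market] = totals.get(market, 0) + 1
        if user in returning:
            retained[market] = retained.get(market, 0) + 1
    return {m: retained.get(m, 0) / totals[m] for m in totals}

print(retention_by_market(new_editors_month1, active_later))
# {'emerging': 0.5, 'established': 1.0}
```

Had only the counts (2 emerging, 1 established) been saved, there would be no way to tell whether A, C, or Z belonged to the cohort once the country data had been purged.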

@cchen can you add an update on the status of work relating to this?

@nshahquinn-wmf @cchen: Can you please clarify:

Month 1   Month 2   Month 3   Global South Editor
North     North     South     ?
North     South     North     ?
South     North     North     ?
South     North     South     ?
South     South     North     ?
North     South     South     ?

Wondering how geo-categorization changing over time would affect inclusion or exclusion in the metric calculation.

kzimmerman renamed this task from Calculate Global South new editor retention within the limits of geolocation data purging to Calculate Emerging Markets new editor retention within the limits of geolocation data purging. · Oct 19 2021, 5:27 PM
kzimmerman reassigned this task from kzimmerman to cchen.
kzimmerman raised the priority of this task from Medium to High.
kzimmerman updated the task description.

Nov 1, 2021: Editor retention discussions with Christina and Rui
What we discussed

  • Readout from DSE Hackathon geo metrics projects
  • Calculate unique editors across geo instead of project
  • Use user_name instead of user_id, since user_name is global and unique across projects
  • Explore solutions for aggregating editor retention at the geo level: "A thought experiment for privacy-friendly retention measurement"
  • Concerns raised:
    • Editors who move countries will be excluded from the retention numerator, because their repeat edit was not made in the same country as their first edit.
    • Small wikis: editor safety issues
  • How NER is currently calculated: an editor registers, makes a first edit within 30 days of creating the account, and edits again 31-60 days after that first edit.
  • While we think about a new metric for editor retention, or make data available for calculating NER, we need a way to calculate NER in the interim, as the discussions about a new metric could take a while.
  • Try the proposed solution: get a new dataset containing only country-level aggregates of editors, and try using it to calculate geo-level retention 2 months later.
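
The current NER definition noted above can be sketched as a small function. This is an illustration under stated assumptions, not the production query: the function name is hypothetical, dates are treated as calendar days, and both the 30-day and the 31-60 day boundaries are assumed to be inclusive.

```python
from datetime import date, timedelta

def is_retained_new_editor(registered, edit_dates):
    """Classify one editor under the NER definition described above.

    Returns None if the editor is not in the new-editor cohort
    (no edits, or first edit more than 30 days after registration),
    otherwise True/False for whether they edited again 31-60 days
    after their first edit. Boundary inclusivity is an assumption.
    """
    edits = sorted(edit_dates)
    if not edits:
        return None
    first = edits[0]
    if first - registered > timedelta(days=30):
        return None
    window_start = first + timedelta(days=31)
    window_end = first + timedelta(days=60)
    return any(window_start <= d <= window_end for d in edits)

registered = date(2021, 1, 1)
print(is_retained_new_editor(registered, [date(2021, 1, 5), date(2021, 2, 10)]))  # True
print(is_retained_new_editor(registered, [date(2021, 1, 5), date(2021, 1, 20)]))  # False
print(is_retained_new_editor(registered, [date(2021, 3, 1)]))                     # None
```

Because the second-edit window ends 60 days after a first edit that may itself fall up to 30 days after registration, classifying an editor can need up to roughly three months of data, which is where the conflict with the retention limit in the task description comes from.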

How can we arrive at a decision?

  • We need analysis to understand the impact of our assumptions and concerns
  • Look at trends: how predictive is 1-month vs. 6-month retention?
  • Compare the old retention calculation with the proposed solution for the new metric

What is stopping us from taking this on soon?

  • prioritization
  • capacity

Other discussions that could influence this decision

  • Do we need 2-month NER? Does it still match the product department’s needs?
  • Carol, Christina and others are discussing it, but it hasn't gained traction

Follow up discussion with Kate and Irene - vet Christina’s idea

Immediate next steps:

  • Create a temporary table: a copy of editors_daily with country scrubbed and replaced by market
  • Partner with GDI and request that DE prioritize adding a new ‘user_name’ field to editors_daily
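
A minimal sketch of the scrubbing step in the first item above, replacing the country on each row with a coarser market label. The mapping, the field names (`country_code`, `market`), and the function name are all hypothetical; the real classification of countries and the real table schema are not shown in this task.

```python
# Hypothetical country-to-market mapping, for illustration only.
MARKET_BY_COUNTRY = {"IN": "emerging", "NG": "emerging", "US": "established"}

def scrub_country(row, market_by_country=MARKET_BY_COUNTRY):
    """Return a copy of row without the country, labeled by market instead."""
    scrubbed = {k: v for k, v in row.items() if k != "country_code"}
    scrubbed["market"] = market_by_country.get(row.get("country_code"), "unknown")
    return scrubbed

row = {"user_name": "ExampleUser", "country_code": "IN", "edit_count": 3}
print(scrub_country(row))
# {'user_name': 'ExampleUser', 'edit_count': 3, 'market': 'emerging'}
```

The original row is left unchanged, so the scrubbed copy can be written to the temporary table while the source data continues to follow its normal purging schedule.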

Adding @JCarvalho who is creating schema for campaigns-product T289894 and interested in using the global 'user_name' field

Created a copy of editors_daily in mayakpwiki.editors_daily with an additional column, 'region', that holds the economic region of the editor.

Currently working with the Privacy team on retaining the data beyond 90 days, and exploring solutions for aggregating editor retention at the geo level (an experiment in privacy-friendly retention measurement).

Update: Scheduled meeting with Privacy and Security teams to discuss further on Jan 19, 2022.

The meeting has been moved to Feb 24 due to a conflict with APP meetings.
Currently blocked on Legal (Privacy) and Security.

@Mayakp.wiki thank you for the update; we're watching this for T289894

Mayakp.wiki lowered the priority of this task from High to Medium. · Mar 23 2022, 12:52 AM
Mayakp.wiki moved this task from Blocked to Doing on the Product-Analytics (Kanban) board.
Mayakp.wiki removed Mayakp.wiki as the assignee of this task. · Nov 18 2022, 9:45 PM

This will be taken up as part of One Foundation: Metrics that Matter.
Unassigning myself and moving to Backlog so we can reprioritize later.

Emerging vs. established markets division has been deprecated - https://phabricator.wikimedia.org/T316580