Page MenuHomePhabricator

Confirm whether or not the current definition of new_editor_retention is based on global registration and update data glossary
Open, MediumPublic

Description

Our data glossary is not clear whether the new_editor_retention calculated form the new editor table is defined by registration date GLOBALLY or within a specific project. This task is to clarify the current definition based on our code and make sure this is more explicit in the data glossary.

Related:
https://phabricator.wikimedia.org/T313549

GDI has related requests & work:
https://phabricator.wikimedia.org/T310224
https://phabricator.wikimedia.org/T304995

As we begin to slice data further, we want to make sure we are maintaining consistency across definitions or specifying differences in terms.

Event Timeline

The New Editor table (New Editor table update query) pulls data from the MediaWiki history table and filters out accounts created by the system by excluding

WHERE NOT event_user_is_created_by_system

this catches the majority of users who already have an account name. There remain a few edge cases, 85 rows returned from the below query, that require additional consideration to ensure that the table logs only those users that are truly new to the ecosystem.

SELECT user_name, COUNT(*)
FROM new_editors
GROUP BY user_name 
HAVING COUNT(*) > 1

The number is small such that it doesn't impact the definition, but it's good to keep in mind to understand the situation when the edge cases arise in the analysis.
It is not immediately clear how these edge cases are logged as new editors when they begin editing on a new wiki after already having been an editor with a logged-in account.
GDI is looking at new ways to track new editors by user_name account creation date which would bypass this issue.

Additional details:
Code used to calculate New Editor Retention by wiki:

select 
    date_add('month', 2,cast(concat(cohort,'-01') as date)) as month,
    wiki,
    count(*) as new_eds,
    TRY(cast(count(case when "2nd_month_edits" >= 1 then 1 end) as DOUBLE) / cast(count(case when "1st_month_edits" >= 1 then 1 end) as DOUBLE)) as new_editor_retention
from wmf_product.new_editors
where (cohort  like '%2020-%'  or cohort  like '%2021-%' or cohort  like '%2022-%')
group by cohort,wiki
UNION ALL
select 
    date_add('month', 2,cast(concat(cohort,'-01') as date)) as month,
    'all projects' as wiki,
    count(*) as new_eds,
    TRY(cast(count(case when "2nd_month_edits" >= 1 then 1 end) as DOUBLE) / cast(count(case when "1st_month_edits" >= 1 then 1 end) as DOUBLE)) as new_editor_retention
from wmf_product.new_editors 
where (cohort  like '%2020-%' or cohort  like '%2022-%' or cohort  like '%2021-%' )
group by cohort

New Editor - catches the user_name, the date the user_name was created and the first project it was created on as well as the user's id on that project and the number of edits on that project (based on the cohort analysis)...this dataset looks like it works on a local basis. So the New Editor Retention 'all' numbers are useful informed rough estimates on the new_user and their edits on that first project but the data to rely on and dig into should be the local data. It may be that the 'all' and the project_family data may miss further edge cases. For example, if we log a new user on enwiki who then also creates an account on wikivoyage within the same month, we log their activity on enwiki. if they stop editing on enwiki in the second month and do the rest of their editing on wikivoyage are they counted as not retained because the primary log enwiki has zero edits in the first month?

Iflorez renamed this task from Confirm whether or not the current definition of new editors are based on global registration and update data glossary to Confirm whether or not the current definition of new_editor_retention is based on global registration and update data glossary.Jul 22 2022, 9:09 PM
Iflorez updated the task description. (Show Details)
kzimmerman triaged this task as Medium priority.Jul 25 2022, 6:09 PM
kzimmerman moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board.