Page MenuHomePhabricator

Active editor queries incorrectly calculate global registration date
Closed, ResolvedPublic

Description

wmf_product.editor_month groups by user, wiki, and month. There, user_registration refers to the local registration on that specific wiki.

Then, the query that calculates active editors from editor_month attempts to convert this into a global registration date by grouping by user and month and taking the minimum registration date. However, it's gets the minimum local registration date among the wikis where the user was active that month. So if a user registered in Jun 2008 on enwiki, then, in Jan 2024, they "registered" on Wikidata and made some edits, but they didn't also edit on enwiki during that month, they show up in our stats as a new active editor.

The impact should be relatively small, since it only affects users who meet all the following conditions in one month:

  • had their account autocreated on a new wiki (meaning they hadn't previously visited the site while logged in)
  • made 5 or more content edits across all wikis
  • did not make any edits on any wikis where they have an earlier registration date

This affects several different data products, including the active_editors table, which isn't used to calculate the movement metrics but does back our editor metrics Superset dashboard and is used in our active editor query.

Next steps:

  • Correct the active editors query in the movement-metrics repo
  • Check the impact by running the new and old queries for at least the past few years (preliminary results show < 1% impact on topline YoY new active editor monthly metrics)
  • Rebuild the wmf_product.active_editor table (deferred to T333225)
  • Decide what to do about saved active editor values (we don't have a general strategy for how we deal with changes to source data)

Details

TitleReferenceAuthorSource BranchDest Branch
Update February 2024 metricsrepos/movement-insights/movement-metrics!6nshahquinn-wmffeb2024metricsmain
Improve readability of active_editors.sql syntaxrepos/movement-insights/movement-metrics!5nshahquinn-wmfhghani-active_editors_sql_editmain
Customize query in GitLab

Event Timeline

thanks @Hghani ! once this is corrected or confirmed as a caveat/known issue we should document it on Datahub.

nshahquinn-wmf renamed this task from wmf_product.editor_month contains incorrect user_registration field for some editors. to Active editor queries incorrectly calculate global registration date.Feb 28 2024, 10:29 PM
nshahquinn-wmf updated the task description. (Show Details)
nshahquinn-wmf updated the task description. (Show Details)

I changed things to reflect that the issue really lies with the queries we run on editor_month, not with the data in editor_month itself.

JAnstee_WMF moved this task from Incoming to Doing on the Movement-Insights board.

I've pushed an updated version of the active editors query in movement-metrics repo.

Analyzed effect of the bug on metrics from 2022 and 2023.

In summary: New active editor values on a monthly basis dropped by 1-2% (avearge -1.68%) across all months in 2022 and 2023. Top line YoY metrics that we report every month were minimally impacted with an average drop of 0.13% across the months that were reported in 2023 which shows that the effect of the bug in the query is cancelled out when taking the difference YoY.

Calculations are in this sheet.

Movement-insights decided to manually backfill the saved active editor values in editing_metrics.tsv and in our editors google sheet. Active editors, returning active editors, and new active editors were replaced in the tsv and pushed to the movement-metrics repo with a commit comment indicating that the entire history of these 3 columns were regenerated with the corrected active_editors.sql query. The google sheet was also updated and the original was preserved in the named version history as pre_t358684.

Thank you, @Hghani!

Top line YoY metrics that we report every month were minimally impacted with an average drop of 0.7% across the months that were reported in 2023

Could you actually clarify this? Was the 0.7% decrease:

  1. in the absolute difference (e.g. the YoY change was previously -1000 editors, but now it's -1007)
  2. relative in literal percentages (e.g. previously -10%, now it's -10.07%), or
  3. relative in percentage points (e.g. previously -10%, now -10.7%)?

@nshahquinn-wmf

It would be (3) relative in percentage points. For example, the original reported YoY change was -13.1% for in January 2024 for the new active editors metric, but the corrected value is -13.5% which I calculated as a 0.4% difference. The average was then computed of all the percentage point differences for 2023 vs 2022 to assess the impact. As a note, upon triple checking I noticed there was a small formula error in the sheet so the previous average I computed has been adjusted to the following: Topline YoY metrics for new active editors were %-0.13 less on average (not -0.7%). The original comment has been updated.

I hope that clarifies.

@Hghani okay, that makes sense! In the future, I think it's helpful to use "percentage point" (pp) rather "percent" in this situation (i.e. the average difference in the year-over-year values is -0.13pp).

Anyway, I think this is done now. The only item remaining from the description is to regenerate wmf_product.active_editor, but I think we can wait and do that as part of T333225.