
DE1 - Retained editors
Open, High, Public

Description

Definition:

Out of the logged-in users on Wikipedia who made at least one edit in the previous month, the number who also edited during the current month. This includes all edits (including edits to content, talk, and other namespaces) whether or not the edits have been reverted.
(data glossary)
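
As a minimal sketch of the definition above (not the production query), the count can be thought of as the intersection of two monthly editor sets. The function name, the input shape, and the sample users are hypothetical, and the sketch assumes the edits have already been restricted to logged-in, non-bot users:

```python
from collections import defaultdict


def retained_editors(edits, current_month, previous_month):
    """Count users with at least one edit in both months.

    `edits` is an iterable of (user_name, month) pairs, one per edit.
    Assumption of this sketch: the input is already limited to
    logged-in, non-bot users and includes all namespaces.
    """
    users_by_month = defaultdict(set)
    for user, month in edits:
        users_by_month[month].add(user)
    # Retained = edited in the previous month AND in the current month.
    return len(users_by_month[previous_month] & users_by_month[current_month])


# Hypothetical sample data: one tuple per edit.
edits = [
    ("Alice", "2026-02"), ("Alice", "2026-03"),
    ("Bob", "2026-02"),                       # did not return in March
    ("Carol", "2026-03"),                     # new in March, not "retained"
    ("Dana", "2026-02"), ("Dana", "2026-03"), ("Dana", "2026-03"),
]

print(retained_editors(edits, "2026-03", "2026-02"))  # 2 (Alice and Dana)
```

Because editors are counted as a set, additional edits by the same user in a month (Dana here) do not change the result.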

Measurement plan:
Retained Editors - Decision Brief: Developing Metrics along our Online Contributor Pipeline

Data Source:

Does the data exist?
Yes: editor_month (which is sourced from wmf.mediawiki_history)

Baseline:

89.3k (source)
(Average of April 2025 to March 2026)

Target:
107k (20% increase) by July 2027

Error (if applicable):
N/A

Automated Dashboard:
Contributors Metrics dashboard
Objective Level metrics dashboard

Test Kitchen Status:
N/A

Notes

A weekly update to this number was discussed and decided against. This is fundamentally a monthly metric.

Event Timeline

Mayakp.wiki moved this task from Start to Approved on the Metrics-Sprint-2026-2027 board.
Mayakp.wiki updated the task description.
Mayakp.wiki added a subscriber: MaCollins-WMF.

Hey all -- @Mayakp.wiki and I discussed this metric (and the related Active Editors metric). My summary below:

  • Overview: to me, this metric seems like it's in pretty good shape. I have some suggestions for improvement, which I'll lay out below, but I don't see them as blockers to "success".
  • Should there be an error / confidence estimate for this metric? Maya and I agree that there shouldn't be. This is a count (not an estimate) and we don't know of any significant source of data loss so the number itself has no natural confidence interval. With any metric, there's a question of how to interpret changes in the trend (what's "significant"). I think that should be handled via proper contextualization / slices -- i.e. showing year-over-year numbers to handle seasonality, allowing for breakdowns by wiki -- which seem ready from what Maya was showing me.
  • Any major exogenous/confounding events that could impact this metric?: these are counts of editors (not edits) and deduplicated across wikis, so for there to be a confounding factor here, you'd have to have A LOT of individual accounts created/reactivated and used to edit in a recurring way. That seems highly unlikely to happen in a way that's not intended to be captured by the metric. The only thing I can think of is a bot farm. The metric does exclude "bots", but that relies on the user account having "bot" in its username or being added to the bot usergroup, both of which only happen in cases where the bot account is being appropriately declared. For example, as far as I can tell, User:TomWikiAssist (the AI Agent that edited Wikipedia) is not in the bot group because they never went through the appropriate process for requesting permission and while they semi-declared their AI-agentness via "Assist", they didn't follow the "bot" norm. So how to exclude bots that don't follow policy (as a potential confounding factor that would move the metric without being the impact we're actually intending)? I think that gets into how we handle reverts, which I touch on below. All of that said, this is a theoretical thing at this point. As far as I know, we haven't seen mass bot account creation so I think it's worth addressing but again I don't see it as a blocker to a "successful" metric.

Changes to make the metric more robust / consistent (this applies to both Retained Editors and Active Editors):

How we define mobile:

  • One slice of both of these metrics is "mobile" vs. "other". Retained editors is currently defined as logged-in users, 1+ edits, any namespace. As such, "mobile" is (or should be) defined as logged-in users, 1+ edits that happen on mobile, any namespace. Active editors is currently defined as logged-in users, 5+ edits, content namespaces only. As such, "mobile" is defined as the same but with 5+ of those edits happening via mobile. The data for these metrics is derived from the editor_month (internal link) table, which is an aggregation over mediawiki_history that makes it very efficient to compute these metrics. The problem, though, is that the table includes just three edit count fields, none of which has the data for Active Editors on mobile -- i.e. mobile edits just to content namespaces:
    • edit_count: all edits by that user to any namespace. That is what is needed for Retained Editors.
    • content_edit_count: all edits by that user to content namespaces. That is what is needed for Active Editors.
    • mobile_edit_count: edit_count but only the edits made on mobile. That is what is needed for Retained Editors (mobile).
  • This is a simple fix for Active Editors on mobile, though: add a mobile_content_edit_count field as the intersection of mobile edits and content namespaces, and use that instead of mobile_edit_count, which lacks the content namespace filter.
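
To illustrate the proposed field, here is a rough sketch of the aggregation in Python rather than the SQL that actually builds editor_month. The `Edit` record, the `aggregate_editor_month` function, and the choice of namespace 0 as the only content namespace are assumptions made for the example; only the four field names come from the discussion above:

```python
from collections import Counter
from dataclasses import dataclass

# Assumption: namespace 0 stands in for "content namespaces";
# the real list differs from wiki to wiki.
CONTENT_NAMESPACES = {0}


@dataclass(frozen=True)
class Edit:
    user: str
    month: str
    namespace: int
    is_mobile: bool


def aggregate_editor_month(edits):
    """Return {(user, month): Counter} holding the four edit-count fields."""
    rows = {}
    for e in edits:
        counts = rows.setdefault((e.user, e.month), Counter())
        is_content = e.namespace in CONTENT_NAMESPACES
        counts["edit_count"] += 1
        counts["content_edit_count"] += int(is_content)
        counts["mobile_edit_count"] += int(e.is_mobile)
        # Proposed field: mobile edits that are also content-namespace edits.
        counts["mobile_content_edit_count"] += int(is_content and e.is_mobile)
    return rows


# Hypothetical editor: 3 mobile content edits, 3 mobile talk-page edits,
# and 2 desktop content edits in the same month.
sample = (
    [Edit("Alice", "2026-03", 0, True)] * 3
    + [Edit("Alice", "2026-03", 1, True)] * 3
    + [Edit("Alice", "2026-03", 0, False)] * 2
)
counts = aggregate_editor_month(sample)[("Alice", "2026-03")]
print(dict(counts))
```

In this example the editor has 5 content edits and 6 mobile edits, so a `mobile_edit_count >= 5` check would label them an active mobile editor even though only 3 of their content edits were made on mobile. The `mobile_content_edit_count` field is what closes that gap.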

How we handle reverts:

  • Both Retained Editors and Active Editors include editors regardless of whether the editor is blocked. They also count edits regardless of whether they are reverted, deleted/suppressed, or made to pages that are then deleted. Personally, I found it a bit surprising that there is no notion of "constructive" or "good-faith" in this metric -- i.e. we are still counting editors as active if they create an account, go on a vandalism spree, and are blocked. I looked back through the history of these metrics, which brought me to this rationale from 2013. Two things stand out to me from there:
    • They wanted the metric to be stable -- i.e. the data for a given month wouldn't change over time. That's a valid criterion for a metric and is why they advocated for including deleted edits in the definition. Otherwise, a page getting deleted at some point in the future could reduce the active editor count in prior months. I would argue the same logic applies to whether a user is blocked -- i.e. this is a future event that would change past data and we should avoid including it. I also think this makes logical sense -- i.e. a page getting deleted or a user getting blocked does not actually mean their edits weren't valid.
    • They also included reverted edits but don't provide a rationale. I'm almost certain this is because at the time, there was no easy way to determine if an edit was reverted or not. They were initially using the SQL tables to generate this data, the mw-reverted edit tag wasn't introduced until 2020, and the mediawiki_history dumps weren't introduced until 2016. So I think it just was infeasible at the time to remove reverted edits (it would have required a very expensive pass over the history dumps to calculate reverts based on shasums).
  • I do think we should consider filtering out reverted edits now that it's very easy to do. Considerations:
    • Pro: this won't impact the long-term stability of the data (reverts are limited in how far back in time they can be applied data-wise so you won't have future events dramatically changing past data).
    • Pro: this to me is a much better metric definition. As I said before, if someone can't achieve the 1 or 5 edit threshold with non-reverted edits, I'm not sure that we've accomplished our job. I think this is clearer with Retained Editors, which is the 1-edit threshold. If someone makes a single edit in a month and it's reverted, I would argue that they have not been retained. Perhaps it was a good-faith edit, but the lack of non-reverted edits suggests that they were not able to contribute to the projects in that month. With active editors it's a bit hazier (what if they made 5 edits and 1 was reverted?) but in general any threshold is arbitrary and I think, for consistency's sake, we should also make it 5 non-reverted edits. This will also help avoid, e.g., giving someone credit for multiple edits because they got in an edit war with someone else.
    • Pro: this will help with the bot-farm situation. If someone created a lot of accounts to edit Wikipedia in a non-policy-conforming way, these edits would be reverted by the community and that confounder would drop out of our metric accordingly. The "was this user blocked?" filter would have also helped with this, but that introduces long-term instability into the data, so I don't think it is worth that tradeoff.
    • Con: wikistats has an active editors metric that currently has the same definition as what we're using here. If we change our own internal metric, we should also change wikistats for consistency (otherwise folks will be citing different numbers and getting confused as to why). That's a more complicated process because it's a different system (I assume technically it's still a simple change but I don't actually know), and we should then also communicate the change so end-users aren't caught by surprise, which opens up more discussion.
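
A rough sketch of how the non-reverted variant would change who clears each threshold within a single month. The `count_editors` function and the `is_content` / `is_reverted` flags are hypothetical names for illustration; the draft patch may identify reverted edits differently, and the Retained Editors metric additionally requires an edit in the previous month, which is left out here to keep the example short:

```python
def count_editors(edits, min_edits, content_only=False, exclude_reverted=False):
    """Count users in one month whose qualifying edits reach `min_edits`.

    `edits` is an iterable of dicts with the (hypothetical) keys
    "user", "is_content", and "is_reverted", one dict per edit.
    """
    per_user = {}
    for e in edits:
        if content_only and not e["is_content"]:
            continue
        if exclude_reverted and e["is_reverted"]:
            continue
        per_user[e["user"]] = per_user.get(e["user"], 0) + 1
    return sum(1 for n in per_user.values() if n >= min_edits)


def edit(user, is_reverted=False, is_content=True):
    return {"user": user, "is_content": is_content, "is_reverted": is_reverted}


# Hypothetical month of edits:
month_edits = (
    [edit("Alice")] * 4 + [edit("Alice", is_reverted=True)]  # 5 edits, 1 reverted
    + [edit("Bob", is_reverted=True)]                        # single reverted edit
    + [edit("Carol")] * 6                                    # 6 edits, none reverted
)

# 1+ edit threshold (the per-month half of Retained Editors)
print(count_editors(month_edits, 1))                         # 3
print(count_editors(month_edits, 1, exclude_reverted=True))  # 2 (Bob drops out)

# 5+ content edit threshold (Active Editors)
print(count_editors(month_edits, 5, content_only=True))                         # 2
print(count_editors(month_edits, 5, content_only=True, exclude_reverted=True))  # 1
```

Alice is the "5 edits and 1 was reverted" case mentioned above: she counts as active under the current definition but not under the non-reverted one.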

Re-summarizing as this is a large "summary":

  • I think the Retained Editors metric is good for now. I would just make sure that "mobile" for Retained Editors is defined as 1+ edits on mobile, which aligns with how "mobile" on Active Editors is 5+ edits on mobile. No error bars need to be calculated and it looked like the dashboards were in pretty good shape (though Maya can speak to that).
  • I think the complementary Active Editors metric (5+ edits; content namespace) needs to be updated to restrict the "mobile" definition to be content-namespace only as well. This should be an easy fix (draft code below) but would require someone to re-compute that table with the new definition. Tagging @nshahquinn-wmf as the person who seems to have worked on that code the most in case you have insight on level of effort. I can't speak to the level of prioritization for that though.
  • I think it'd be best if we also restricted both Retained Editors and Active Editors to non-reverted-edits only. That's a larger decision and likely larger undertaking as WikiStats should also be updated. I'll defer to you all about how you want to handle that. The technical aspect is simple though (it's also in the draft patch below).

Draft patch of how to update editor_month (I don't have push privileges to that repo so I just demoed it in a fork): https://gitlab.wikimedia.org/repos/movement-insights/sql/-/merge_requests/9/diffs. Once these changes are made to the official Movement Insights repo, someone would have to run the backfill and then update the Active Editors metric code to use mobile_content_edit_count instead of mobile_edit_count.

Oh and forgot to say, but the reason I'm not concerned about making the change around constructive edits quickly is that it doesn't look like it'll have a large impact on the metric. This chart comparing the two approaches (internal doc) shows that Retained Editors and Retained Constructive Editors have very similar trends; Retained Constructive Editors is just about 5% fewer editors.

Aklapper renamed this task from DE1 - Retained editors to DE1 - Retained editors. Thu, May 21, 4:49 PM