
WE1.5.3 Productize Data for Monthly Active Moderator Actions
Open, Needs Triage, Public

Description

The Movement Insights team is aiming to have a monthly active moderator metric operationalized by the end of the fiscal year, so that we can use this metric in reporting towards community health broadly, but also within the Contributor Strategy more specifically. That metric will be defined by the actions marked as must-haves in this sheet.

Ask for Data Engineering: By the end of Q3, we would like all must-have actions productionized in a way that moves them from "Complicated" to "Simple" for Ease of measuring, meaning we need the data set up in tables that we can refer to regularly for reporting. The data is all available already, but manual data pulls are currently required.

Event Timeline

Adding here that the 1) content diffs and 2) edit types pipelines are key to making this workflow easier. Awaiting documentation on these pipelines from @fkaelin.

Looking at the attached sheet, I see the following must-have actions listed as "complicated":

| Action | How common? | Ease of measuring | Where's the data? | Must/Nice to have |
| --- | --- | --- | --- | --- |
| Adding a messagebox to article pages | 3 | Complicated | edit diffs | Must have |
| Adding a maintenance category to a page | 4 | Complicated | edit diffs | Must have |
| Applying a deletion template to a page | 4 | Complicated | | Must have |
| Requesting deletion of a page (e.g. speedy deletion) | 4 | Complicated | | Must have |
| Adding in-line notes (citation needed, etc.) to article pages | 3 | Complicated | edit diffs | Must have |
| Editing a deletion discussion | 3 | Complicated | | Must have |

(Sorry for copying the rows here, this is just to narrow down the Data Engineering request: obviously the original sheet should remain the source of truth!)

I'm unclear on which table(s) "edit diffs" refers to, but I'll make an educated guess that all of the above actions can be measured with either the content diffs or the edit types datasets mentioned earlier.

If that is the case, the work for Data Engineering is to ensure these tables are reliable and available to Movement Insights in the Data Platform, without introducing pipelines or tables beyond those.
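As a rough illustration of the kind of regular reporting query this would unblock, here is a sketch of counting monthly active moderators from an edit-diff-style table. The table and column names (`edit_diffs`, `user_id`, `action_type`, `dt`) are hypothetical stand-ins for whatever the productionized dataset actually exposes, and sqlite3 stands in for the data lake engine:

```python
# Sketch only: counting monthly active moderators from an edit-diff-style table.
# All table/column names and rows are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edit_diffs (user_id TEXT, action_type TEXT, dt TEXT)")
conn.executemany(
    "INSERT INTO edit_diffs VALUES (?, ?, ?)",
    [
        ("alice", "add_maintenance_category", "2025-01-03"),
        ("alice", "apply_deletion_template", "2025-01-15"),
        ("bob", "add_messagebox", "2025-01-20"),
        ("carol", "regular_edit", "2025-01-21"),  # not a must-have moderator action
    ],
)

# Hypothetical labels for the must-have actions from the sheet.
MODERATOR_ACTIONS = (
    "add_messagebox",
    "add_maintenance_category",
    "apply_deletion_template",
)

# Distinct users performing at least one must-have action in the month.
row = conn.execute(
    """
    SELECT COUNT(DISTINCT user_id)
    FROM edit_diffs
    WHERE action_type IN (?, ?, ?)
      AND dt BETWEEN '2025-01-01' AND '2025-01-31'
    """,
    MODERATOR_ACTIONS,
).fetchone()
print(row[0])  # → 2 (alice and bob)
```

The point is only the shape of the query: once the data lands in stable tables, the metric becomes a simple distinct-count over a date range rather than a manual data pull.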

We're currently trying to work with Research in Q3 to define a minimal and pragmatic model of:

  • What are the conditions for a pipeline/table to be considered "productionized", and
  • What kind of support a productionized pipeline/table receives from DPE, and how its maintenance and operations are shared between teams (aka a shared ownership model).

We're targeting content_diff as the primary use case, but naturally this model should apply to any pipeline. So, if I'm not missing anything, the Q3 work can be to define the above standards and apply them to content_diff and edit_types.

Does that sound about right? Currently, we're thinking of this as Essential Work, but tying it to WE1.5 would also de-risk it; we could work out a couple of hypotheses to make all of this happen.

Great to hear. @GGoncalves-WMF, your educated guess is indeed correct. Here is some additional context:

Content diff

  • pipeline code, airflow dag (code), datahub
  • caveats:
    • The research content diff dataset only includes content from the main namespace. A need for other namespaces (e.g. talk pages for newcomer assistance) has come up repeatedly; the production dataset should ideally include all namespaces.
    • The content diff dataset (previously called wikidiff) predates the content history dumps2 dataset. The pipeline has been migrated to use the content history as input (making this a daily dataset too), but there are naming differences that we might want to consolidate (wiki_id vs wiki_db, etc.).
    • The content diff dataset is triggered by the content history DAG sensor, and "inherits" the reconciliation from that state: any subsequent reconciliation applied to the content history dataset is not applied to the content diff. The content diff is thus missing a proper reconciliation mechanism.
    • A content diff dataset is derived from other "content xyz" datasets (e.g. wikitext, HTML), so there should be an equivalent for the eventual HTML dataset (T360794).
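To make the reconciliation caveat concrete, here is a minimal sketch of the gap it creates: revisions added to the content history by late reconciliation never reach the already-generated content diff. The revision IDs are invented for illustration:

```python
# Sketch only: the reconciliation gap. Revision IDs are invented.
# The content history dataset receives late-reconciled revisions,
# but the content diff was generated from an earlier state.
content_history_revs = {101, 102, 103, 104}  # after late reconciliation added rev 104
content_diff_revs = {101, 102, 103}          # generated before rev 104 arrived

# Revisions present upstream but missing downstream.
missing = sorted(content_history_revs - content_diff_revs)
print(missing)  # → [104]
```

A proper reconciliation mechanism for the content diff would amount to detecting such a non-empty difference and regenerating the affected partitions.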

Edit types

  • The edit types library produces a structured representation of the difference between two revisions. It is not new: it can be used online, and there are Cloud VPS services for it (webpage, JSON API). However, generating a data lake edit types dataset was non-trivial; the pipeline depends on the content diff, and generating the edit types is itself a computational challenge. See T351225 for more background.
  • Research has recently made a push to enable this pipeline. The edit types pipeline is in this MR, and there is an edit types DAG branch that runs on a development instance for testing/scaling. We will hold off on deploying this on the research instance until we have discussed with DPE how to proceed.
  • Similar to the content diff itself, the caveat regarding reconciliation applies here too: the edit types DAG is triggered by the partially reconciled content diff DAG.
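For readers unfamiliar with the library, here is a conceptual sketch only of what "a structured representation of the difference between two revisions" means. The stdlib difflib stands in for the real edit types classifier, which categorizes wikitext changes far more richly (templates, references, categories, etc.):

```python
# Conceptual sketch only: turning two revisions into a structured change summary.
# difflib stands in for the actual edit types library; the wikitext is invented.
import difflib

prev = "The sky is blue."
curr = "The sky is blue.\n{{citation needed}}"

changes = {"insert": 0, "delete": 0, "replace": 0}
sm = difflib.SequenceMatcher(None, prev.splitlines(), curr.splitlines())
for op, *_ in sm.get_opcodes():
    if op in changes:
        changes[op] += 1
print(changes)  # → {'insert': 1, 'delete': 0, 'replace': 0}
```

The computational challenge mentioned above comes from running this kind of pairwise revision comparison, at much greater sophistication, across every revision of every wiki.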

Hi @GGoncalves-WMF! Thanks for taking this on! To your comment about currently seeing this as essential work - if possible, I would suggest making this a hypothesis under WE1.5, because this is a strategic request to support the WE1 objective and as such doesn't really fall within the definition of essential work. I'll add you to our slack channel for now, and @ldelench_wmf and myself can help with thinking through what a hypothesis would look like there. Thanks again!

@fkaelin and I just chatted a little more about this, quoting here:

Here are the datasets:

  • research.edit_types_html - HTML data is only available via Research's non-production research.mediawiki_content_html, since 2025-03-01. The DAG is backfilled and running daily.
  • research.edit_types_wikitext - uses research.mediawiki_content_diff as input. Manually "batch" backfilling since 2024-01-01 (arbitrarily chosen; can be extended further). Once the backfill is complete (through Dec 6th), the dataset will be updated daily too.

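Since the wikitext dataset is being batch-backfilled, one practical check when inspecting it is whether any daily partitions are missing from the target range. A sketch, with invented partition dates:

```python
# Sketch only: spotting gaps in a daily backfill. Partition dates are invented.
from datetime import date, timedelta

present = {date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 4)}  # partitions found
start, end = date(2024, 1, 1), date(2024, 1, 5)                   # expected range

missing = [
    d for i in range((end - start).days)
    if (d := start + timedelta(days=i)) not in present
]
print(missing)  # → [datetime.date(2024, 1, 3)]
```

The same check works against the real tables by listing their date partitions instead of a hard-coded set.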
The data can be inspected in these tables; of course, there are caveats that percolate through (only the main namespace for the wikitext content diff, HTML needing both the revision and its parent, which is an unsolved challenge, the backfilling logic, etc.).

My understanding is, then:

  • The requested tables are available for querying in the Data Platform (example) and updated daily, but with caveats on completeness, and generated from non-production Research infrastructure.
    • Notably, I imagine having only the "Main" namespace might be a problem, as some moderation templates might be added to other kinds of pages?
  • Movement Insights should be unblocked for initial exploration of this data, which can happen in parallel with Data Engineering reviewing the pipelines. Perhaps @OSefu-WMF can confirm? In particular, it would be good to get their input on:
    • Which of the caveats or data issues are the highest priority in this context.
    • Whether both tables (edit_types_wikitext and edit_types_html) are needed, or whether wikitext is enough. The reasoning here is that html has one extra dependency table, and possibly a more complex pipeline overall.
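On the namespace question, the concrete check is whether must-have actions occur outside namespace 0 at all; for example, deletion discussions on English Wikipedia live in the project namespace (4). A sketch with invented rows and assumed column names:

```python
# Sketch only: do moderation actions occur outside the main namespace (0)?
# Rows and column names are invented for illustration.
from collections import Counter

rows = [
    {"namespace": 0, "action": "apply_deletion_template"},
    {"namespace": 0, "action": "add_maintenance_category"},
    {"namespace": 4, "action": "edit_deletion_discussion"},  # project namespace
]

by_ns = Counter(r["namespace"] for r in rows)
# Any non-zero count outside namespace 0 means the main-only caveat is a real gap.
print(by_ns[4] > 0)  # → True
```

If a query of this shape against the real data shows non-main-namespace activity for the must-have actions, the main-only limitation becomes a blocker rather than a caveat.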

Overall, my intention here is to de-risk this by seeing if MI can be unblocked early, which will also help inform the scope of what we actually need to do to productionize these tables.

This is a very helpful overview, @GGoncalves-WMF. I responded to you earlier in Slack but am repeating it here:

Overall, we are unblocked for exploration work in Q3. Those are two good questions that we'll have better answers to after some initial scoping of the work in early Jan.

Ahoelzl renamed this task from Productize Data for Monthly Active Moderator Actions to WE1.5.3 Productize Data for Monthly Active Moderator Actions.Jan 28 2026, 5:21 PM
Ahoelzl added a subscriber: AKhatun_WMF.

@OSefu-WMF do you have an update for us? We'd like to get implementation work started soon.

After requirements gathering with Research and other teams (Doc), we are proceeding with streaming HTML and edit types rather than batch-only. We will deliver:

  1. a streaming HTML enrichment (latest revision HTML + diff to parent) from mediawiki.page_change (T360794) and
  2. a streaming edit types dataset (T351225),

both persisted to the data lake. WE1.5.3’s role is to ensure these production datasets meet Movement Insights’ needs for the must-have moderator actions and to unblock WE1.5.4 (@CMyrick-WMF) to compute the monthly active moderator metric and build reporting/dashboards.
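The streaming shape described above can be sketched as follows. A real implementation would consume mediawiki.page_change events from a stream and call out for revision HTML; here an in-memory event and a dict of revision HTML stand in, and all field names beyond the event's revision/parent IDs are assumptions:

```python
# Sketch only: the enrichment step of the streaming HTML pipeline.
# Event fields and the html_store are invented stand-ins for the real
# mediawiki.page_change stream and HTML source.
html_store = {
    100: "<p>old</p>",
    101: "<p>new</p>",
}

def enrich(event):
    """Attach latest-revision HTML and the parent's HTML to a page-change event."""
    rev = event["rev_id"]
    parent = event.get("parent_rev_id")
    return {
        **event,
        "html": html_store[rev],
        "parent_html": html_store.get(parent),  # None for page creations
    }

events = [{"rev_id": 101, "parent_rev_id": 100, "wiki_id": "enwiki"}]
enriched = [enrich(e) for e in events]
print(enriched[0]["parent_html"])  # → <p>old</p>
```

With both the revision and parent HTML attached per event, the downstream streaming edit types dataset (T351225) can compute its diff-to-parent without a separate batch join.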

The target remains April 24th; we will flag if we are at risk and need more time.

Resources: