Productionized Edit Types
Closed, Resolved; Public

Description

Goal

Produce a dataset of edit types for all edits to Wikipedia articles (namespace 0; non-redirect) that is available on HDFS. I can see two approaches that we might want to consider:

  • Batch: monthly Airflow job based on mediawiki_history and mediawiki_wikitext_history that produces edit types for all of last month's edits. This is simpler from an organizational perspective (probably fewer teams involved) but likely harder from a technical perspective.
  • Stream: based on page-change (much like mediawiki.page_outlink_topic_prediction_change.v1 stream for the articletopic-outlink model) that produces the edit types for all edits as they happen and saves them to an event table on HDFS. This feels like the better long-term solution but certainly requires more coordination.

In theory, both are useful for analytics purposes, but the stream could also potentially be used in Products (as input to revert-risk or other models; eventually filters for RecentChanges etc.). Computing in bulk from mediawiki_history is an expensive/slow operation because it requires a lot of shuffling of wikitext. The outlier diffs can be pretty expensive too, so computing each diff individually in a stream helps prevent failures from cascading to affect other diff computations.
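To illustrate the failure-isolation point, here is a minimal sketch of per-event processing in which one bad diff is routed to an error list instead of aborting the rest. The names `process_each` and `compute` are hypothetical and not taken from the enrichment job; in a real deployment an out-of-memory kill may not be catchable this way, so treat this only as the general shape of the idea.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def process_each(
    events: Iterable[Any],
    compute: Callable[[Any], Any],
) -> Tuple[List[Any], List[Dict[str, Any]]]:
    """Apply `compute` to every event independently.

    A failure on one event is recorded and does not stop the others.
    """
    results: List[Any] = []
    errors: List[Dict[str, Any]] = []
    for event in events:
        try:
            results.append(compute(event))
        except Exception as exc:  # one expensive or broken diff stays isolated
            errors.append({"event": event, "error": repr(exc)})
    return results, errors
```

The error list plays the role of an error sink: failed events can be inspected or retried later without blocking the events that succeeded.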

Tasks

  • Batch job:
    • Isolate relevant edits and their associated metadata (easy)
    • Bring together current and previous wikitext pairs for every revision (lots of shuffling)
    • Compute edit types for these wikitext pairs (lots of computation; occasional outlier with huge memory consumption / time)
  • Stream job:
    • Apply edit filters to input page-change stream -- i.e. Wikipedia + namespace 0 + not redirect
    • Fetch current and parent wikitext from API (or perhaps consume from page-change-based stream that already has the current and potentially even parent wikitext?)
    • Compute edit types and add to new stream
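
The edit filter in the stream job (Wikipedia + namespace 0 + not redirect) could be sketched as a simple predicate. The field names used here (`meta.domain`, `page.namespace_id`, `page.is_redirect`) are assumptions about the shape of the page-change event and should be checked against the actual schema; `is_article_edit` is a hypothetical helper name.

```python
from typing import Any, Mapping


def is_article_edit(event: Mapping[str, Any]) -> bool:
    """Return True for edits to non-redirect Wikipedia articles (namespace 0).

    Assumed event layout: event["meta"]["domain"], event["page"]["namespace_id"],
    event["page"]["is_redirect"].
    """
    page = event.get("page", {})
    domain = event.get("meta", {}).get("domain", "")
    return (
        domain.endswith(".wikipedia.org")
        and page.get("namespace_id") == 0
        and not page.get("is_redirect", False)
    )
```

Events that fail the predicate would simply be skipped (or passed through without edit types, depending on the semantics chosen for the output stream).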

Context

There are a number of spaces where I envision this being useful:

  • Large-scale analyses of edit / content dynamics on wiki -- e.g., akin to T334760#8782740 (batch or stream work)
  • Smaller-scale aggregations of data about edits for user-facing tools such as campaign dashboards (e.g., how many references were added by this campaign) or user stat pages (e.g., you've added 10 references this month).
  • As a stream that could be consumed by other LiftWing models to determine if they should be triggered -- e.g., perhaps we eventually have a model for analyzing URLs for fact-checking but that only needs to be triggered if an edit actually inserts/changes a URL on the page; or the readability model only should be triggered when page text changes?

There are still some open questions that we'll have to address:

  • What "edit types" to store? The library can produce a variety of outputs from the very raw to the more refined:
    • Basic: what types of nodes (References, Text, etc.) changed
      • This can also include the specific details of the change – e.g., what part of the Reference changed, which words changed, etc.
    • Refined: high-level categories like edit size, edit difficulty, edit category
  • Depending on the type of input, this also affects whether we use the Simple (and less prone to fail) version of the library vs. the Complex/Structured (and more prone to fail due to memory errors) version of the library.

Status Updates

  • 2026-02 - We will be pursuing this ticket along with T360794: Event stream with latest revision HTML & parent revision HTML diff in order to emit both revision html and 'simple' edit types data to 2 different streams to support more use cases.
  • 2026-03 - edit types dev enrichment job is deployed in dse-k8s, consuming from page_html_change with diff and emitting edit types events to kafka jumbo-eqiad.
  • 2026-03 - edit types event schema: the data model is mostly set, based on conventions defined in T415158. We still need to do some data product and field name bikeshedding, but the shape of the data is not expected to change.

To Do

  • Implement and deploy simple edit types streaming enrichment job - PoC out as of 2026-03
  • Finalize 'simple edit types' event schema
  • Finalize 'simple edit types' stream data product name (html-feature-counts-change)

Done is

  • simple edit types streaming enrichment job is released and producing .v1 events to kafka jumbo-eqiad
  • simple edit type events are being ingested into a _v1 Hive table.

Follow ups

After the productionized edit type stream is released, we will still have some follow ups to do to ensure maintainability of the pipeline. Mostly these will be under T418996: Audit and fix observability (logging and metrics) for pyflink jobs, but there may be other tickets to create. These do not block the resolution of this ticket.

[Done] As of 2026-03-16, there are several #TODO comments in the enrichment pipeline we need to follow up on too.

Details

Related Changes in Gerrit:
Related Changes in GitLab:
| Title | Reference | Author | Source Branch | Dest Branch |
| ----- | --------- | ------ | ------------- | ----------- |
| Development rendering_feature_counts_change | repos/data-engineering/schemas-event-primary!38 | otto | edit_type_naming | master |
| Update edit-type schema | repos/data-engineering/schemas-event-primary!37 | akhatun | akhatun/update-edit-type | master |
| Add edit_type_simple_field descriptions | repos/data-engineering/schemas-event-primary!35 | otto | edit_type_descriptions | master |
| development/html_change/3.1.0 - add revision.content_slots field | repos/data-engineering/schemas-event-primary!34 | otto | page_html_change_content_slots | master |
| Add html based edit type flink enrichment | repos/data-engineering/mediawiki-event-enrichment!118 | akhatun | akhatun/edit-type | main |


Event Timeline

Update on this request: we have asked Data Platform Engineering to prioritize T360794, as that is what we need in order to continue the work on defining moderation actions. This current task will eventually need tackling, but in the spirit of stacking requests, and given that we still need a decision for SDS 1.2.3 recommendations, I'm moving this task to Backlog for Research. We can bring it back to the right column when the need/situation changes.

Update: we're moving this to the Research Freezer as we don't expect to work specifically on this in the next six months. That's not to say progress isn't being made, but we have separate asks for an HTML dataset on the cluster (T380874) and for updating the mwedittypes library to handle HTML diffs (T378617), which we think are important to this Edit Types work covering our needs. When those are resolved, we'll be in a better place to pick this work up.

Leaving Data Engineering tags as I see this as a core dataset from which a lot of derived data could be generated for various stakeholders (as opposed to a more business-case-specific dataset with a clear owner).

We recently decided to pursue both html and edit types in event streams.

There are deadlines for the T410940: WE1.5.3 Productize Data for Monthly Active Moderator Actions dependency, so things are moving fast and I don't have many good docs or links other than Slack threads. Some notes in this doc.

Our current approach will be:

This will get both datasets as tables in the data lake, and also make them reusable for other online use cases later.

There are a few unknowns with this approach, so we may encounter some road blocks that will cause us to change course. Until we reach them, we will move forward!

FYI: until we support a more generalized page_change based reconciliation process, these datasets will be best effort. There will be occasional data loss, but no more than other products already accept.

I'd like to start a bikeshed around the 'edit type' name. Now that we have a better description of what will be in edit types data, I'm not so sure I'd think of this as an 'edit type'. It is more like facts about what happened in the edit: how many links were added or removed, how many messagebox templates were modified or added, etc. It is not classifying the type of the edit, but these facts could be used to make that kind of classification.

Also, many of these 'facts' could be (and are) computed about the actual revision content, for other purposes, e.g. # of ref counts for Attribution API (T417669), or number of wikilinks for input to articlequality model, etc.

I think we can find a name for this that works for both facts about the delta (what changed between revisions because of the edit) and facts about the revision itself.

For 'productionized edit types', we are currently planning to only compute the HTML extracted 'simple edit types'. These only have fact change counts. The more comprehensive edit types output contains actual data changes, e.g. which links were added and removed. This is a bit similar to the intended output of T331399: Create new mediawiki links change streams based on fragment/mediawiki/state/change/page, but html based edit types are (re)parsed out of the html directly, and they include more than just links. All that is to say: when naming, let's remember that there is more than just the 'simple edit types'.

Name bikeshed incoming!

I like the 'fact' concept, but it is a bit broad. ML world sometimes calls these things 'features', which is also broad, but at least it has an established industry understood definition.

In ML world, would a list of links on a page be considered a 'feature'? IIUC, list of links are an input to the article 'outlink' topic prediction model. If so, I might lean towards promoting the use of the word 'feature' at WMF to mean this kind of thing, and also consider using it when naming edit types data.

To me, the term edit types seems not bad: it is identifying the semantic types of edits made in a revision. But I can see how it can easily be thought of as edit classification. From the README:

Edit diffs and type detection for Wikipedia.

I would lean towards expressing that this is a diff, but of qualitative nature (also quantitative!)

Some terms floating in my head: Semantic diff, Edit actions, Structured Edit Summary.

So you could say: “The library produces a structured edit summary (or semantic diff) for each revision, consisting of a list of edit actions (or revision edit actions), each tagged with an edit type.”

For information about a revision (data computed about the revision), I would lean towards "revision metadata", which is already in use in some places (like https://meta.wikimedia.org/wiki/Page_metadata#Revision_metadata), but conceptually the term "metadata" definitely can include computed or ML-generated data that captures attributes or characteristics of a revision.

identifying the semantic types of edits made in a revision

Ah, interesting! This might get confusing because often when people say 'an edit' they mean the thing that caused a new revision to be created. 'Edit types' library is parsing that edit (diff) and producing an output of facts about the edit. 'edits made in a revision' implies that there are multiple edits in one revision, which might conflate the meaning of edit a bit. I see what you mean though, each little tweak or change in the larger Edit could be considered an 'edit'. But, I think to avoid confusion, we should probably avoid naming the little changes also 'edits'.

Candidates put forward for 'facts' about revisions (so far):

  • features
  • facts
  • elements
  • attributes
  • metadata
  • facets

I'd like to argue against metadata: I don't think there is a good enough line between what we might call metadata and data about a revision. For photographs, metadata kind of makes sense, e.g. pixel dimensions, camera stats, etc. It is clear that the 'data' is the actual image bytes, and the metadata is extra stuff that users of the image don't usually use or see.

For wikitext based pages, the fact that semantic/structured data (references, sections, etc.) is mixed inside of prose is an annoying implementation detail. References, categories, embedded images are seen by real users of wiki pages. You could imagine a MediaWiki that had this kind of structured data as product features (e.g. wikidata?), and you might not call it metadata anymore!

I don't think the line between 'data' and 'metadata' is strong enough to make a codified distinction here. 'meta' is context dependent! Which level meta are you at? Who's meta? ;)

(Copied from slack)

I’m liking ‘feature’ more than I originally had. Even though it is an overloaded term (they all are), I like how it is less concrete than words like ‘fact’ or ‘attribute’. It is a little more fuzzy.

E.g. a feature of a mountain could be: it is tall, or has many peaks. Or it could be X feet tall with 5 peaks.  It indicates a ‘detail of interest’, which I think data points we are discussing are.

I also like that I can link to a wikipedia definition of Feature, which I think describes pretty well what we are discussing.

Feature is already used for these data points in articlequality (and elsewhere?)

Even though ‘feature’ is an overloaded term, it isn’t really used otherwise in this particular context (software/data terms). Attribute, property, element, etc. all are.

It may confuse non ML folks at first, but I think with good docs and qualifiers (data features) it could be okay.

Change #1249360 had a related patch set uploaded (by AKhatun; author: AKhatun):

[operations/deployment-charts@master] stream: mediawiki.page_edit_type_simple

https://gerrit.wikimedia.org/r/1249360

Change #1249367 had a related patch set uploaded (by AKhatun; author: AKhatun):

[operations/mediawiki-config@master] stream: mediawiki.page_edit_type_simple.dev0

https://gerrit.wikimedia.org/r/1249367

Change #1249957 had a related patch set uploaded (by AKhatun; author: AKhatun):

[operations/puppet@production] topic: mw-page-edit-type-enrich-next

https://gerrit.wikimedia.org/r/1249957

Change #1249367 merged by jenkins-bot:

[operations/mediawiki-config@master] stream: mediawiki.page_edit_type_simple.dev0

https://gerrit.wikimedia.org/r/1249367

Mentioned in SAL (#wikimedia-operations) [2026-03-10T14:01:03Z] <otto@deploy2002> Started scap sync-world: Backport for [[gerrit:1249367|stream: mediawiki.page_edit_type_simple.dev0 (T351225)]]

Mentioned in SAL (#wikimedia-operations) [2026-03-10T14:02:54Z] <otto@deploy2002> akhatun, otto: Backport for [[gerrit:1249367|stream: mediawiki.page_edit_type_simple.dev0 (T351225)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-03-10T14:12:08Z] <otto@deploy2002> Finished scap sync-world: Backport for [[gerrit:1249367|stream: mediawiki.page_edit_type_simple.dev0 (T351225)]] (duration: 11m 05s)

@AKhatun_WMF related to the naming bikeshed: so far we have been for the most part basing our schema on fragment/mediawiki/state/change/page.

We don't have to. We do want to use our entity data model conventions, but we don't necessarily have to represent this stream as a 'changelog'. If we don't, we don't need fields like changelog_kind, etc. We wouldn't have to pass non edit-type compatible events through (being considered in this review comment).

I think edit types data might be different enough to warrant not modeling this like a page change(log) event.

[Initial Test Version] State of edit type schema and Flink app

sink: development/html_edit_type_simple:2.0.0
source: development/html_change:4.0.0

  1. Considers edit type a changelog event (basing our schema on fragment/mediawiki/state/change/page)
  2. html and html_diff are always deleted before edit type event is emitted (to prevent passing around large data that is not required)
  3. As per initial code review comment, enriches with edit type if content_model==wikitext AND both html and html_diff content_body is present
    • If not, the event is still produced but without the edit-types
    • Note that html and html_diff are still stripped, meaning this event is sort of bare-bones.
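
A rough sketch of the behaviour listed above, using a deliberately simplified, flat event layout. The top-level keys `content_model`, `html`, `html_diff` and the output key `edit_types` are hypothetical stand-ins for wherever these fields live in the real schema, and `compute_edit_types` is a caller-supplied placeholder rather than the library's actual API.

```python
from typing import Any, Callable, Dict


def enrich_event(
    event: Dict[str, Any],
    compute_edit_types: Callable[[str, str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Strip html/html_diff and add edit types when the inputs allow it."""
    out = dict(event)  # shallow copy so the caller's event is not modified

    # html and html_diff are always removed before the event is emitted.
    html = out.pop("html", None)
    html_diff = out.pop("html_diff", None)
    html_body = (html or {}).get("content_body")
    diff_body = (html_diff or {}).get("content_body")

    # Enrich only wikitext content with both bodies present;
    # otherwise the bare event is still produced.
    if out.get("content_model") == "wikitext" and html_body and diff_body:
        out["edit_types"] = compute_edit_types(html_body, diff_body)
    return out
```

Events that do not qualify still come out of the function, just without the `edit_types` key, which matches the "event is still produced" behaviour described in item 3.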
Todo:
  1. With respect to <1> from above, we can change things and not consider edit-type event a change log. (Pending edit-type name bikeshedding and schema changes). This means more fields can be ignored or stripped.
    • If the schema does not contain a field but the field is sent in the event, it is simply ignored. So an explicit .pop(key) is not required, but can be done for completeness and/or to simplify the event.
  2. With respect to <3>, TBD not emitting certain kinds of events (create, delete, etc)
  3. Edit type schema will need a mwedittypes library version field, and possibly a content_language field and a content_model field.
  4. Monitoring [long tail fixes]:
    • Looking for logs (fix debug/error/info)
    • Look for messages in error sink (why errors occurred, bugs?)
    • SLO definition, alerting
Current CR/MRs for Test release:

I think this should be all the CRs, let me know if anything more is required.

Hi @BTullis, I'd need the s3 users created for the mw-page-edit-type-enrich-next pipeline as well (same as https://phabricator.wikimedia.org/T360794#11587133), please and thank you!

Oh and +2 here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1249957

brouberol@cephosd1001:~$ sudo radosgw-admin user create --uid=mw-page-edit-type-enrich-next --display-name="mw-page-edit-type-enrich-next"
{
    "user_id": "mw-page-edit-type-enrich-next",
    "display_name": "mw-page-edit-type-enrich-next",
    "email": "",
    "suspended": 0,
    "max_buckets": 1000,
    "subusers": [],
    "keys": [
        {
            "user": "mw-page-edit-type-enrich-next",
            "access_key": "[REDACTED]",
            "secret_key": "[REDACTED]"
        }
    ],
    "swift_keys": [],
    "caps": [],
    "op_mask": "read, write, delete",
    "default_placement": "",
    "default_storage_class": "",
    "placement_tags": [],
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "user_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "temp_url_keys": [],
    "type": "rgw",
    "mfa_ids": []
}

I added the keys to hieradata/role/common/deployment_server/kubernetes.yaml in the private git repo, so they are now available to helmfile.

Change #1249957 merged by Btullis:

[operations/puppet@production] topic: mw-page-edit-type-enrich-next

https://gerrit.wikimedia.org/r/1249957

Change #1249360 merged by jenkins-bot:

[operations/deployment-charts@master] stream: mediawiki.page_edit_type_simple

https://gerrit.wikimedia.org/r/1249360

Change #1251111 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] dse-k8s-eqiad: add mw-page-edit-type-enrich-next to the flink tenant namespaces

https://gerrit.wikimedia.org/r/1251111

Change #1251111 merged by Brouberol:

[operations/deployment-charts@master] dse-k8s-eqiad: add mw-page-edit-type-enrich-next to the flink tenant namespaces

https://gerrit.wikimedia.org/r/1251111

Change #1251130 had a related patch set uploaded (by AKhatun; author: AKhatun):

[operations/deployment-charts@master] stream: remove unwanted params in edit-type stream

https://gerrit.wikimedia.org/r/1251130

Change #1251130 merged by jenkins-bot:

[operations/deployment-charts@master] stream: remove unwanted params in edit-type stream

https://gerrit.wikimedia.org/r/1251130

Change #1251480 had a related patch set uploaded (by AKhatun; author: AKhatun):

[operations/deployment-charts@master] stream: deploy edit-type stream to production

https://gerrit.wikimedia.org/r/1251480

Change #1251480 merged by jenkins-bot:

[operations/deployment-charts@master] stream: deploy edit-type stream to production

https://gerrit.wikimedia.org/r/1251480

@AKhatun_WMF and I met today to try and make some progress on the 'edit types' event data model, naming and 'changelog' semantics (T351225#11685398).

Here's what we came up with:

The stream name will be something like mediawiki.page_html_feature_counts_change, and it will respect changelog semantics. That is, it will pass through all of the page_change events even if no feature changes were computed.

We struggled with what we might name a future data product that contained precomputed feature counts ( # references, # wikilinks, etc.) if we name this one that contains the 'delta' between revisions. We considered something like mediawiki.page_html_feature_counts_change for the revision state data product name, and mediawiki.page_html_feature_counts_delta (and possibly dropping changelog semantics) for this 'edit type' diff summary. We didn't love it.

Along the way we wondered if we should have named page_change page_changelog instead, to avoid the confusing overloaded meaning of 'change' here.

Anyway, partially to avoid the deliberation on the 'is this a changelog stream, what is the delta stream called, etc.' question, we think we can go with mediawiki.page_html_feature_counts_change for this now, with the intention of also including future precomputed features about the latest page state in the same stream.

So for this task, mediawiki.page_html_feature_counts_change will have fields something like:

delta:
  revision:
    rendering:
      feature_counts:
        wikilinks: 
          inserted: 5
        messagebox:
          removed: 1

In the future, when we want to also emit feature counts about the latest page revision state, we will add a field like:

revision:
  rendering:
    # ...
    feature_counts:
      wikilinks: 3
      messagebox: 2
      # ...

We also discussed how this actually kind of lines up with the delta.revision.rendering.html_diff field we are adding in T360794: Event stream with latest revision HTML & parent revision HTML diff. The delta convention is a different and potentially more compressed way of representing the prior_state. delta.revision.rendering.feature_counts can be reverse applied to a hypothetical TODO revision.rendering.feature_counts field to get the feature_counts for the prior_state.
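
As a worked illustration of "reverse applying" the delta, the sketch below derives prior-state counts from current counts and a delta. It assumes that only `inserted` and `removed` entries change a feature's total (other kinds of change, such as modifications, are assumed to leave the count unchanged), and `prior_feature_counts` is a hypothetical helper, not part of any existing library. The numbers in the usage example are made up for illustration.

```python
from typing import Dict, Mapping


def prior_feature_counts(
    current: Mapping[str, int],
    delta: Mapping[str, Mapping[str, int]],
) -> Dict[str, int]:
    """Undo a feature_counts delta to recover the prior revision's counts.

    prior = current - inserted + removed, per feature.
    """
    prior = dict(current)
    for feature, changes in delta.items():
        count = (
            prior.get(feature, 0)
            - changes.get("inserted", 0)
            + changes.get("removed", 0)
        )
        if count < 0:
            raise ValueError(f"delta for {feature!r} is inconsistent with current counts")
        prior[feature] = count
    return prior


current = {"wikilinks": 8, "messagebox": 2}
delta = {"wikilinks": {"inserted": 5}, "messagebox": {"removed": 1}}
print(prior_feature_counts(current, delta))  # {'wikilinks': 3, 'messagebox': 3}
```

This is why the delta can act as a compact alternative to carrying a full prior_state: given the current counts, the prior counts are recoverable by simple arithmetic.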

If we settle on this we will certainly update docs at Event_Platform/Schemas/Guidelines#Modeling_state_changes.

Change #1255017 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] mw-page-edit-type-enrich-next - increase taskmanager replicas while we backfill

https://gerrit.wikimedia.org/r/1255017

Change #1255017 merged by jenkins-bot:

[operations/deployment-charts@master] mw-page-edit-type-enrich-next - increase taskmanager replicas while we backfill

https://gerrit.wikimedia.org/r/1255017

Change #1259186 had a related patch set uploaded (by AKhatun; author: AKhatun):

[operations/deployment-charts@master] stream: mw-page-edit-type-enrich-next

https://gerrit.wikimedia.org/r/1259186

Change #1259186 merged by jenkins-bot:

[operations/deployment-charts@master] stream: mw-page-edit-type-enrich-next

https://gerrit.wikimedia.org/r/1259186

Change #1260060 had a related patch set uploaded (by AKhatun; author: AKhatun):

[operations/deployment-charts@master] stream: mw-page-edit-type-enrich-next

https://gerrit.wikimedia.org/r/1260060

Change #1260060 merged by jenkins-bot:

[operations/deployment-charts@master] stream: mw-page-edit-type-enrich-next

https://gerrit.wikimedia.org/r/1260060

Change #1260091 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/mediawiki-config@master] EventStreamConfig - Increase spark_job_ingestion_scale for larger event streams

https://gerrit.wikimedia.org/r/1260091

Change #1260091 merged by jenkins-bot:

[operations/mediawiki-config@master] EventStreamConfig - Increase spark_job_ingestion_scale for larger event streams

https://gerrit.wikimedia.org/r/1260091

Mentioned in SAL (#wikimedia-operations) [2026-03-25T13:42:01Z] <otto@deploy2002> Started scap sync-world: Backport for [[gerrit:1260091|EventStreamConfig - Increase spark_job_ingestion_scale for larger event streams (T360794 T351225)]]

Mentioned in SAL (#wikimedia-operations) [2026-03-25T13:44:21Z] <otto@deploy2002> otto: Backport for [[gerrit:1260091|EventStreamConfig - Increase spark_job_ingestion_scale for larger event streams (T360794 T351225)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-03-25T13:49:49Z] <otto@deploy2002> Finished scap sync-world: Backport for [[gerrit:1260091|EventStreamConfig - Increase spark_job_ingestion_scale for larger event streams (T360794 T351225)]] (duration: 07m 48s)