Goal
Produce a dataset of edit types for all edits to Wikipedia articles (namespace 0; non-redirect) that is available on HDFS. I can see two approaches that we might want to consider:
- Batch: monthly Airflow job based on mediawiki_history and mediawiki_wikitext_history that produces edit types for all of last month's edits. This is simpler from an organizational perspective (probably fewer teams involved) but likely harder from a technical perspective.
- Stream: based on page-change (much like the mediawiki.page_outlink_topic_prediction_change.v1 stream for the articletopic-outlink model) that produces the edit types for all edits as they happen and saves them to an event table on HDFS. This feels like the better long-term solution but certainly requires more coordination.
In theory, they're both useful for analytics purposes, but the stream could also potentially be used in Products (as input to revert-risk or other models; eventually filters for RecentChanges etc.). Computing in bulk from mediawiki_history is also an expensive/slow operation because it requires shuffling a lot of wikitext. The outlier diffs can be pretty expensive too, so computing each diff individually in a stream helps prevent failures from cascading to other diff computations.
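To illustrate the failure-isolation point, here is a minimal sketch of computing each diff on its own so that one outlier cannot take down the others. Everything here is an assumption for illustration: `DiffResult`, `compute_isolated`, and the `compute_edit_types` callable are hypothetical names, not the actual job code, and a real deployment would also need timeouts and process-level memory limits, which an in-process `try`/`except` cannot provide.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class DiffResult:
    rev_id: int
    edit_types: Optional[dict]  # None when the diff failed
    error: Optional[str]        # short failure description, None on success


def compute_isolated(
    revisions: Iterable[Tuple[int, str, str]],
    compute_edit_types: Callable[[str, str], dict],
) -> List[DiffResult]:
    """Compute edit types one revision at a time.

    Each item is (rev_id, parent_wikitext, current_wikitext). The
    compute_edit_types callable is a hypothetical stand-in for the
    edit types library. A failure is recorded and skipped rather than
    aborting the remaining revisions.
    """
    results = []
    for rev_id, parent_wikitext, current_wikitext in revisions:
        try:
            edit_types = compute_edit_types(parent_wikitext, current_wikitext)
            results.append(DiffResult(rev_id, edit_types, None))
        except Exception as exc:  # record the failure and move on
            results.append(DiffResult(rev_id, None, f"{type(exc).__name__}: {exc}"))
    return results
```

In a batch setting the same outlier would typically fail a whole task and force a retry of every diff scheduled with it, which is the cascading behaviour the stream approach avoids.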
Tasks
- Batch job:
- Isolate relevant edits and their associated metadata (easy)
- Bring together current and previous wikitext pairs for every revision (lots of shuffling)
- Compute edit types for these wikitext pairs (lots of computation; occasional outlier with huge memory consumption / time)
- Stream job:
- Apply edit filters to input page-change stream -- i.e. Wikipedia + namespace 0 + not redirect
- Fetch current and parent wikitext from API (or perhaps consume from page-change-based stream that already has the current and potentially even parent wikitext?)
- Compute edit types and add to new stream
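The stream-side edit filter could be sketched roughly as below. This is a hedged illustration, not the deployed job: the field names (`meta.domain`, `page.namespace_id`, `page.is_redirect`) are my assumption about the page-change event shape and should be checked against the actual schema, and matching on a `.wikipedia.org` domain suffix is just one assumed way to identify Wikipedia projects.

```python
def is_relevant_edit(event: dict) -> bool:
    """Return True for edits to non-redirect Wikipedia articles (namespace 0).

    Field names are assumed for illustration; verify against the real
    page-change schema before relying on them.
    """
    page = event.get("page") or {}
    domain = (event.get("meta") or {}).get("domain") or ""
    return (
        domain.endswith(".wikipedia.org")       # Wikipedia projects only
        and page.get("namespace_id") == 0       # article namespace
        and not page.get("is_redirect", False)  # skip redirects
    )


# Example: keep only the relevant events from a small batch.
events = [
    {"meta": {"domain": "en.wikipedia.org"},
     "page": {"namespace_id": 0, "is_redirect": False}},
    {"meta": {"domain": "en.wikipedia.org"},
     "page": {"namespace_id": 1, "is_redirect": False}},  # talk page
    {"meta": {"domain": "www.wikidata.org"},
     "page": {"namespace_id": 0, "is_redirect": False}},  # not Wikipedia
]
relevant = [e for e in events if is_relevant_edit(e)]
```

Keeping the filter as a pure predicate means it can be unit tested without a running stream and reused in the batch job's "isolate relevant edits" step.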
Context
There are a number of spaces where I envision this being useful:
- Large-scale analyses of edit / content dynamics on wiki -- e.g., akin to T334760#8782740 (batch or stream work)
- Smaller-scale aggregations of data about edits for user-facing tools such as campaign dashboards (e.g., how many references were added by this campaign) or user stat pages (e.g., you've added 10 references this month).
- As a stream that could be consumed by other LiftWing models to determine if they should be triggered -- e.g., perhaps we eventually have a model for analyzing URLs for fact-checking that only needs to be triggered if an edit actually inserts/changes a URL on the page; or the readability model should only be triggered when page text changes.
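As a rough sketch of the model-triggering idea, a downstream consumer could gate on the edit types payload like this. The payload shape (node type mapped to per-action counts) and the node/action names such as `ExternalLink`, `insert`, and `change` are assumptions for illustration only; the real event schema is still being finalized and may look different.

```python
def should_trigger(edit_types: dict, node_types: set,
                   actions: tuple = ("insert", "change")) -> bool:
    """Return True if any watched node type was touched by a watched action.

    edit_types is assumed to look like {"ExternalLink": {"insert": 1}},
    i.e. node type -> action -> count. This shape is hypothetical.
    """
    return any(
        edit_types.get(node, {}).get(action, 0) > 0
        for node in node_types
        for action in actions
    )


# Example: a hypothetical URL fact-checking model only cares about
# inserted or changed external links.
edit = {"Text": {"change": 3}, "ExternalLink": {"insert": 1}}
run_url_model = should_trigger(edit, {"ExternalLink"})
run_on_removal_only = should_trigger({"ExternalLink": {"remove": 1}}, {"ExternalLink"})
```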
There are still some open questions that we'll have to address:
- What "edit types" to store? The library can produce a variety of outputs from the very raw to the more refined:
- Basic: what types of nodes (References, Text, etc.) changed
- This can also include the specific details of the change – e.g., what part of the Reference changed, which words changed, etc.
- Refined: high-level categories like edit size, edit difficulty, edit category
- Depending on the type of input, this also affects whether we use the Simple (and less prone to fail) version of the library vs. the Complex/Structured (and more prone to fail due to memory errors) version of the library.
Status Updates
- 2026-02 - We will be pursuing this ticket along with T360794: Event stream with latest revision HTML & parent revision HTML diff in order to emit both revision HTML and 'simple' edit types data to two different streams to support more use cases.
- 2026-03 - edit types dev enrichment job is deployed in dse-k8s, consuming from page_html_change with diff and emitting edit types events to kafka jumbo-eqiad.
- 2026-03 - edit types event schema: the data model is mostly set based on conventions defined in T415158. We need to do some data product and field name bikeshedding, but the shape of the data is not expected to change.
To Do
- Implement and deploy simple edit types streaming enrichment job - PoC out as of 2026-03
- Finalize 'simple edit types' event schema
- Finalize 'simple edit types' stream data product name (html-feature-counts-change)
Done is
- simple edit types streaming enrichment job is released and producing .v1 events to kafka jumbo-eqiad
- simple edit type events are being ingested into a _v1 Hive table.
Follow ups
After the productionized edit type stream is released, we will still have some follow ups to do to ensure maintainability of the pipeline. Mostly these will be under T418996: Audit and fix observability (logging and metrics) for pyflink jobs, but there may be other tickets to create. These do not block the resolution of this ticket.
[Done] As of 2026-03-16, there are several #TODO comments in the enrichment pipeline we need to follow up on too.