Goal
Produce a dataset, available on HDFS, of edit types for all edits to Wikipedia articles (namespace 0; non-redirect). I see two approaches we might want to consider:
- Batch: monthly Airflow job based on mediawiki_history and mediawiki_wikitext_history that produces edit types for all of last month's edits. This is simpler from an organizational perspective (fewer teams probably involved) but likely harder from a technical perspective.
- Stream: based on page-change (much like the mediawiki.page_outlink_topic_prediction_change.v1 stream for the articletopic-outlink model) that produces the edit types for all edits as they happen and saves them to an event table on HDFS. This feels like the better long-term solution but certainly requires more coordination.
In theory, both are useful for analytics purposes, but the stream could also potentially be used in products (as input to revert-risk or other models; eventually filters for RecentChanges, etc.). Computing in bulk from mediawiki_history is an expensive/slow operation because it requires a lot of shuffling of wikitext, and outlier diffs can be pretty expensive too, so computing each diff individually in a stream keeps a single failure from cascading into other diff computations.
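To make that isolation concrete, here is a minimal sketch of per-diff failure containment for the batch case. The names here (`compute_edit_types_safely`, `diff_fn`, `max_bytes`) are illustrative, not the edit-types library's API; `diff_fn` stands in for whatever function the library exposes:

```python
def compute_edit_types_safely(diff_fn, rev_pairs, max_bytes=1_000_000):
    """Apply diff_fn to each (rev_id, prev, curr) wikitext triple,
    isolating failures.

    Oversized revisions are skipped up front (outlier diffs can blow up
    memory/time), and any per-pair exception is recorded instead of
    propagating, so one bad diff cannot take down the whole job.
    """
    results = []
    for rev_id, prev, curr in rev_pairs:
        if len(prev) > max_bytes or len(curr) > max_bytes:
            results.append((rev_id, None, "skipped: too large"))
            continue
        try:
            results.append((rev_id, diff_fn(prev, curr), None))
        except Exception as exc:  # isolate any single diff failure
            results.append((rev_id, None, f"error: {exc}"))
    return results
```

A real batch job would likely add a per-diff timeout as well (e.g., running each diff in a separate worker process), but the error-recording pattern is the same.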
Tasks
- Batch job:
  - Isolate relevant edits and their associated metadata (easy)
  - Bring together current and previous wikitext pairs for every revision (lots of shuffling)
  - Compute edit types for these wikitext pairs (lots of computation; occasional outlier with huge memory consumption / time)
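The pairing step can be sketched in plain Python. In practice this would be a Spark job (e.g., a window/LAG over mediawiki_wikitext_history partitioned by page, which is where the shuffle cost comes from), but the logic is the same:

```python
from itertools import groupby
from operator import itemgetter

def pair_revisions(revisions):
    """Given (page_id, rev_timestamp, rev_id, wikitext) rows, emit
    (rev_id, parent_wikitext, current_wikitext) pairs per page.

    Rows are grouped by page and sorted by timestamp so each revision
    can be matched with its parent's wikitext; page creations diff
    against the empty string.
    """
    rows = sorted(revisions, key=itemgetter(0, 1))
    pairs = []
    for _, revs in groupby(rows, key=itemgetter(0)):
        prev_text = ""
        for _, _, rev_id, text in revs:
            pairs.append((rev_id, prev_text, text))
            prev_text = text
    return pairs
```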
- Stream job:
  - Apply edit filters to the input page-change stream -- i.e. Wikipedia + namespace 0 + not redirect
  - Fetch current and parent wikitext from the API (or perhaps consume from a page-change-based stream that already has the current and potentially even parent wikitext?)
  - Compute edit types and add to a new stream
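As a sketch, the filtering step might look like the following. The field names (`database`, `page.namespace_id`, `page.is_redirect`) are assumptions loosely modeled on the page_change event schema, not verified against it:

```python
def is_relevant(event):
    """Return True for edits we want edit types for: a Wikipedia,
    the article namespace (0), and not a redirect.

    NOTE: field names are assumed, not taken from the actual schema.
    """
    page = event.get("page", {})
    return (
        # Crude Wikipedia check: a real job would match against a
        # canonical list of Wikipedia dbnames (enwiki, dewiki, ...),
        # since e.g. commonswiki also ends in "wiki".
        event.get("database", "").endswith("wiki")
        and page.get("namespace_id") == 0
        and not page.get("is_redirect", False)
    )
```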
Context
There are a number of areas where I envision this being useful:
- Large-scale analyses of edit / content dynamics on wiki -- e.g., akin to T334760#8782740 (batch or stream work)
- Smaller-scale aggregations of data about edits for user-facing tools such as campaign dashboards (e.g., how many references were added by this campaign) or user stat pages (e.g., you've added 10 references this month).
- As a stream that could be consumed by other LiftWing models to determine if they should be triggered -- e.g., perhaps we eventually have a model for analyzing URLs for fact-checking, but that only needs to be triggered if an edit actually inserts/changes a URL on the page; or the readability model should only be triggered when the page text changes.
There are still some open questions that we'll have to address:
- What "edit types" to store? The library can produce a variety of outputs, from the very raw to the more refined:
  - Basic: what types of nodes (References, Text, etc.) changed
    - This can also include the specific details of the change -- e.g., what part of the Reference changed, which words changed, etc.
  - Refined: high-level categories like edit size, edit difficulty, edit category
- The choice of output also affects whether we can use the Simple version of the library (less prone to fail) or need the Complex/Structured version (more prone to fail due to memory errors).
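One way to get the Structured detail where possible without giving up robustness is a fallback pattern. This is a minimal sketch: `structured_fn` and `simple_fn` stand in for the two versions of the library, and the real API is not assumed here:

```python
def edit_types_with_fallback(structured_fn, simple_fn, prev, curr):
    """Try the Structured (detailed but memory-hungry) analysis first,
    falling back to the Simple analysis if it fails for any reason.

    Returns a dict recording which level of detail was produced, so
    downstream consumers can tell the two apart.
    """
    try:
        return {"detail": "structured", "result": structured_fn(prev, curr)}
    except Exception:  # MemoryError, parse failures, etc.
        return {"detail": "simple", "result": simple_fn(prev, curr)}
```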