
[Research Engineering Request] Productionized Edit Types
Open, Low, Public

Description

Goal

Produce a dataset of edit types, available on HDFS, for all edits to Wikipedia articles (namespace 0; non-redirect). I can see two approaches that we might want to consider (a sketch of one possible output record follows the list):

  • Batch: a monthly Airflow job based on mediawiki_history and mediawiki_wikitext_history that produces edit types for all of the previous month's edits. This is simpler from an organizational perspective (probably fewer teams involved) but likely harder from a technical perspective.
  • Stream: a job based on the page-change stream (much like the mediawiki.page_outlink_topic_prediction_change.v1 stream for the articletopic-outlink model) that produces edit types for all edits as they happen and saves them to an event table on HDFS. This feels like the better long-term solution but certainly requires more coordination.
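
To make the goal concrete, here is one possible shape for a single output record. The field names are illustrative assumptions, not a settled schema (what exactly to store is listed as an open question below).

```
# Hypothetical edit-types record, one per (wiki, revision).
# All field names are assumptions, not a settled schema.
example_record = {
    "wiki_db": "enwiki",
    "page_id": 12345,                  # namespace-0, non-redirect article
    "revision_id": 1122334455,
    "revision_parent_id": 1122334450,
    "revision_timestamp": "2023-10-15T12:34:56Z",
    # Node-level summary of what changed ("Basic" output of the library).
    "edit_types": {
        "Reference": {"insert": 1},
        "Word": {"change": 2},
    },
}
```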

In theory, both are useful for analytics purposes, but the stream could also potentially be used in Products (as input to revert-risk or other models; eventually filters for RecentChanges etc.), and computing in bulk from mediawiki_history is an expensive/slow operation because it requires shuffling a lot of wikitext. The outlier diffs can be pretty expensive too, so computing each diff individually in a stream keeps failures from cascading into other diff computations.

Tasks

  • Batch job:
    • Isolate relevant edits and their associated metadata (easy)
    • Bring together current and previous wikitext pairs for every revision (lots of shuffling)
    • Compute edit types for these wikitext pairs (lots of computation; occasional outlier with huge memory consumption / time). A PySpark sketch of the batch steps follows this list.
  • Stream job:
    • Apply edit filters to the input page-change stream -- i.e. Wikipedia + namespace 0 + not redirect
    • Fetch current and parent wikitext from the API (or perhaps consume from a page-change-based stream that already has the current and potentially even the parent wikitext?)
    • Compute edit types and add them to the new stream. A sketch of the per-event stream logic also follows this list.
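
For the batch job, a minimal PySpark sketch of the three steps above, assuming the Data Lake tables named in this task; the database prefix, column names, output path, and the compute_edit_types wrapper are all assumptions rather than a worked-out job:

```
from pyspark.sql import SparkSession, functions as F, types as T

def compute_edit_types(prev_wikitext, curr_wikitext):
    # Placeholder for the edit-types library call; in the real job this would
    # return e.g. a JSON summary of what changed between the two revisions.
    return None

spark = SparkSession.builder.getOrCreate()
edit_types_udf = F.udf(compute_edit_types, T.StringType())

# 1. Isolate relevant edits: namespace 0, non-redirect, one monthly snapshot.
#    (The real job would also restrict to revisions from the target month.)
wikitext = (
    spark.table("wmf.mediawiki_wikitext_history")
    .where(F.col("snapshot") == "2023-10")
    .where(F.col("page_namespace") == 0)
    .where(F.col("page_redirect_title").isNull() | (F.col("page_redirect_title") == ""))
    .select("wiki_db", "page_id", "revision_id", "revision_parent_id", "revision_text")
)

# 2. Bring together current and previous wikitext for every revision:
#    a self-join on the parent revision, which is where the heavy shuffling happens.
parents = wikitext.select(
    "wiki_db",
    F.col("revision_id").alias("revision_parent_id"),
    F.col("revision_text").alias("parent_text"),
)
pairs = wikitext.join(parents, on=["wiki_db", "revision_parent_id"], how="left")

# 3. Compute edit types for each (previous, current) wikitext pair.
result = pairs.withColumn(
    "edit_types", edit_types_udf(F.col("parent_text"), F.col("revision_text"))
)
result.write.partitionBy("wiki_db").parquet("/tmp/edit_types/2023-10")
```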
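
For the stream job, a rough sketch of the per-event logic, assuming wikitext is fetched from the MediaWiki Action API; the page-change field names and the compute_edit_types placeholder are assumptions, and an upstream enriched stream could replace the API calls entirely:

```
import requests

def compute_edit_types(prev_wikitext, curr_wikitext):
    # Placeholder for the edit-types library call.
    return {}

def get_wikitext(domain, rev_id):
    # Fetch the wikitext of a single revision from the MediaWiki Action API.
    resp = requests.get(
        f"https://{domain}/w/api.php",
        params={
            "action": "query", "prop": "revisions", "revids": rev_id,
            "rvprop": "content", "rvslots": "main",
            "format": "json", "formatversion": 2,
        },
    )
    page = resp.json()["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]

def process_page_change(event):
    # 1. Edit filters: Wikipedia, namespace 0, not a redirect
    #    (event field names are assumptions about the page-change schema).
    if not event["meta"]["domain"].endswith("wikipedia.org"):
        return None
    if event["page"]["namespace_id"] != 0 or event["page"].get("is_redirect"):
        return None

    # 2. Fetch current and parent wikitext (or read them from an upstream
    #    stream that already carries the wikitext, if one exists).
    domain = event["meta"]["domain"]
    curr = get_wikitext(domain, event["revision"]["rev_id"])
    prev = get_wikitext(domain, event["revision"]["rev_parent_id"])

    # 3. Compute edit types and emit an enriched event for the new stream.
    return {**event, "edit_types": compute_edit_types(prev, curr)}
```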

Context

There are a number of spaces where I envision this being useful:

  • Large-scale analyses of edit / content dynamics on wiki -- e.g., akin to T334760#8782740 (batch or stream work)
  • Smaller-scale aggregations of data about edits for user-facing tools such as campaign dashboards (e.g., how many references were added by this campaign) or user stat pages (e.g., you've added 10 references this month).
  • As a stream that could be consumed by other LiftWing models to determine if they should be triggered -- e.g., perhaps we eventually have a model for analyzing URLs for fact-checking, but it only needs to be triggered if an edit actually inserts/changes a URL on the page; or the readability model should only be triggered when page text changes? A small trigger sketch follows this list.
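
As a small illustration of that last trigger use case, a hypothetical downstream consumer could inspect the edit-types field of an event and only invoke its model when the relevant node type changed (the event shape and node-type names below are assumptions):

```
def should_run_url_model(edit_types_event):
    # Hypothetical trigger check: only run a URL-analysis model if the edit
    # touched an external link or reference (node-type names are assumptions).
    changed_nodes = edit_types_event.get("edit_types", {})
    return any(node in changed_nodes for node in ("ExternalLink", "Reference"))
```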

There are still some open questions that we'll have to address:

  • What "edit types" to store? The library can produce a variety of outputs from the very raw to the more refined:
    • Basic: what types of nodes (References, Text, etc.) changed
      • This can also include the specific details of the change – e.g., what part of the Reference changed, which words changed, etc.
    • Refined: high-level categories like edit size, edit difficulty, edit category
  • This choice also affects whether we use the Simple (and less prone to fail) version of the library or the Complex/Structured (and more prone to fail due to memory errors) version of the library. A sketch contrasting the two follows this list.
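
To illustrate the Simple vs. Complex/Structured distinction, a hedged sketch assuming the library in question is the mwedittypes Python package and that it exposes SimpleEditTypes / StructuredEditTypes classes with a get_diff() method (class names and signatures follow its documentation but should be double-checked):

```
from mwedittypes import SimpleEditTypes, StructuredEditTypes

prev_wikitext = "Dogs are great.<ref>Smith 2020</ref>"
curr_wikitext = "Dogs are wonderful.<ref>Smith 2020</ref><ref>Jones 2021</ref>"

# Simple: node-level summary of what changed (cheaper, less prone to failure).
simple = SimpleEditTypes(prev_wikitext, curr_wikitext, lang="en")
print(simple.get_diff())      # e.g. counts of Reference / Word changes

# Structured: detailed diff of the change (richer, but heavier on memory for
# outlier diffs, so more prone to fail on very large edits).
structured = StructuredEditTypes(prev_wikitext, curr_wikitext, lang="en")
print(structured.get_diff())  # e.g. which reference was added, which words changed
```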

Event Timeline

We reviewed this task in the backlog grooming meeting on November 21st. Given the limited capacity on the engineering front at this time and prioritization discussions (with input from @fkaelin and @Miriam), we decided to prioritize T351674 instead. We will keep this task open, as it is possible that we can pick it up in the coming 6 months, and we will review it again in future backlog grooming meetings.