
[SPIKE] 📊 Research options for real-time processing
Open, Needs Triage | Public | 5 Estimated Story Points



The purpose of this spike is to understand how to access the revisions Kafka topic, as well as the feedback generated by Android clients. We need both to keep track of page status changes between model runs.

Acceptance Criteria

  • Proof of concept for a simple consumer to track model state updates
  • Identify a data model and a (basic) set of requirements
  • Document (initial) approaches to stream processing in use at WMF, and how they could integrate with the ImageMatching batch and serving layers.

Event Timeline

gmodena renamed this task from [Placeholder] 📊 Research options for real-time processing to [SPIKE] 📊 Research options for real-time processing. May 3 2021, 7:26 PM
gmodena claimed this task.
gmodena updated the task description.
gmodena set the point value for this task to 5.

The Image Matching model is trained on a monthly schedule. During the month, the state of a recommendation can change. For example:

  • A recommendation has been rejected and should not be offered again.
  • A recommendation has been accepted; the page is now illustrated and should not receive further recommendations.
  • A page has been illustrated by a workflow external to ImageMatching and should not receive further recommendations.

With our current setup, we need to wait until the next training run completes to see these changes reflected in the data.

We need a system in place that tracks state changes in near real-time.

Data model

At WMF, real-time data is available through a multi-DC Kafka broker. Data is exposed via a public REST endpoint (EventStreams) and, internally, to Kafka consumers. For high-volume systems, the documentation recommends interfacing with Kafka directly.

Events that alter a recommendation state can come from:

  • Revision-create Kafka topic (tracks page changes), e.g. metadata of a revision.
  • Android feedback topic, e.g. a rating for a (wiki, page, image) recommendation.
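
As a rough illustration of the two event types above, the sketch below models them as Python dataclasses and parses them from JSON. All field names (`database`, `page_id`, `rev_id`, `rev_parent_id`, `wiki`, `image`, `accepted`) are assumptions made for the example, not the confirmed schemas of either topic.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RevisionEvent:
    """Metadata of a revision (no page content)."""
    wiki: str
    page_id: int
    rev_id: int
    parent_rev_id: Optional[int]


@dataclass(frozen=True)
class FeedbackEvent:
    """Rating for a (wiki, page, image) recommendation."""
    wiki: str
    page_id: int
    image: str
    accepted: bool


def parse_revision(raw: str) -> RevisionEvent:
    # Assumed field names; check them against the real revision-create schema.
    d = json.loads(raw)
    return RevisionEvent(
        wiki=d["database"],
        page_id=int(d["page_id"]),
        rev_id=int(d["rev_id"]),
        parent_rev_id=d.get("rev_parent_id"),  # absent for a page's first revision
    )


def parse_feedback(raw: str) -> FeedbackEvent:
    # Hypothetical payload shape for the Android feedback topic.
    d = json.loads(raw)
    return FeedbackEvent(
        wiki=d["wiki"],
        page_id=int(d["page_id"]),
        image=d["image"],
        accepted=bool(d["accepted"]),
    )
```

Both event types share a (wiki, page) key, which is what a downstream join would most likely use.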

Data model limitations

To the best of my knowledge, we currently stream only the metadata of revision changes. To track the actual change (was an image added or removed?) we will need to access the revision payload, i.e. its content. Two approaches come to mind:

  1. For each revision, call back the revision API and generate a diff between previous and current content.
  2. Perform the diff upstream (at event generation) and annotate the revision-create schema accordingly (or derive a new topic from revision-create).
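
Whichever side performs the diff, the core step could look like the sketch below: extract image references from the previous and current wikitext and compare the two sets. This is a simplification under stated assumptions. The regular expression only catches `[[File:...]]` and `[[Image:...]]` links, so images embedded via templates, infoboxes, or galleries would be missed, and fetching the two revisions' content is left out. The function names are hypothetical.

```python
import re
from typing import Set, Tuple

# Simplified pattern: captures the file name in [[File:Name.ext|...]] or [[Image:Name.ext]].
_IMAGE_LINK = re.compile(r"\[\[\s*(?:File|Image)\s*:\s*([^\]|]+)", re.IGNORECASE)


def image_refs(wikitext: str) -> Set[str]:
    """Return the set of image names linked from a wikitext string."""
    return {name.strip().replace(" ", "_") for name in _IMAGE_LINK.findall(wikitext)}


def image_diff(previous: str, current: str) -> Tuple[Set[str], Set[str]]:
    """Return (added, removed) image names between two revisions' wikitext."""
    before, after = image_refs(previous), image_refs(current)
    return after - before, before - after


previous_text = "An article about cats."
current_text = "An article about cats. [[File:Cat photo.jpg|thumb|A cat]]"
added, removed = image_diff(previous_text, current_text)
# added contains "Cat_photo.jpg"; removed is empty
```

With approach 2, the `added` and `removed` sets are the kind of annotation that could be attached to the event upstream, so consumers would not need to call the API at all.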

Functional requirements

  • Stateful application
  • Ability to process (join) multiple streams
  • State should be queryable by other systems
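
To make these requirements concrete, here is a minimal in-memory sketch, not the PoC and not tied to any specific stream processor. It keeps per-page state, updates it from two event sources (feedback and revision-derived events), and exposes a query method for other systems. In a real deployment the dictionary would be replaced by the managed state of the chosen framework; all class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

PageKey = Tuple[str, int]  # (wiki, page_id)


@dataclass
class PageState:
    illustrated: bool = False
    rejected_images: Set[str] = field(default_factory=set)


class RecommendationState:
    """Keyed state updated by two streams and queryable by other systems."""

    def __init__(self) -> None:
        self._pages: Dict[PageKey, PageState] = {}

    def _state(self, key: PageKey) -> PageState:
        return self._pages.setdefault(key, PageState())

    def on_feedback(self, wiki: str, page_id: int, image: str, accepted: bool) -> None:
        # Stream 1: ratings from Android clients.
        state = self._state((wiki, page_id))
        if accepted:
            state.illustrated = True
        else:
            state.rejected_images.add(image)

    def on_image_added(self, wiki: str, page_id: int) -> None:
        # Stream 2: a revision (from any workflow) added an image to the page.
        self._state((wiki, page_id)).illustrated = True

    def should_recommend(self, wiki: str, page_id: int, image: str) -> bool:
        # Query side: pages with no recorded events keep their batch recommendations.
        state = self._pages.get((wiki, page_id))
        if state is None:
            return True
        return not state.illustrated and image not in state.rejected_images


store = RecommendationState()
store.on_feedback("enwiki", 1, "A.jpg", accepted=False)
store.on_image_added("enwiki", 2)
```

Keying both streams on (wiki, page_id) is what makes the join trivial here; a serving layer could call `should_recommend` to filter the monthly batch output between model runs.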

Stream processing approaches at WMF

Two stream processing technologies currently in use at WMF are change-prop and Apache Flink. Other popular choices we _might_ consider are Kafka Streams and Spark Streaming.
Change-prop is a mature architecture, used for production workloads and onboarded by the broader SRE team. Flink, while feature rich, is not yet a generally available system.


PoC code developed for this spike can be found at [link missing]. The stated goal of demoing was not satisfiable within the budgeted effort. A functionally equivalent aggregation query has been provided instead.
This PoC shows basic approaches to packaging a Flink application, and the moving parts required for deploying clusters atop YARN on WMF's Hadoop cluster.