
Flink Spike
Closed, DeclinedPublic

Description

In one form or another, the resolution of this problem involves stateful computations over event data. Apache Flink looks like it might help, and we should evaluate whether it really does: https://flink.apache.org/

Event Timeline

fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

Returns: a count of all edits for a given page

This use case (from T240387) is relatively simple and wouldn't require emitting any new events. The revision-create event has the page id and page title. For this spike, we could attempt to implement a service returning real-time edit counts per page using Flink.
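As a sketch of the stateful logic this spike would exercise (plain Python rather than Flink APIs; the event shape is a simplified stand-in for the real revision-create schema — in a Flink job this counting would live in per-key state, keyed by page id):

```python
from collections import defaultdict
from typing import Dict, Iterable

def count_edits(revision_create_events: Iterable[dict]) -> Dict[int, int]:
    """Fold revision-create events into an edit count per page.

    Mirrors what a Flink job keyed by page_id would hold in keyed state;
    the "page_id" / "page_title" fields here are illustrative only.
    """
    counts: Dict[int, int] = defaultdict(int)
    for event in revision_create_events:
        counts[event["page_id"]] += 1
    return dict(counts)

events = [
    {"page_id": 42, "page_title": "Apache_Flink"},
    {"page_id": 42, "page_title": "Apache_Flink"},
    {"page_id": 7,  "page_title": "Main_Page"},
]
print(count_edits(events))  # {42: 2, 7: 1}
```

The service itself would then just look up the keyed count for a requested page.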

https://flink.apache.org/usecases.html

For most of the use cases listed there, we can generate historical events and store them somewhere, so that we can replay them through our Flink compute layer whenever we want. So one decision is where to store those kinds of events.

Like revision-create? HDFS/Hive is fine, no?

A benefit of using Flink to process these is that it treats batch processing as a special case of a bounded stream. Aside from some job setup, the same code that processes the stream data could be used to process the batch data.
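A toy illustration of that property (plain Python, not Flink APIs): the same processing function can consume a bounded collection (a batch replayed from storage) or a generator (a live stream) without change.

```python
from typing import Iterable, Iterator, Tuple

def running_edit_counts(events: Iterable[dict]) -> Iterator[Tuple[int, int]]:
    """Emit (page_id, count_so_far) for each event. The caller decides
    whether the input is bounded (batch replay) or unbounded (live stream);
    the processing logic does not care."""
    counts: dict = {}
    for event in events:
        page = event["page_id"]
        counts[page] = counts.get(page, 0) + 1
        yield page, counts[page]

# Bounded input, e.g. historical events replayed from HDFS.
batch = [{"page_id": 1}, {"page_id": 1}, {"page_id": 2}]

# Unbounded-style input, e.g. events consumed from Kafka.
def live_stream():
    yield {"page_id": 1}
    yield {"page_id": 2}

assert list(running_edit_counts(batch)) == [(1, 1), (1, 2), (2, 1)]
assert list(running_edit_counts(live_stream())) == [(1, 1), (2, 1)]
```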

If you mean where to store the up-to-date counts that will be used to return results, e.g. edit counts per page, that does need some thought. I think you can build a queryable state API with Flink, where results are saved in local processor DBs and updated via checkpoints and Kafka, but I'm not sure that's what we'd want to do. The jobs could simply update an external DB with the counts.
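A minimal sketch of the "update an external DB" option, with an in-memory SQLite table standing in for whatever store would actually serve the API (the table name and schema are made up for illustration):

```python
import sqlite3

# In-memory SQLite stands in for the external serving store.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE edit_counts (page_id INTEGER PRIMARY KEY, edits INTEGER)")

def upsert_count(conn: sqlite3.Connection, page_id: int) -> None:
    """What a sink would do per event (or per checkpointed batch):
    increment the serving-store counter for the edited page."""
    conn.execute(
        "INSERT INTO edit_counts (page_id, edits) VALUES (?, 1) "
        "ON CONFLICT(page_id) DO UPDATE SET edits = edits + 1",
        (page_id,),
    )

for event in [{"page_id": 42}, {"page_id": 42}, {"page_id": 7}]:
    upsert_count(db, event["page_id"])

print(db.execute("SELECT page_id, edits FROM edit_counts ORDER BY page_id").fetchall())
# [(7, 1), (42, 2)]
```

Queryable state would avoid the external store, but serving reads from job-local state couples the API's availability to the job's; an external DB keeps the two concerns separate.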

HDFS is fine for storing events, given that Flink is agnostic about consuming batches vs. streams. That's great.

Where to put the output is a great question, and neither Druid nor Cassandra seem like very good options. I wonder if we can find a single solution that would be flexible enough for all our API needs.

Unneeded here, but we might use Flink for other things in the future.