Page MenuHomePhabricator

Deployment strategy and hardware requirement for new Flink based WDQS updater
Closed, ResolvedPublic

Description

Our evaluation and proof of concept around Flink is moving forward. We need to start thinking about a deployment strategy. There are still a lot of unknowns, this is the start of the discussion, not the plan yet.

Context / problem space
See T244590 for the larger context.

The new WDQS update strategy is an event driven application. This is a stream processing application that needs facilities for reordering of events, management of late events and management of state and check-pointing. Flink provides for all those needs, was already envisioned as part of the Event Platform and is also being looked at by CPT for similar use cases.

Requirements
In our use case, Flink requires:

  • compute resources (CPU / RAM)
    • no numbers yet on how much resources we need, but the expectation is that the requirements are going to be similar to what we need for the current updater, which is sharing resources with Blazegraph
  • some local storage for state (which can be considered as transient)
    • our current estimate is that local state (partitioned accross multiple nodes) will be between 1GB and 10GB, but this needs to be refined
  • shared storage for check-pointing and savepoints
    • current strategy is to use HDFS, but other backends can be supported (NFS, Swift, full list)

Dependencies

  • initial state is expected to be loaded from HDFS on our Hadoop cluster (but could be uploaded to swift reusing the ML pipeline)
  • kafka (-main or -jumbo) to consume various event streams
  • wikidata Special:EntityData endpoint to enrich events with actual content
  • kafka to produce TTL stream
  • some system (TBD) for check-point storage
    • While current HDFS storage seems like a good fit, Analytics team cannot provide us a maintenance and guarantees for a production service (no current use case requires high availability).

Strategies
Since we don't have experience with Flink yet, the longer term use cases are still undefined, and addressing the updater issues for WDQS is time sensitive, it might make sense to have a short term intermediate solution and to evolve it in a longer term solution.

  • k8s: Flink itself has no persistent state, it might be a candidate for k8s. Kubernetes native support from Flink seems to still be experimental, but a standalone deployment seems viable
  • dedicated Flink cluster on new hardware (just for the WDQS use case)
  • shared Flink cluster on new hardware (shared cluster for WDQS and CPT use cases + additional future use cases)
  • dedicated Flink cluster collocated on existing WDQS hardware

Open questions

  • which technologies can we reuse (HDFS or other for check-pointing?)
  • should we have a solution dedicated to WDQS or think of a more general service? Start with a dedicated service, move to a generic one later?
  • who should own that service?

Event Timeline

akosiaris triaged this task as Medium priority.Mar 6 2020, 11:30 AM

I would like to get some more info on why our current event processing platform, change-propagation, is not suited for this purpose, and we need to introduce a new software. I suppose this has been done at some point in another task; if so a quick link would suffice :)

Also if this is true - do we have plans for migrating what change-propagation does now to this new platform?

I'd rather not see a proliferation of different platforms to do the same type of work.

@Joe the main reason for me is that we need to do state-full computation over multiple event streams:

  • we want to union multiple event streams
  • we want to reorder events
  • we want to keep a state (last seen rev per entity)

ChangeProp is a "mostly" stateless service (except for deduplication) and has not been designed (I think) as a framework you can build on top, adding new actions seems trivial but this is not exactly what we need.
Adapting ChangeProp to do windowing functions, combining streams, keeping a state that is consistent with kafka offsets seemed a bit complex to write from scratch in ChangeProp.
What flink provides is exactly that: a framework where you can build state-full apps and where event time semantic is at the core of its design (concept of watermarks).
Also I don't see flink as doing the same type of work than change prop, change prop being a configurable service to schedule actions based on events, flink is more a framework to do stream processing.

A nice feature of Flink is its support for both batch and stream processing. Ideally, we'd be able to build lambda architectures reusing most of the core data logic for streams and historical batch backfilling.

Also python support and SQL querying on streams. :)

Yeah, @Gehel analysis is correct - change-prop is pretty simple and doesn't support any of the advanced features pointed in the task. We have never really needed any of those, we mostly rely on the fact that systems updated by change-prop are idempotent and don't care for even order. And I do agree that adding all those features to change-prop is possible, but probably is a waste of time if off-the-shelf software exists.

Also if this is true - do we have plans for migrating what change-propagation does now to this new platform?

Why not. If the system proves to work reliably for such a complicated use-case with state management, we can definitely adopt it for simpler purely stateless use-case of change-prop. We can even reimagine what your job queue controller and change-prop can do with regards to more advanced deduplication/merging of jobs and dependency updates that could move us closer to more unified dependency tracking system.

TLDR, I support this message :)

I think it will be very helpful to have a design document for this service so we are all in the same page of what the flink install would do (as there are other projects currently evaluating flink as well). Can we get a google doc that goes over the design proposed with design diagrams and such?

While not a google doc, the parent ticket's description describes it pretty well: T244590: [Epic] Rework the WDQS updater as an event driven application

And we have a first version of a design document. This is still work in progress, feel free to comment!

random-ish update re: checkpoint storage after a chat with @Zbyszko: the current situation is that we're using thanos-swift cluster for wdqs flink checkpoints. This is meant to be a temporary allocation and wdqs to be eventually moved off thanos-swift cluster. Things have shifted a bit and we're building MOSS (misc object storage service) as a separate swift cluster exactly for these use cases (and more, T264291). The plan is thus to keep using thanos-swift for the time being until moss is online and then migrate the updater there.

This is a bit of a drive-by, but have we considered https://min.io/? I went a bit deeper than just the marketing and was impressed by their error-correcting implementation.

Zbyszko claimed this task.

Strategy was developed and is being implemented.