Our evaluation and proof of concept around Flink is moving forward, and we need to start thinking about a deployment strategy. There are still a lot of unknowns; this is the start of the discussion, not the plan yet.
Context / problem space
See T244590 for the larger context.
The new WDQS updater is an event-driven, stream-processing application that needs facilities for reordering events, managing late events, and managing state and checkpointing. Flink provides all of those, was already envisioned as part of the Event Platform, and is also being looked at by CPT for similar use cases.
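To make the "reordering and late events" requirement concrete, here is a hand-rolled sketch (not our implementation, and far simpler than what Flink actually does) of the watermark-plus-allowed-lateness idea: events are buffered until a watermark passes, emitted in timestamp order, and events arriving later than the lateness allowance are diverted to a side output. Flink provides this machinery, together with the state and checkpointing this toy omits.

```python
import heapq


class ReorderBuffer:
    """Toy event-time reordering: hold events until a watermark passes,
    then emit them in timestamp order; events older than
    (watermark - allowed_lateness) go to a late side output."""

    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.watermark = float("-inf")
        self.heap = []  # min-heap of (timestamp, event)
        self.late = []  # events that arrived too late to reorder

    def add(self, timestamp, event):
        if timestamp < self.watermark - self.allowed_lateness:
            self.late.append((timestamp, event))  # too late: side output
        else:
            heapq.heappush(self.heap, (timestamp, event))

    def advance_watermark(self, new_watermark):
        """Emit, in timestamp order, all buffered events at or before
        the new watermark; later events stay buffered."""
        self.watermark = max(self.watermark, new_watermark)
        out = []
        while self.heap and self.heap[0][0] <= self.watermark:
            out.append(heapq.heappop(self.heap))
        return out


buf = ReorderBuffer(allowed_lateness=5)
for ts, ev in [(3, "a"), (1, "b"), (2, "c")]:
    buf.add(ts, ev)
print(buf.advance_watermark(3))  # → [(1, 'b'), (2, 'c'), (3, 'a')]
```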
In our use case, Flink requires:
- compute resources (CPU / RAM)
- no numbers yet on how many resources we need, but the expectation is that the requirements will be similar to those of the current updater, which shares resources with Blazegraph
- some local storage for state (which can be considered as transient)
- our current estimate is that local state (partitioned across multiple nodes) will be between 1GB and 10GB, but this needs to be refined
- shared storage for checkpointing and savepoints
- current strategy is to use HDFS, but other backends are supported (NFS, Swift; see Flink's filesystem documentation for the full list)
- initial state is expected to be loaded from HDFS on our Hadoop cluster (but could be uploaded to Swift, reusing the ML pipeline)
- Kafka (-main or -jumbo) to consume various event streams
- the Wikidata Special:EntityData endpoint to enrich events with actual content
- Kafka to produce the TTL stream
- some system (TBD) for check-point storage
- While our current HDFS storage seems like a good fit, the Analytics team cannot provide maintenance and guarantees for a production service (no current use case requires high availability).
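For reference, checkpoint and savepoint locations are plain Flink configuration, so switching backends later is mostly a matter of changing URIs. A sketch of the relevant `flink-conf.yaml` keys, assuming HDFS (the paths and cluster name are illustrative, not decided):

```yaml
# RocksDB keeps local state on disk and supports incremental checkpoints;
# "filesystem" would instead keep working state on the JVM heap.
state.backend: rocksdb
state.backend.incremental: true

# Shared, durable storage for checkpoints and savepoints.
# Illustrative paths only; any supported filesystem URI (hdfs://, swift://, ...)
# could be used here.
state.checkpoints.dir: hdfs://analytics-hadoop/flink/checkpoints
state.savepoints.dir: hdfs://analytics-hadoop/flink/savepoints
```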
Since we don't have experience with Flink yet, the longer-term use cases are still undefined, and addressing the updater issues for WDQS is time sensitive, it might make sense to start with a short-term intermediate solution and evolve it into a longer-term one. Options under discussion:
- k8s: Flink itself has no persistent state, so it might be a candidate for Kubernetes. Flink's native Kubernetes support still seems experimental, but a standalone deployment on k8s seems viable
- dedicated Flink cluster on new hardware (just for the WDQS use case)
- shared Flink cluster on new hardware (shared cluster for WDQS and CPT use cases + additional future use cases)
- dedicated Flink cluster co-located on existing WDQS hardware
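To give a feel for the standalone-on-k8s option above, a minimal sketch of a JobManager Deployment (names, image tag, and resource numbers are illustrative only, not a tested manifest; a TaskManager Deployment and a Service would be needed as well):

```yaml
# Sketch: standalone Flink JobManager on Kubernetes (illustrative values).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-jobmanager
spec:
  replicas: 1
  selector:
    matchLabels: {app: flink, component: jobmanager}
  template:
    metadata:
      labels: {app: flink, component: jobmanager}
    spec:
      containers:
        - name: jobmanager
          image: flink:1.11  # illustrative tag
          args: ["jobmanager"]
          resources:
            requests: {cpu: "1", memory: 2Gi}
```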
Open questions:
- which technologies can we reuse (HDFS or something else for checkpointing)?
- should we have a solution dedicated to WDQS, or think of a more general service? Start with a dedicated service and move to a generic one later?
- who should own that service?