Investigate using Spark Streaming as an Event Service Platform
Closed, Resolved · Public · Spike

Description

User Story
As a platform engineer, I need to evaluate Spark Streaming so that the group can use this analysis to decide on a single platform to implement.
Done is:
  • Wiki page updated with analysis, including pros, cons and recommendations against the criteria

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". · Nov 3 2022, 1:32 PM
Connectors

There are a variety of supported Source connectors, but for Sinks, only Kafka and file systems are built in. It is possible to implement custom connectors for e.g. Cassandra.
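
One way to approximate a custom sink without writing a full connector is `foreachBatch`, which hands each micro-batch to an ordinary Python function. The sketch below is a minimal illustration under assumptions, not a Cassandra connector: `make_batch_writer`, `start_custom_sink` and the `write_rows` callable are hypothetical names, and `write_rows` stands in for whatever client would actually write to the target store.

```python
from typing import Callable, List


def make_batch_writer(write_rows: Callable[[List[dict]], None]):
    """Wrap a plain row-writing callable as a foreachBatch handler.

    `write_rows` is a hypothetical stand-in for a real client call
    (for example, a Cassandra driver insert).
    """

    def write_batch(batch_df, batch_id: int) -> None:
        # collect() pulls the whole micro-batch to the driver. That keeps the
        # sketch short, but a real sink would more likely write from the
        # executors (for example with foreachPartition).
        rows = [row.asDict() for row in batch_df.collect()]
        write_rows(rows)

    return write_batch


def start_custom_sink(stream_df, write_rows, checkpoint_dir: str):
    """Attach the handler to a streaming DataFrame and start the query."""
    return (
        stream_df.writeStream
        .foreachBatch(make_batch_writer(write_rows))
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

Note that `foreachBatch` gives at-least-once delivery by default, so the target store would need idempotent writes (or deduplication on `batch_id`) to avoid duplicates after a restart.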

Deployment

Pretty much the same options as Flink here. From a brief read, k8s support seems okay?

Checkpointing

HDFS-compatible file system only?

KafkaSource
  • Offsets are stored in memory and in a metadata log on an HDFS (compatible?) file system.
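
For reference, a rough sketch of how a Kafka-backed streaming query is usually tied to a checkpoint directory. The function names, broker address, topic and checkpoint path are all placeholders rather than anything from our setup, and the console sink is used only to keep the example small.

```python
def kafka_source_options(bootstrap_servers: str, topic: str,
                         starting_offsets: str = "latest") -> dict:
    """Build the option map for Spark's built-in Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        # Only consulted when the query starts with an empty checkpoint;
        # a restarted query resumes from the offsets in the checkpoint.
        "startingOffsets": starting_offsets,
    }


def start_checkpointed_query(spark, options: dict, checkpoint_dir: str):
    """Read from Kafka and write to the console, checkpointing progress.

    `spark` is an existing SparkSession; `checkpoint_dir` should point at
    an HDFS-compatible path so that it survives driver restarts.
    """
    events = spark.readStream.format("kafka").options(**options).load()
    return (
        events.selectExpr("CAST(value AS STRING) AS value")
        .writeStream
        .format("console")
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

Running this requires the `spark-sql-kafka-0-10` package on the classpath, since the Kafka source is not bundled with the default PySpark install.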
Enrichment usability

I just embarked on a quick exercise to create a PySpark version of this enrichment UDF. While I got stuck on Java dependencies, my experience was basically the same as doing this in Flink. Spark's Python support seems a little better than Flink's.

I've already demoed creating an Event Platform based Spark Streaming DataFrame here. (If we chose to invest in Spark Streaming, we'd abstract more of that, like we have for Flink, with DataFrame factory functions and/or a Catalog implementation.) Defining the UDF is pretty much the same as in Flink, except you don't always have to specify the return type. I believe you do if the type is a complex/nested one, and in that case you'd use Spark's own DataType system, which is similar to Flink's.
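
To make the return type point concrete, here is a minimal sketch of a PySpark UDF that returns a nested (struct) value. The `enrich` function and its fields are invented for illustration and are not the actual enrichment logic. When `returnType` is omitted, `udf` defaults to `StringType`, which is why a struct result needs an explicit `StructType`.

```python
def enrich(title: str) -> dict:
    """Hypothetical enrichment: derive a couple of fields from a page title."""
    return {
        "title_length": len(title),
        "is_talk_page": title.startswith("Talk:"),
    }


def build_enrich_udf():
    """Wrap `enrich` as a Spark UDF with an explicit struct return type."""
    # Imported lazily so the plain function above stays usable without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import (
        BooleanType, IntegerType, StructField, StructType,
    )

    return_type = StructType([
        StructField("title_length", IntegerType()),
        StructField("is_talk_page", BooleanType()),
    ])
    return udf(enrich, returnType=return_type)
```

Usage would look something like `df.withColumn("enriched", build_enrich_udf()("page_title"))`, where `page_title` is an assumed column name.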

Summary

This is a very quick evaluation, but honestly I'm not so sure Spark is that much easier than Flink in practice for what we are trying to do. The fact that there are so few built-in streaming sink connectors (only Kafka and file systems) is a big downside.

Spark is great if you are ultimately working in a data lake, but for building a production stream processing platform, I'm not so sure.