
Evaluate Flink Operator on DSE Kubernetes Cluster for deployment and management of stateful search applications
Open, Needs Triage · Public · 5 Estimated Story Points

Description

Search platform uses Flink for WCQS and WDQS, and will soon begin using it for the search update pipeline as well.

To increase stability and security for the above services, we'd like to run Flink Operator in Kubernetes. Additionally, the Event Platform team has expressed interest in Flink operator.

Creating this task to:

  • Gather requirements for Flink Operator (compute resources, k8s version, permissions inside the cluster, options for limiting its blast radius, etc.)
  • Deploy on DSE cluster or another appropriate evaluation environment.
  • Prototype/Replicate rdf-streaming-updater deployment using the operator

Event Timeline

MPhamWMF set the point value for this task to 5. (Oct 24 2022, 3:45 PM)
bking renamed this task from Evaluate Flink Operator on Staging Kubernetes Cluster to Evaluate Flink Operator on DSE Kubernetes Cluster. (Oct 31 2022, 1:24 PM)
bking updated the task description.

Per last week's conversation with @dcausse:
The helm chart for the operator needs to be modified for the DSE environment. @BTullis's spark helm chart PR is probably a good template. Also note that @Ottomata is working on the Flink docker images.
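For context, upstream documents installing the operator chart roughly as below; the release version is illustrative, and in our setup this would instead go through our own modified chart and deployment tooling:

```
# Add the upstream chart repo for a given operator release (version illustrative)
helm repo add flink-operator-repo https://downloads.apache.org/flink/flink-kubernetes-operator-1.6.1/

# Install the operator (upstream's default config also expects cert-manager
# to be present for the admission webhook)
helm install flink-kubernetes-operator flink-operator-repo/flink-kubernetes-operator
```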

Will reach out to the others to determine next steps.

Ottomata renamed this task from Evaluate Flink Operator on DSE Kubernetes Cluster to Evaluate Flink Operator on DSE Kubernetes Cluster for deployment and management of stateful search applications. (Dec 6 2022, 3:32 PM)
Ottomata updated the task description.

As far as we can tell, the flink-kubernetes-operator will greatly ease the management overhead of deploying Flink applications. Almost all operations can be managed via the FlinkDeployment CRD and a helm chart.
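As a rough illustration of what that looks like (names, image, and resource sizes below are made up, not the real rdf-streaming-updater values), a FlinkDeployment resource is a single manifest covering image, resources, and job spec:

```
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: rdf-streaming-updater        # hypothetical name
spec:
  image: flink:1.16                  # would be one of our own Flink images
  flinkVersion: v1_16
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
  serviceAccount: flink
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/usrlib/streaming-updater.jar   # hypothetical path
    parallelism: 2
    upgradeMode: savepoint           # operator takes a savepoint before upgrades
```

The operator then reconciles upgrades, suspends, and restarts from changes to this spec, rather than us scripting them by hand.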

The main situation where we don't get any improvement is automated job restarts after a full k8s cluster upgrade. Kubernetes HA relies on ConfigMaps to store a pointer to the latest job state savepoint/checkpoint (in an object store somewhere). When we upgrade k8s, we fully shut down the cluster and all running services, and in doing so we lose the ConfigMaps. The current k8s upgrade process therefore involves manually stopping the Flink job, noting the latest state pointer, and restarting the job from that state after the upgrade.
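That manual process is essentially the standard Flink CLI stop/run-from-savepoint dance; a sketch, with illustrative paths and job ids:

```
# 1. Stop the job, taking a final savepoint (prints the savepoint path)
flink stop --savepointPath s3://flink-savepoints/rdf-streaming-updater <job-id>

# 2. Record the savepoint path from the output, then perform the k8s upgrade.

# 3. Resubmit the job, resuming from that savepoint
flink run --fromSavepoint s3://flink-savepoints/rdf-streaming-updater/savepoint-xxxx updater.jar
```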

This process will continue to work with the flink-kubernetes-operator, but it would be nice if the HA state were managed outside of k8s, so that the application is not so dependent on the infrastructure it runs on. Our only other current option is to store the HA state in ZooKeeper. This might be worth considering, although with ZooKeeper no longer needed by more recent Kafka versions, it would be a shame to keep it around just for Flink apps. There was a proposal to implement etcd-based HA (FLINK-11105), but it was abandoned in 2020 after k8s ConfigMap HA support was implemented.
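For comparison, the two HA options amount to a small difference in Flink configuration (quorum addresses and storage paths below are placeholders):

```
# Kubernetes HA (current approach): leader/checkpoint pointers live in ConfigMaps,
# which are lost on a full cluster teardown
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: s3://flink-ha/recovery

# ZooKeeper HA (alternative): pointers live outside k8s and survive a rebuild
# high-availability: zookeeper
# high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
# high-availability.storageDir: s3://flink-ha/recovery
```

Either way the actual state stays in the object store; only the pointer to it moves.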