
Investigate using session cluster for Flink
Closed, Resolved | Public | 5 Estimated Story Points

Description

Flink documentation on the different Kubernetes deployment modes: https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/resource-providers/standalone/kubernetes.html#deployment-modes

The current plan is to deploy Flink as an Application Cluster, which runs a single application. This means the streaming updater jar is bundled in the Flink image, and stopping the streaming updater job stops the entire Flink application. This makes it difficult to stop the Flink job with a savepoint, because the Flink application cannot stay alive without the running Streaming Updater job.

It was suggested to use a Session Cluster instead, which would allow us to stop the job cleanly without stopping the entire Flink application. Any job could then be stopped with a savepoint (via the API or UI) and started again from a savepoint (again via the API or UI). It would also allow multiple jobs to be uploaded to one session cluster, so that the WCQS (Commons Query Service) updater could run on the same Flink session cluster. Jars are uploaded to the session cluster and, in Flink HA mode, saved in Swift. If the session cluster is shut down by SRE, any running jobs will still be resumed using the state stored in the HA configmaps.
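As a rough illustration of the "via the API" path described above, the following sketch builds the two Flink REST requests involved: `POST /jobs/<jobid>/stop` to stop a job with a savepoint, and `POST /jars/<jarid>/run` to start an already uploaded jar from a savepoint. The base URL, job ID, jar ID and savepoint paths are placeholders assumed for this example, not values from this task, and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: placeholder address of the session cluster's JobManager REST endpoint.
FLINK_REST = "http://localhost:8081"


def _post_json(url, payload):
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stop_with_savepoint_request(job_id, target_directory, base_url=FLINK_REST):
    """Request that stops a running job and writes a savepoint to target_directory."""
    payload = {"targetDirectory": target_directory, "drain": False}
    return _post_json(f"{base_url}/jobs/{job_id}/stop", payload)


def run_from_savepoint_request(jar_id, savepoint_path, base_url=FLINK_REST):
    """Request that starts a previously uploaded jar, restoring state from savepoint_path."""
    payload = {"savepointPath": savepoint_path}
    return _post_json(f"{base_url}/jars/{jar_id}/run", payload)


def send(request):
    """Send a prepared request to the cluster and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Against a live session cluster, `send(stop_with_savepoint_request(job_id, directory))` would trigger the stop; the session cluster itself stays up, so the same jar can later be started again with `run_from_savepoint_request`.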

AC:

  • documentation on what Session mode is (high level documentation, use case, links to Flink official doc)
  • common understanding between SRE and Search Platform about using Session mode
  • configuration change to enable session mode

Event Timeline

Gehel updated the task description.

Change 681495 had a related patch set uploaded (by Mstyles; author: Mstyles):

[wikidata/query/flink-rdf-streaming-updater@master] remove streaming updater jar

https://gerrit.wikimedia.org/r/681495

Change 681497 had a related patch set uploaded (by Mstyles; author: Mstyles):

[operations/deployment-charts@master] rdf-streaming-updater: use session mode

https://gerrit.wikimedia.org/r/681497

Change 671204 had a related patch set uploaded (by DCausse; author: Mstyles):

[operations/deployment-charts@master] rdf-streaming-updater: switch to H/A session-cluster

https://gerrit.wikimedia.org/r/671204

Change 681497 abandoned by DCausse:

[operations/deployment-charts@master] rdf-streaming-updater: use session mode

Reason:

squashed into parent

https://gerrit.wikimedia.org/r/681497

dcausse assigned this task to Mstyles.

We decided to go with the session cluster for now and to evaluate moving to an application cluster once we have more experience running Flink on k8s.

Change 671204 merged by jenkins-bot:

[operations/deployment-charts@master] rdf-streaming-updater: switch to H/A session-cluster

https://gerrit.wikimedia.org/r/671204

Change 681495 merged by jenkins-bot:

[wikidata/query/flink-rdf-streaming-updater@master] remove streaming updater jar

https://gerrit.wikimedia.org/r/681495