Investigate using session cluster for Flink
Closed, Resolved · Public · 5 Estimated Story Points
Assigned To

Authored By

	Mstyles
	Apr 14 2021, 4:41 PM

Description

Flink documentation on the different Kubernetes deployment modes: https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/resource-providers/standalone/kubernetes.html#deployment-modes

Currently the plan is to deploy Flink as an Application Cluster, which runs a single application. This means the streaming updater jar is bundled in the Flink image, and stopping the streaming updater job stops the entire Flink application. That makes it difficult to stop the job with a savepoint, because the Flink application cannot stay alive without the running Streaming Updater job.

It was suggested to use a Session Cluster instead, which would allow us to stop the job cleanly without stopping the entire Flink cluster. Any job could then be stopped with a savepoint and started again from a savepoint, in both cases via the API or the UI. It would also allow multiple jobs to be uploaded to one session cluster, so that the WCQS (Wikimedia Commons Query Service) updater could run on the same Flink session cluster. Jars are uploaded to the session cluster and, in Flink HA mode, are saved in Swift. If SRE shuts down the session cluster, any running jobs will still be resumed using the state stored in the HA ConfigMaps.
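As a rough illustration of the stop/restart cycle described above, here is a minimal Python sketch that builds the two Flink REST requests involved: `POST /jobs/:jobid/stop` (stop with savepoint) and `POST /jars/:jarid/run` (start an uploaded jar from a savepoint). The JobManager address, job ID, jar ID and savepoint paths are hypothetical placeholders, not values from this task. The sketch only constructs the requests; sending them against a real cluster (for example with `urllib.request.urlopen`) is left out.

```python
import json
import urllib.request

# Hypothetical JobManager REST address; the real service name is not given in this task.
FLINK_REST = "http://flink-session-jobmanager:8081"


def _post_json(path, payload):
    """Build (but do not send) a JSON POST request against the Flink REST API."""
    return urllib.request.Request(
        FLINK_REST + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stop_with_savepoint_request(job_id, target_directory, drain=False):
    """Request that stops one job after writing a savepoint; the session cluster keeps running."""
    return _post_json(
        f"/jobs/{job_id}/stop",
        {"targetDirectory": target_directory, "drain": drain},
    )


def run_from_savepoint_request(jar_id, savepoint_path, allow_non_restored_state=False):
    """Request that starts a previously uploaded jar, restoring state from a savepoint."""
    return _post_json(
        f"/jars/{jar_id}/run",
        {
            "savepointPath": savepoint_path,
            "allowNonRestoredState": allow_non_restored_state,
        },
    )


if __name__ == "__main__":
    # Placeholder identifiers, for illustration only.
    stop = stop_with_savepoint_request("example-job-id", "swift://example/savepoints")
    run = run_from_savepoint_request("example-jar-id", "swift://example/savepoints/savepoint-1")
    print(stop.get_method(), stop.full_url)
    print(run.get_method(), run.full_url)
```

Because the jar lives in the session cluster rather than in the image, the same two calls would apply to any job on the cluster, including a second updater such as the WCQS one.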

AC:

documentation on what Session mode is (high level documentation, use case, links to Flink official doc)
common understanding between SRE and Search Platform about using Session mode
configuration change to enable session mode

Details

Subject	Repo	Branch	Lines +/-
remove streaming updater jar	wikidata/query/flink-rdf-streaming-updater	master	+8 -86
rdf-streaming-updater: switch to H/A session-cluster	operations/deployment-charts	master	+270 -284
rdf-streaming-updater: use session mode	operations/deployment-charts	master	+58 -251

Related Objects

Status	Assigned	Task
Resolved	Gehel	T244590 [Epic] Rework the WDQS updater as an event driven application
Resolved	Gehel	T264006 Deploy Flink (rdf-streaming-updater) to kubernetes (k8s)
Resolved	Mstyles	T280166 Investigate using session cluster for Flink

Event Timeline

Mstyles created this task. Apr 14 2021, 4:41 PM

Restricted Application removed a project: Patch-For-Review. Apr 14 2021, 4:41 PM

Mstyles mentioned this in T273098: High Availability Flink. Apr 14 2021, 10:51 PM

Mstyles moved this task from Incoming to Current work on the Wikidata-Query-Service board. Apr 19 2021, 3:13 PM

Gehel updated the task description. Apr 19 2021, 3:57 PM

Gehel updated the task description.

Gehel updated the task description. Apr 19 2021, 4:00 PM

MPhamWMF set the point value for this task to 5. Apr 19 2021, 4:01 PM

MPhamWMF moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

Change 681495 had a related patch set uploaded (by Mstyles; author: Mstyles):

[wikidata/query/flink-rdf-streaming-updater@master] remove streaming updater jar

https://gerrit.wikimedia.org/r/681495

gerritbot added a project: Patch-For-Review. Apr 20 2021, 11:42 PM

Change 681497 had a related patch set uploaded (by Mstyles; author: Mstyles):

[operations/deployment-charts@master] rdf-streaming-updater: use session mode

https://gerrit.wikimedia.org/r/681497

Change 671204 had a related patch set uploaded (by DCausse; author: Mstyles):

[operations/deployment-charts@master] rdf-streaming-updater: switch to H/A session-cluster

https://gerrit.wikimedia.org/r/671204

Change 681497 abandoned by DCausse:

[operations/deployment-charts@master] rdf-streaming-updater: use session mode

Reason:

squashed into parent

https://gerrit.wikimedia.org/r/681497

We decided to go with the session cluster for now and will evaluate moving to an application cluster once we have more experience running Flink on k8s.

Change 671204 merged by jenkins-bot: