[SPIKE] Assess what is required for the enrichment pipeline to run on k8s
Closed, Resolved · Public · Spike

Assigned To

Authored By

	gmodena
	Aug 17 2022, 10:38 AM

Description

UPDATE: Research phase complete. This ticket has been closed. The conversation will continue on MediaWiki, on the discussion page for this topic.

To bridge the gap between dev and prod environments we would like to run jobs on k8s.

Our use case is described in "Use case: compute needs for streaming pipelines".

The goal of this Spike is to determine whether local or WMF Cloud-based k8s instances are suitable environments for learning, experimentation and development.
We would like to collect info to make an informed decision about the following:

Do we want to invest resources in developing k8s capabilities for development productivity and testing?
Do we want to invest resources in improving our release and deployment cycles targeting YARN?

The two are not mutually exclusive. Discarding this work for now is ok too.

Success criteria

Mediawiki Stream Enrichment can run on k8s (minikube) consuming synthetic data.

References:

Related Objects

Mentioned In: T320812: [SPIKE] Deploy event driven stateless Flink service to DSE cluster
T275551: Using docker in WMF production network outside of kubernetes

Event Timeline

gmodena created this task. Aug 17 2022, 10:38 AM

Restricted Application added a subscriber: Aklapper. Aug 17 2022, 10:38 AM

gmodena moved this task from Backlog to Sprint 00 on the Event-Platform board. Aug 17 2022, 1:16 PM

gmodena edited projects, added Event-Platform (Sprint 00); removed Event-Platform.

gmodena renamed this task from [SPIKE][NEEDS GROOMING] Flink enrichment pipline should run on k8 to [SPIKE][NEEDS GROOMING] Assess what is required for the enrichment pipline to run on k8. Aug 17 2022, 3:09 PM

gmodena moved this task from Next Up to In Progress on the Event-Platform (Sprint 00) board.

gmodena added a project: Spike.

Restricted Application changed the subtype of this task from "Task" to "Spike". Aug 17 2022, 3:10 PM

gmodena updated the task description. Aug 18 2022, 10:48 AM

gmodena mentioned this in T275551: Using docker in WMF production network outside of kubernetes. Aug 24 2022, 7:34 PM

gmodena renamed this task from [SPIKE][NEEDS GROOMING] Assess what is required for the enrichment pipline to run on k8 to [SPIKE] Assess what is required for the enrichment pipline to run on k8. Aug 29 2022, 12:06 PM

Spike summary

I explored adapting the k8s workshop to Apache Flink. It boils down to running Flink on minikube, which can be done locally without needing a Cloud VPS VM.

The following are some considerations to bring to the next grooming session.

I'd say that Cloud VPS would not buy us much, other than _potentially_ granting multi-user access to a self-hosted minikube, or exposing a public-facing service. I don't think we want to go down the path of maintaining either (for dev workflows).

Setting up minikube is a well-documented and straightforward process (at least on macOS/Linux).
For running Flink on k8s, I explored two paths:

1) Adjusting the Search flink-session-cluster helm charts.
2) Using the recently released Apache Flink Kubernetes Operator.

While for production use cases we should clearly adopt 1), both approaches offer interesting angles for experimentation and local development.

Path 1) requires a Docker image and decoupling the charts from the specific use case and WMF envs (https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/flink-session-cluster/values.yaml). We should consider contributing a sufficiently generic config, and making the setup more self-service for developers who want to run things on minikube.
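
To illustrate the decoupling idea, here is a minimal Python sketch of how Helm layers values files: a generic base, with environment-specific overrides merged on top (nested maps merge recursively, scalars and lists are replaced). It is a simplification (for instance, it ignores Helm's handling of null values), and every key name and value below is hypothetical rather than taken from the flink-session-cluster chart.

```python
from copy import deepcopy


def merge_values(base, override):
    """Recursively merge `override` into a copy of `base`.

    Nested mappings are merged key by key; any other value
    (scalars, lists) in `override` replaces the one in `base`.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical generic defaults, free of use-case and WMF specifics.
GENERIC_VALUES = {
    "image": {"repository": "flink", "tag": "1.15"},
    "taskManager": {"replicas": 1, "slots": 2},
    "kafka": {"bootstrapServers": "localhost:9092"},
}

# Hypothetical overrides a developer might keep for a local minikube.
MINIKUBE_OVERRIDES = {
    "taskManager": {"replicas": 2},
    "kafka": {"bootstrapServers": "kafka.default.svc:9092"},
}

minikube_values = merge_values(GENERIC_VALUES, MINIKUBE_OVERRIDES)
```

With a layout like this, production and minikube would share the same chart and differ only in the small override file passed at install time.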

Path 2) was easier to set up "out of the box". Setting up cluster deployments that can accept job submissions, either interactively or programmatically, is well documented at https://github.com/apache/flink-kubernetes-operator/tree/main/examples. The tutorial at https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/try-flink-kubernetes-operator/quick-start/ gives the basic building blocks for setting up a Flink cluster ready to accept jobs.
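
For illustration, a cluster that waits for job submissions can be described by a FlinkDeployment resource that has no job section. The sketch below builds such a manifest in Python and prints it as JSON, which kubectl also accepts. The field names follow the operator's v1beta1 examples as best understood; the resource name, image tag, Flink version, service account and sizing are assumptions to adjust for the local setup.

```python
import json


def session_cluster_manifest(name="flink-session-dev", image="flink:1.15",
                             flink_version="v1_15", task_slots=2):
    """Build a FlinkDeployment manifest for a session cluster.

    Leaving out the `job` section asks the operator for a cluster that
    waits for job submissions instead of running a single application.
    """
    resource = {"memory": "2048m", "cpu": 1}  # assumed sizing for a laptop
    return {
        "apiVersion": "flink.apache.org/v1beta1",
        "kind": "FlinkDeployment",
        "metadata": {"name": name},  # hypothetical resource name
        "spec": {
            "image": image,
            "flinkVersion": flink_version,
            "flinkConfiguration": {
                # Flink configuration values are passed as strings.
                "taskmanager.numberOfTaskSlots": str(task_slots),
            },
            # Assumed: the service account set up by the operator's Helm chart.
            "serviceAccount": "flink",
            "jobManager": {"resource": dict(resource)},
            "taskManager": {"resource": dict(resource)},
        },
    }


if __name__ == "__main__":
    print(json.dumps(session_cluster_manifest(), indent=2))
```

Piping the output into `kubectl apply -f -` against a minikube cluster that already has the operator installed should be enough to bring up the session cluster, assuming the defaults above match the environment.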

gmodena moved this task from In Progress to In Review on the Event-Platform (Sprint 00) board. Aug 29 2022, 2:22 PM

gmodena renamed this task from [SPIKE] Assess what is required for the enrichment pipline to run on k8 to [SPIKE] Assess what is required for the enrichment pipeline to run on k8. Aug 29 2022, 2:27 PM

gmodena updated the task description. Aug 29 2022, 3:15 PM

For reference, some resources on how Google and Spotify are operating Flink on k8s:

gmodena updated the task description. Aug 30 2022, 9:08 AM

gmodena updated the task description. Aug 30 2022, 9:13 AM

JArguello-WMF edited projects, added Event-Platform (Sprint 01); removed Event-Platform (Sprint 00). Sep 1 2022, 12:12 PM

JArguello-WMF moved this task from Next Up to In Review on the Event-Platform (Sprint 01) board.

@gmodena thanks for exploring these k8s deployment options!
Something I used to test H/A capabilities (restarts & recovery) was https://min.io/ with minikube. I might still have some config examples. I remember it was not quite trivial to set up, but most probably because of my lack of knowledge of k8s.
Making the current flink-session-cluster helm chart more generic definitely sounds valuable in the short/mid-term.
For the long term I hope we can explore using the Apache flink-kubernetes-operator in production; the hope is that it could solve some of the pain points we have regarding k8s and job management.

xcollazo subscribed. Sep 1 2022, 3:48 PM

• EChetty edited projects, added Data-Engineering-Planning; removed Data-Engineering. Sep 6 2022, 10:40 AM

• EChetty moved this task from Backlog to Event Platform on the Data-Engineering-Planning board. Sep 6 2022, 10:47 AM

fkaelin subscribed. Sep 7 2022, 1:39 PM

akosiaris updated the task description. Sep 8 2022, 8:14 AM

akosiaris added a subscriber: JMeybohm.

akosiaris subscribed.

akosiaris renamed this task from [SPIKE] Assess what is required for the enrichment pipeline to run on k8 to [SPIKE] Assess what is required for the enrichment pipeline to run on k8s. Sep 8 2022, 8:18 AM

akosiaris added a subscriber: jijiki. Sep 8 2022, 8:23 AM

JArguello-WMF closed this task as Resolved. Sep 15 2022, 4:24 PM

JArguello-WMF updated the task description.

gmodena mentioned this in T320812: [SPIKE] Deploy event driven stateless Flink service to DSE cluster. Oct 27 2022, 12:51 PM

bking awarded a token. Oct 27 2022, 4:45 PM