New Service Request: flink-kubernetes-operator
Closed, Resolved · Public

Description

Description: flink-kubernetes-operator handles running 'native' Flink clusters in Kubernetes. It allows us to describe FlinkDeployment k8s resources and manage the lifecycle of running Flink applications.
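For illustration, a FlinkDeployment resource of the kind the operator manages looks roughly like the sketch below (the name, namespace, image, and resource values are hypothetical placeholders, not anything from this task):

```yaml
# Hypothetical FlinkDeployment custom resource; the operator watches these and
# manages the lifecycle of the corresponding Flink cluster.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-flink-app        # placeholder app name
  namespace: example-namespace   # placeholder namespace
spec:
  image: flink:1.16              # placeholder image
  flinkVersion: v1_16
  jobManager:
    resource:
      memory: "1024m"
      cpu: 1
  taskManager:
    resource:
      memory: "1024m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar  # placeholder jar
    parallelism: 2
    upgradeMode: stateless
```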

flink-kubernetes-operator has already been developed, deployed, and tested in the dse-k8s-eqiad cluster. We'd like to deploy it to wikikube staging, eqiad, and codfw in order to run Flink apps there.

Timeline: Before 2023-04
Diagram:

Deployment checklist

Event Timeline

Change 904226 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] Install flink operator in wikikube staging-eqiad

https://gerrit.wikimedia.org/r/904226

Could you please share resource requirements for the operator from your experiments on DSE here so that we know what to expect?

Change 905295 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] admin_ng/flink-operator - fix prometheus reporting configuration

https://gerrit.wikimedia.org/r/905295

Change 905295 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng/flink-operator - fix prometheus reporting configuration

https://gerrit.wikimedia.org/r/905295

Could you please share resource requirements for the operator from your experiments on DSE here so that we know what to expect?

Should be pretty minimal. We're currently running one or two Flink apps; pod memory is < 1G and CPU usage is minimal.

The operator itself doesn't do much work. It mostly handles deployment of the JobManager pod(s) in the app's namespace. Beyond that the JobManager is responsible for talking with k8s to get the resources it needs to spawn TaskManagers.

More deployed flink-apps will mean more work for the operator, but I don't expect it ever to be a resource hog.

We haven't yet experimented with operator HA, but if we do I expect we'd just run two operator replica pods.
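If we do try operator HA, my understanding from the upstream flink-kubernetes-operator docs is that it is mostly a matter of raising the replica count and enabling leader election. A hedged sketch of the relevant values (exact key names should be checked against the chart version we vendor):

```yaml
# Sketch only: operator HA via two replicas plus leader election, per the
# upstream flink-kubernetes-operator docs; verify keys against our vendored chart.
replicas: 2
defaultConfiguration:
  create: true
  append: true
  flink-conf.yaml: |+
    kubernetes.operator.leader-election.enabled: true
    kubernetes.operator.leader-election.lease-name: flink-operator-lease
```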

BTW, IIUC T331283: [Event Platform] [NEEDS GROOMING] Store Flink HA metadata in Zookeeper is an app-specific configuration. The Flink app (not the Flink k8s operator) stores the latest checkpoints, so this can even be different per app.

I'm going to remove this as a checklist item for this task, and put it in T330507.

Could you please share resource requirements for the operator from your experiments on DSE here so that we know what to expect?

Should be pretty minimal. We're currently running one or two Flink apps; pod memory is < 1G and CPU usage is minimal.

That's actually quite a bit of memory; I had expected less. :) From what I see you are running the operator completely unbound (e.g. no resource definitions). That is not something we can do on wikikube, and since it's a JVM, setting a memory limit probably changes the "requirement" as well. I would suggest experimenting with proper values in DSE first (the chart's values.yaml suggests 512Mi in a comment, for example).
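For reference, bounding the operator container would be something along these lines in the chart values (the numbers are only illustrative starting points, and the exact key path depends on the chart version):

```yaml
# Illustrative resource bounds for the operator container; the key path and
# numbers are assumptions to be validated in DSE, not agreed values.
operatorPod:
  resources:
    requests:
      cpu: 100m
      memory: 512Mi
    limits:
      cpu: 500m
      memory: 512Mi
```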

BTW, IIUC T331283: [Event Platform] [NEEDS GROOMING] Store Flink HA metadata in Zookeeper is an app-specific configuration. The Flink app (not the Flink k8s operator) stores the latest checkpoints, so this can even be different per app.

I'm going to remove this as a checklist item for this task, and put it in T330507.

Indeed. But not knowing much about how this is configured, I would very much like the "flink-app" chart to force users into using Zookeeper/the same HA service implementations. I can imagine this becoming quite messy if different flink clusters use different HA service implementations (even by accident).

I also remember open questions about the flink operator webhook. Did you figure out what it does, and in which cases it might be required or expected? AIUI the recommended installation does set it up and currently we don't.

Ah, I now recall that was answered already:

what the webhook actually does

Responses from Flink mailing list:

webhooks in general are optional components of the k8s operator pattern. Mostly used for validation, sometimes for changing custom resources and handling multiple versions, etc. It's an optional component in the Flink Kubernetes Operator too.

Validation in itself is a mandatory step for every spec change that is submitted to guard against broken configs (things like negative parallelism etc).
But validation can happen in 2 places. It can be done through the webhook, which would result in upfront rejection of the spec on validation error.
Or it can happen during the regular processing/reconciliation process, in which case errors are recorded in the status.
The webhook is a nice way to get validation errors immediately, but as you see it's not necessary, as validation would happen anyway.

So, I suppose it's a nice-to-have, but not necessary for any functionality.
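So I'd keep it disabled for now; as far as I can tell that is a single toggle in the upstream chart values:

```yaml
# As I understand the upstream chart, this skips installing the admission
# webhook; validation then happens during normal reconciliation instead.
webhook:
  create: false
```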

I would suggest experimenting with proper values in DSE first (the chart's values.yaml suggests 512Mi in a comment, for example).

Oh, okay will do.

I would very much like the "flink-app" chart to force users into using Zookeeper/the same HA service implementations.

Agree, or at least provide the defaults. I'm not exactly sure how we'd do this though, other than in documentation or some helmfile/services template generator. I don't want to put hardcoded ZK connection info into the chart.

But in either case it is an app-related setting; it is possible to run in non-HA / 'stateless' mode. We are doing this in DSE now while we wait for T330693.
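To make it concrete, the Zookeeper HA settings are ordinary per-app Flink configuration, roughly like the sketch below (the quorum address and storage path are placeholders, not real endpoints):

```yaml
# Hypothetical per-app HA configuration in a FlinkDeployment spec; quorum and
# storageDir are placeholders and would need real values per app.
flinkConfiguration:
  high-availability: zookeeper
  high-availability.zookeeper.quorum: zk-placeholder.example.org:2181
  high-availability.storageDir: file:///flink-data/ha   # placeholder; needs real shared storage
```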

Change 908310 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] flink-operator - set default resource limits and requests

https://gerrit.wikimedia.org/r/908310

Change 908310 merged by Ottomata:

[operations/deployment-charts@master] flink-operator - set default resource limits and requests

https://gerrit.wikimedia.org/r/908310

Change 908334 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] flink-operator - set default resource limits and requests in operatorPod

https://gerrit.wikimedia.org/r/908334

Change 908334 merged by Ottomata:

[operations/deployment-charts@master] flink-operator - set default resource limits and requests in operatorPod

https://gerrit.wikimedia.org/r/908334

I would suggest experimenting with proper values in DSE first (the chart's values.yaml suggests 512Mi in a comment, for example).

Okay, nice. The JVM will automatically detect the container memory granted to it and set the max and starting Java heap sizes accordingly. I set the container memory to 512Mi, and the JVM heap is now stable at around 50MB. NonHeap memory does seem to increase very slowly, but no more or less than it did before I limited container memory. Previously, NonHeap flattened out at around 130MB used. It looks like it did slowly increase, but at a rate of < 1MB per day. We should keep an eye on that, I guess, but it doesn't look like it is causing a problem?

@JMeybohm whatcha think?

@JMeybohm I'd like to proceed, but first we need to create the flink-operator namespace in staging-eqiad and staging-codfw. I seem to recall there was something more to it than just kubectl create namespace ..., right? Can you help with this?

Once the namespaces are created, I believe the deployment procedure is:

sudo -i
cd /srv/deployment-charts/helmfile.d/admin_ng
helmfile -e staging-codfw diff
# if the diff looks good:
helmfile -e staging-codfw apply

Change 904226 merged by jenkins-bot:

[operations/deployment-charts@master] Install flink operator in wikikube staging-eqiad

https://gerrit.wikimedia.org/r/904226

Change 917373 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] admin_ng/flink-operator - set default rbac.create: false

https://gerrit.wikimedia.org/r/917373

Change 917373 merged by Ottomata:

[operations/deployment-charts@master] admin_ng/flink-operator - set default rbac.create: false

https://gerrit.wikimedia.org/r/917373

Change 919108 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/deployment-charts@master] flink-operator: only deploy it to wikikube@stagings

https://gerrit.wikimedia.org/r/919108

Change 919108 merged by jenkins-bot:

[operations/deployment-charts@master] flink-operator: only deploy it to wikikube@stagings

https://gerrit.wikimedia.org/r/919108

Based on a convo with @akosiaris, we need to undeploy flink-operator in staging-codfw, as well as the mw-page-content-change-enrich namespace there.

Change 922138 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] Undeploy flink-operator and uncreate service namespace in staging-codfw

https://gerrit.wikimedia.org/r/922138

Working a bit on the flink-kubernetes-operator dashboard; I think there might be a small bug in the operator lifecycle metrics.

Change 922874 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] flink-operator - deploy in wikikube eqiad and codfw

https://gerrit.wikimedia.org/r/922874

Change 922138 abandoned by Ottomata:

[operations/deployment-charts@master] Undeploy flink-operator and uncreate service namespace in staging-codfw

Reason:

comment from https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/922874/ "operator (not the apps) should be/stay deployed in staging-codfw as well"

https://gerrit.wikimedia.org/r/922138

Grafana Dash looking good.

Apart from the managed flink clusters in staging-eqiad being empty I agree. :) The way the operator exposes metrics does not seem ideal, though (needing all the label_replace because they don't just provide the deployment as a label). It might make sense to craft a generic relabel rule in prometheus to fix that.
What would be the states we would alert on? From the dashboard I would assume something like cluster lifecycle state != STABLE for X min and deployment status != READY for X min?

Apart from the managed flink clusters in staging-eqiad being empty I agree

Ah, the value was 0 (?) so it wasn't being 'colored'. Fixed it to show the number per namespace and to always display it.

What would be the states we would alert on?

Probably lifecycle state != STABLE would be enough. I think if the lifecycle is STABLE then the deployment status is probably READY. But yeah, that sounds right.
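Something like the rule sketch below is what I have in mind; the metric name and labels are placeholders, since the exact series names depend on how the operator metrics end up in Prometheus after relabeling:

```yaml
# Alerting-rule sketch only; flink_deployment_lifecycle_state is a placeholder
# metric name, to be replaced with whatever the operator actually exposes.
groups:
  - name: flink-operator
    rules:
      - alert: FlinkDeploymentNotStable
        expr: flink_deployment_lifecycle_state{state="STABLE"} == 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Flink deployment {{ $labels.deployment }} has not been STABLE for 15 minutes"
```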

Change 922874 merged by jenkins-bot:

[operations/deployment-charts@master] flink-operator - deploy in wikikube eqiad and codfw

https://gerrit.wikimedia.org/r/922874

Deployed in all wikikube clusters.

We'll have to re-enable operator egress to Zookeeper when we figure out what to do about HA. https://phabricator.wikimedia.org/T331283#8874029
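When we do, I expect it will mostly be adding the Zookeeper hosts back to the operator's egress values, roughly in the shape below (I'm assuming the usual dst_nets-style egress config here; the real key layout, CIDRs, and ports have to come from the chart and the actual Zookeeper hosts):

```yaml
# Assumed shape of the egress values; the CIDR and port are placeholders and
# the key layout is an assumption, not taken from the actual chart.
networkpolicy:
  egress:
    enabled: true
    dst_nets:
      - cidr: 192.0.2.10/32      # placeholder Zookeeper host
        ports:
          - protocol: tcp
            port: 2181
```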