New Service Request: flink-kubernetes-operator
Closed, Resolved · Public

Description

Description: flink-kubernetes-operator handles running 'native' Flink clusters in Kubernetes. It allows us to describe FlinkDeployment k8s resources and manage the lifecycle of running Flink applications.
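For illustration, a FlinkDeployment resource of the kind the operator manages looks roughly like the sketch below (the name, namespace, image, and resource values are hypothetical placeholders, not anything from this task):

```yaml
# Hypothetical FlinkDeployment custom resource; the operator watches these and
# manages the lifecycle of the corresponding Flink cluster.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-flink-app        # placeholder app name
  namespace: example-namespace   # placeholder namespace
spec:
  image: flink:1.16              # placeholder image
  flinkVersion: v1_16
  jobManager:
    resource:
      memory: "1024m"
      cpu: 1
  taskManager:
    resource:
      memory: "1024m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar  # placeholder jar
    parallelism: 2
    upgradeMode: stateless
```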

flink-kubernetes-operator has already been developed, deployed, and tested in the dse-k8s-eqiad cluster. We'd like to deploy it to wikikube staging, eqiad, and codfw in order to run Flink apps there.

Timeline: Before 2023-04
Diagram:

Deployment checklist

Event Timeline

Change 904226 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] Install flink operator in wikikube staging-eqiad

https://gerrit.wikimedia.org/r/904226

Could you please share resource requirements for the operator from your experiments on DSE here so that we know what to expect?

Change 905295 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] admin_ng/flink-operator - fix prometheus reporting configuration

https://gerrit.wikimedia.org/r/905295

Change 905295 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng/flink-operator - fix prometheus reporting configuration

https://gerrit.wikimedia.org/r/905295

Could you please share resource requirements for the operator from your experiments on DSE here so that we know what to expect?

Should be pretty minimal. We're currently running one or two Flink apps; pod memory is < 1G and CPU usage is minimal.

The operator itself doesn't do much work. It mostly handles deployment of the JobManager pod(s) in the app's namespace. Beyond that the JobManager is responsible for talking with k8s to get the resources it needs to spawn TaskManagers.

More deployed flink-apps will mean more work for the operator, but I don't expect it ever to be a resource hog.

We haven't yet experimented with operator HA, but if we do I expect we'd just run two operator replica pods.
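If we do try operator HA, my understanding from the upstream flink-kubernetes-operator docs is that it is mostly a matter of raising the replica count and enabling leader election. A hedged sketch of the relevant values (exact key names should be checked against the chart version we vendor):

```yaml
# Sketch only: operator HA via two replicas plus leader election, per the
# upstream flink-kubernetes-operator docs; verify keys against our vendored chart.
replicas: 2
defaultConfiguration:
  create: true
  append: true
  flink-conf.yaml: |+
    kubernetes.operator.leader-election.enabled: true
    kubernetes.operator.leader-election.lease-name: flink-operator-lease
```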

BTW, IIUC T331283: [Event Platform] [NEEDS GROOMING] Store Flink HA metadata in Zookeeper is an app-specific configuration. The Flink app (not the Flink k8s operator) stores the latest checkpoints, so this can even be different per app.

I'm going to remove this as a checklist item for this task, and put it in T330507.

Could you please share resource requirements for the operator from your experiments on DSE here so that we know what to expect?

Should be pretty minimal. We're currently running one or two Flink apps; pod memory is < 1G and CPU usage is minimal.

That's actually quite a bit of memory; I had expected less. :) From what I see you are running the operator completely unbound (e.g. no resource definitions). That is not something we can do on wikikube, and since it's a JVM, setting a memory limit probably changes the "requirement" as well. I would suggest experimenting with proper values in DSE first (the chart's values.yaml suggests 512Mi in a comment, for example).
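For reference, bounding the operator container would be something along these lines in the chart values (the numbers are only illustrative starting points, and the exact key path depends on the chart version):

```yaml
# Illustrative resource bounds for the operator container; the key path and
# numbers are assumptions to be validated in DSE, not agreed values.
operatorPod:
  resources:
    requests:
      cpu: 100m
      memory: 512Mi
    limits:
      cpu: 500m
      memory: 512Mi
```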

BTW, IIUC T331283: [Event Platform] [NEEDS GROOMING] Store Flink HA metadata in Zookeeper is an app-specific configuration. The Flink app (not the Flink k8s operator) stores the latest checkpoints, so this can even be different per app.

I'm going to remove this as a checklist item for this task, and put it in T330507.

Indeed. But not knowing much about how this is configured, I would very much like the "flink-app" chart to force users into using Zookeeper/the same HA service implementations. I can imagine this becoming quite messy if different flink clusters use different HA service implementations (even by accident).

I also remember open questions about the flink operator webhook. Did you figure out what it does, and in which cases it might be required or expected? AIUI the recommended installation does set it up and currently we don't.

Ah, I now recall that was answered already:

what the webhook actually does

Responses from Flink mailing list:

webhooks in general are optional components of the k8s operator pattern. Mostly used for validation, sometimes for changing custom resources and handling multiple versions, etc. It's an optional component in the Flink Kubernetes Operator too.

Validation in itself is a mandatory step for every spec change that is submitted to guard against broken configs (things like negative parallelism etc).
But validation can happen in 2 places. It can be done through the webhook, which would result in upfront rejection of the spec on validation error.
Or it can happen during the regular processing/reconciliation process, in which case errors are recorded in the status.
The webhook is a nice way to get validation errors immediately, but as you see it's not necessary, as validation would happen anyway.

So, I suppose it's a nice-to-have, but not necessary for any functionality.
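So I'd keep it disabled for now; as far as I can tell that is a single toggle in the upstream chart values:

```yaml
# As I understand the upstream chart, this skips installing the admission
# webhook; validation then happens during normal reconciliation instead.
webhook:
  create: false
```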

I would suggest experimenting with proper values in DSE first (the chart's values.yaml suggests 512Mi in a comment, for example).

Oh, okay will do.

I would very much like the "flink-app" chart to force users into using Zookeeper/the same HA service implementations.

Agree, or at least provide the defaults. I'm not exactly sure how we'd do this though, other than in documentation or some helmfile/services template generator. I don't want to put hardcoded ZK connection info into the chart.

But in either case it is an app-related setting; it is possible to run in non-HA / 'stateless' mode. We are doing this in DSE now while we wait for T330693.
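To make it concrete, the Zookeeper HA settings are ordinary per-app Flink configuration, roughly like the sketch below (the quorum address and storage path are placeholders, not real endpoints):

```yaml
# Hypothetical per-app HA configuration in a FlinkDeployment spec; quorum and
# storageDir are placeholders and would need real values per app.
flinkConfiguration:
  high-availability: zookeeper
  high-availability.zookeeper.quorum: zk-placeholder.example.org:2181
  high-availability.storageDir: file:///flink-data/ha   # placeholder; needs real shared storage
```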

Change 908310 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] flink-operator - set default resource limits and requests

https://gerrit.wikimedia.org/r/908310

Change 908310 merged by Ottomata:

[operations/deployment-charts@master] flink-operator - set default resource limits and requests

https://gerrit.wikimedia.org/r/908310

Change 908334 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] flink-operator - set default resource limits and requests in operatorPod

https://gerrit.wikimedia.org/r/908334

Change 908334 merged by Ottomata:

[operations/deployment-charts@master] flink-operator - set default resource limits and requests in operatorPod

https://gerrit.wikimedia.org/r/908334

I would suggest experimenting with proper values in DSE first (the chart's values.yaml suggests 512Mi in a comment, for example).

Okay, nice. The JVM will automatically detect the container memory granted to it and set the max and starting Java heap sizes accordingly. I set the container memory to 512Mi, and the JVM heap is now stable at around 50MB. NonHeap memory does seem to increase very slowly, but no more or less than it did before I limited container memory. Previously, NonHeap flattened out at around 130MB used. It looks like it did slowly increase, but at a rate of < 1MB per day. We should keep an eye on that, I guess, but it doesn't look like it is causing a problem?

@JMeybohm whatcha think?

@JMeybohm I'd like to proceed, but first we need to create the flink-operator namespace in staging-eqiad and staging-codfw. I seem to recall there was something more to it than just kubectl create namespace ..., right? Can you help with this?

Once the namespaces are created, I believe the deployment procedure is:

sudo -i
cd /srv/deployment-charts/helmfile.d/admin_ng
helmfile -e staging-codfw diff
# if the diff looks good:
helmfile -e staging-codfw apply

Change 904226 merged by jenkins-bot:

[operations/deployment-charts@master] Install flink operator in wikikube staging-eqiad

https://gerrit.wikimedia.org/r/904226

Change 917373 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] admin_ng/flink-operator - set default rbac.create: false

https://gerrit.wikimedia.org/r/917373

Change 917373 merged by Ottomata:

[operations/deployment-charts@master] admin_ng/flink-operator - set default rbac.create: false

https://gerrit.wikimedia.org/r/917373

Change 919108 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/deployment-charts@master] flink-operator: only deploy it to wikikube@stagings

https://gerrit.wikimedia.org/r/919108

Change 919108 merged by jenkins-bot:

[operations/deployment-charts@master] flink-operator: only deploy it to wikikube@stagings

https://gerrit.wikimedia.org/r/919108

Based on a convo with @akosiaris, we need to undeploy flink-operator in staging-codfw, as well as the mw-page-content-change-enrich namespace there.

Change 922138 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] Undeploy flink-operator and uncreate service namespace in staging-codfw

https://gerrit.wikimedia.org/r/922138

Working a bit on the flink-kubernetes-operator dashboard; I think there might be a small bug in the operator lifecycle metrics.

Change 922874 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] flink-operator - deploy in wikikube eqiad and codfw

https://gerrit.wikimedia.org/r/922874

Change 922138 abandoned by Ottomata:

[operations/deployment-charts@master] Undeploy flink-operator and uncreate service namespace in staging-codfw

Reason:

comment from https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/922874/ "operator (not the apps) should be/stay deployed in staging-codfw as well"

https://gerrit.wikimedia.org/r/922138

Grafana Dash looking good.

Apart from the managed flink clusters in staging-eqiad being empty I agree. :) The way the operator exposes metrics does not seem ideal, though (needing all the label_replace because they don't just provide the deployment as a label). It might make sense to craft a generic relabel rule in prometheus to fix that.
What would be the states we would alert on? From the dashboard I would assume something like cluster lifecycle state != STABLE for X min and deployment status != READY for X min?

Apart from the managed flink clusters in staging-eqiad being empty I agree

Ah, the value was 0 (?) so it wasn't being 'colored'. Fixed it to show the number per namespace and to always display it.

What would be the states we would alert on?

Probably lifecycle state != STABLE would be enough. I think if the lifecycle is STABLE then the deployment status is probably READY. But yeah, that sounds right.
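Something like the rule sketch below is what I have in mind; the metric name and labels are placeholders, since the exact series names depend on how the operator metrics end up in Prometheus after relabeling:

```yaml
# Alerting-rule sketch only; flink_deployment_lifecycle_state is a placeholder
# metric name, to be replaced with whatever the operator actually exposes.
groups:
  - name: flink-operator
    rules:
      - alert: FlinkDeploymentNotStable
        expr: flink_deployment_lifecycle_state{state="STABLE"} == 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Flink deployment {{ $labels.deployment }} has not been STABLE for 15 minutes"
```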

Change 922874 merged by jenkins-bot:

[operations/deployment-charts@master] flink-operator - deploy in wikikube eqiad and codfw

https://gerrit.wikimedia.org/r/922874

Deployed in all wikikube clusters.

We'll have to re-enable operator egress to Zookeeper when we figure out what to do about HA. https://phabricator.wikimedia.org/T331283#8874029
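When we do, I expect it will mostly be adding the Zookeeper hosts back to the operator's egress values, roughly in the shape below (I'm assuming the usual dst_nets-style egress config here; the real key layout, CIDRs, and ports have to come from the chart and the actual Zookeeper hosts):

```yaml
# Assumed shape of the egress values; the CIDR and port are placeholders and
# the key layout is an assumption, not taken from the actual chart.
networkpolicy:
  egress:
    enabled: true
    dst_nets:
      - cidr: 192.0.2.10/32      # placeholder Zookeeper host
        ports:
          - protocol: tcp
            port: 2181
```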