Flink application and flink-kubernetes-operator production docker images
Closed, Resolved · Public

Assigned To

Authored By

	dcausse
	Aug 29 2022, 12:31 PM

Description

As an event-stream developer I want to have access to a base flink image to use with k8s.

https://gerrit.wikimedia.org/g/wikidata/query/flink-rdf-streaming-updater was created with "Application Mode" deployment in mind but we then switched to a "Session Cluster" deployment approach, this image no longer references any job specific information and thus is appropriate to use as a reusable image.

We should probably rename it and/or move it to a place where it is more obvious that it can be re-used across different projects.

Details

Subject	Repo	Branch	Lines +/-
flink-app-example - set upgradeMode: stateless	operations/deployment-charts	master	+1 -0
flink 1.16.0-wmf3	operations/docker-images/production-images	master	+41 -16
flink - Add examples/wikimedia with simple table datagen -> print pipeline	operations/docker-images/production-images	master	+57 -0
flink - include examples in image	operations/docker-images/production-images	master	+8 -0
flink-kubernetes-operator - add -Dmaven.antrun.skip=true to mvn package	operations/docker-images/production-images	master	+4 -1
flink-kubernetes-operator - fix command that sets MVN_HTTP(S)_PROXY_OPTION	operations/docker-images/production-images	master	+5 -5
flink-kubernetes-operator - use explicit mvn proxy settings instead of java.net.useSystemProxies	operations/docker-images/production-images	master	+7 -2
Add flink to profile::docker::builder::known_uid_mappings	operations/puppet	production	+1 -0
Update flink-kubernetes-operator chart with upstream changes for 1.3.0	operations/deployment-charts	master	+202 -35
flink and flink-kubernetes-operator image	operations/docker-images/production-images	master	+377 -0


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		• lbowmaker	T324578 [EPIC] Flink Applications on Kubernetes
		Resolved		Ottomata	T316519 Flink application and flink-kubernetes-operator production docker images

Event Timeline

dcausse created this task. Aug 29 2022, 12:31 PM

Restricted Application added a project: Data-Engineering. Aug 29 2022, 12:31 PM

Restricted Application added a subscriber: Aklapper.

• EChetty edited projects, added Data-Engineering-Planning; removed Data-Engineering. Sep 6 2022, 10:40 AM

• EChetty moved this task from Backlog to Event Platform on the Data-Engineering-Planning board. Sep 6 2022, 10:45 AM

gmodena mentioned this in T320812: [SPIKE] Deploy event driven stateless Flink service to DSE cluster. Nov 16 2022, 1:17 PM

Ottomata claimed this task. Nov 17 2022, 2:10 PM

BTullis subscribed. Nov 17 2022, 2:26 PM

Change 858356 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/docker-images/production-images@master] WIP flink image

https://gerrit.wikimedia.org/r/858356

gerritbot added a project: Patch-For-Review. Nov 17 2022, 10:37 PM

Ottomata moved this task from Backlog to Sprint 05 on the Event-Platform board. Nov 28 2022, 2:04 PM

Ottomata edited projects, added Event-Platform (Sprint 05); removed Event-Platform.

Ottomata moved this task from Next Up to In Progress on the Event-Platform (Sprint 05) board.

Writing down some ideas and thoughts from today's talk with @gmodena:

- We will only support Application Mode deployments, no session clusters.
- We will only support running in k8s, not regular docker / docker-compose. This lets us keep the flink image entrypoint simpler. The upstream entrypoint mangles the flink-conf.yaml file with some defaults and with settings from the FLINK_PROPERTIES env var; we would prefer not to mess with this file in the image directly, and instead provide it via the usual k8s configmap and helm templates.
- Application images that are actually deployed will be built FROM this base Flink image using the Deployment Pipeline. Since we are only going to support Application Mode, the base Flink image will not be useful on its own.
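To illustrate the configmap approach from the notes above, here is a minimal sketch of rendering flink-conf.yaml into a k8s ConfigMap manifest instead of rewriting the file in the image entrypoint. The function names and the ConfigMap name are hypothetical, and in practice a helm template would do this rendering; the Python below only shows the shape of the result.

```python
import json


def render_flink_conf(settings: dict) -> str:
    """Render settings as flat `key: value` lines, as used in flink-conf.yaml."""
    return "".join(f"{key}: {value}\n" for key, value in sorted(settings.items()))


def flink_configmap(name: str, settings: dict) -> dict:
    """Build a ConfigMap manifest holding flink-conf.yaml (hypothetical helper)."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        "data": {"flink-conf.yaml": render_flink_conf(settings)},
    }


if __name__ == "__main__":
    # Example values only; the ConfigMap name is an assumption.
    manifest = flink_configmap(
        "flink-app-config",
        {"taskmanager.numberOfTaskSlots": 2, "parallelism.default": 1},
    )
    # JSON is valid YAML, so this output can be applied as a manifest.
    print(json.dumps(manifest, indent=2))
```

The image then only needs to read the mounted file, which keeps the entrypoint free of config-mangling logic.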

Ottomata added subscribers: bking, JMeybohm. Nov 30 2022, 2:34 PM

Status update!

flink and flink-kubernetes-operator images are ready for review.

I've made some changes to upstream's Dockerfiles and entrypoints for these. Notably:

The flink-kubernetes-operator webhook is not supported. IIUC, we don't use a webhook like this in production, but instead use our own mechanism to provide TLS stuff?
- Because of this, we will need to either always set webhook.create: false in our flink-operator helm values if we are using the upstream helm chart, OR, if/when we have our own version, just remove all the webhook bits, cc @bking for T321491 (should we make a specific task for the flink helm bits?)

Added ECS logging dependencies.
- We'll want to enable these in our flink operator helm chart's log4j.properties files as noted here.

Removed unneeded docker-entrypoint.sh logic for the flink image. If we are only supporting running in k8s, upstream's flink-docker docker-entrypoint.sh is not useful.

There are still a few TODOs from me in the code, mostly around figuring out exactly what flink plugins and other dependencies to include in this default image.

Ottomata added a parent task: T321491: Evaluate Flink Operator on DSE Kubernetes Cluster for deployment and management of stateful search applications. Dec 5 2022, 9:38 PM

In T316519#8444547, @Ottomata wrote:

flink-kubernetes-operator webhook is not supported. IIUC, we don't use a webhook like this in production, but instead use our own mechanism to provide TLS stuff?

Because of this, we will need to either always set webhook.create: false in our flink-operator helm values if we are using the upstream helm chart, OR, if/when we have our own version, just remove all the webhook bits, cc @bking for T321491 (should we make a specific task for the flink helm bits?)

I might be missing bits, but I don't think the webhook has something to do with TLS (apart from the fact that it needs to present a certificate that the apiserver can trust). Webhooks are usually there for additional validation (or mutation) of K8s objects created by computers or humans. I'm not really sure what it does in the case of flink, but it gets registered for flinkdeployments and flinksessionjobs (https://github.com/apache/flink-kubernetes-operator/blob/main/helm/flink-kubernetes-operator/templates/webhook.yaml). I'm not sure whether the cert-manager Certificate/Issuer stuff in that file just works in clusters with cert-manager + cfssl-issuer enabled, but it absolutely might. The cert-manager.io/inject-ca-from annotation on the webhooks tells the cert-manager cainjector to populate the CA from the given secret as caBundle in the *WebhookConfiguration, which is where the Kubernetes API loads it from to verify the connection to the webhook.
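For readers unfamiliar with admission webhooks, the validation step described above can be sketched as a function that takes the AdmissionReview the apiserver sends and returns an AdmissionReview response. This is not the flink-kubernetes-operator's actual logic; the rule checked here (a FlinkDeployment must have spec.job) is a hypothetical example.

```python
def validate(admission_review: dict) -> dict:
    """Return an AdmissionReview response allowing or denying the request."""
    request = admission_review["request"]
    obj = request.get("object") or {}
    errors = []

    # Hypothetical rule, for illustration only: require spec.job on
    # FlinkDeployment objects.
    if request.get("kind", {}).get("kind") == "FlinkDeployment":
        if "job" not in obj.get("spec", {}):
            errors.append("spec.job is required")

    # The response must echo the request uid so the apiserver can match it.
    response = {"uid": request["uid"], "allowed": not errors}
    if errors:
        response["status"] = {"message": "; ".join(errors)}

    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

The apiserver calls the webhook over HTTPS, which is the only place TLS comes in: it verifies the webhook's certificate against the caBundle in the webhook configuration.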

I don't think the webhook has something to do with TLS

Ah okay, I am super green here and don't have much experience writing helm outside of doing so for services in our deployment-charts. Ben explained a bunch in IRC to me too. He said I could paste his messages here.

@BTullis wrote:

Re the webhook, longer term I'm definitely in favour of using it for the spark-operator. It's optional, but offers some functionality that would be really useful. Namely things like https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#requesting-gpu-resources and https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#mounting-a-configmap-storing-hadoop-configuration-files

However, I'm removing it for the moment because of the way support for it was implemented in the helm chart by the upstream project. The best explanation I've written for why is in this commit: https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/864770

In basic terms, it creates a keypair and an auth token in this hacky script: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/hack/gencerts.sh
It sends these to the K8S API. I was going to try to get this working while saying "I'll probably improve this later..." but in the end I decided that it would be better to switch it off for now and use the puppet secret mechanism and cert-manager or whatever for doing the TLS, when I get around to it.

For the flink webhook, it's a bit less clear to me what the benefits of it would actually be. We can see from the helm chart a little bit about what the functionality would be: https://github.com/apache/flink-kubernetes-operator/blob/main/helm/flink-kubernetes-operator/templates/webhook.yaml#L94-L105
So it can validate any create or update operation on flinkdeployments and flinksessionjobs. As I understand it, this is like an extra layer of access control, determining whether or not this object can be created or updated.
I suspect that in this case we can do without this additional level of access control, given that we're in a pretty well controlled environment.
It can also *mutate* any create operation on a flinksessionjob: https://github.com/apache/flink-kubernetes-operator/blob/main/helm/flink-kubernetes-operator/templates/webhook.yaml#L94-L105
It's not clear to me from the docs what this mutation might do, but like the spark operator it's something about adding annotations. Here's a PR showing that it can be used to add labels to sessionjobs: https://github.com/apache/flink-kubernetes-operator/pull/265

So in short, I think that you can probably get away without it for now :-)
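As a rough illustration of the mutation case mentioned above (adding labels to objects on create), a mutating webhook answers with a base64-encoded JSON Patch. The sketch below is not the operator's implementation; the function name and the label used in the example are hypothetical.

```python
import base64
import json


def add_label(admission_review: dict, key: str, value: str) -> dict:
    """Return an AdmissionReview response that patches a label onto the object."""
    request = admission_review["request"]
    metadata = (request.get("object") or {}).get("metadata", {})

    if metadata.get("labels") is None:
        # No labels map yet: add the whole map in one operation.
        patch = [{"op": "add", "path": "/metadata/labels", "value": {key: value}}]
    else:
        # JSON Pointer escaping: "~" becomes "~0" and "/" becomes "~1".
        escaped = key.replace("~", "~0").replace("/", "~1")
        patch = [{"op": "add", "path": f"/metadata/labels/{escaped}", "value": value}]

    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": request["uid"],
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }
```

Without the webhook, the same labels would have to be set in the helm templates or by whoever creates the object.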

JArguello-WMF moved this task from In Progress to In Review on the Event-Platform (Sprint 05) board. Dec 6 2022, 2:08 PM

Ottomata renamed this task from Create a shared flink docker image to Flink application and flink-kubernetes-operator production docker images. Dec 6 2022, 3:24 PM

Ottomata mentioned this in T324578: [EPIC] Flink Applications on Kubernetes. Dec 6 2022, 3:33 PM

Ottomata added a parent task: T324578: [EPIC] Flink Applications on Kubernetes.

Ottomata removed a parent task: T321491: Evaluate Flink Operator on DSE Kubernetes Cluster for deployment and management of stateful search applications.

dcausse mentioned this in T326318: Create docker images for the cirrus-streaming-updater flink jobs. Jan 5 2023, 2:29 PM

Change 876249 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] Update flink-kubernetes-operator chart with upstream changes for 1.3.0

https://gerrit.wikimedia.org/r/876249

JArguello-WMF moved this task from Sprint 05 to Sprint 07 on the Event-Platform board. Jan 9 2023, 2:27 PM

JArguello-WMF edited projects, added Event-Platform (Sprint 07); removed Event-Platform (Sprint 05).

JArguello-WMF moved this task from Next Up to In Review on the Event-Platform (Sprint 07) board. Jan 9 2023, 2:30 PM

Change 858356 merged by Ottomata:

[operations/docker-images/production-images@master] flink and flink-kubernetes-operator image

https://gerrit.wikimedia.org/r/858356

Change 876249 merged by jenkins-bot:

[operations/deployment-charts@master] Update flink-kubernetes-operator chart with upstream changes for 1.3.0

https://gerrit.wikimedia.org/r/876249

Change 877193 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Add flink to profile::docker::builder::known_uid_mappings

https://gerrit.wikimedia.org/r/877193

Change 877193 merged by Ottomata:

[operations/puppet@production] Add flink to profile::docker::builder::known_uid_mappings

https://gerrit.wikimedia.org/r/877193

Change 877230 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/docker-images/production-images@master] flink-kubernetes-operator - use explicit mvn proxy settings instead of java.net.useSystemProxies

https://gerrit.wikimedia.org/r/877230

Change 877230 merged by Ottomata:

[operations/docker-images/production-images@master] flink-kubernetes-operator - use explicit mvn proxy settings instead of java.net.useSystemProxies

https://gerrit.wikimedia.org/r/877230

Change 877237 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/docker-images/production-images@master] flink-kubernetes-operator - fix command that sets MVN_HTTP(S)_PROXY_OPTION

https://gerrit.wikimedia.org/r/877237

Change 877237 merged by Ottomata:

[operations/docker-images/production-images@master] flink-kubernetes-operator - fix command that sets MVN_HTTP(S)_PROXY_OPTION

https://gerrit.wikimedia.org/r/877237

Change 877241 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/docker-images/production-images@master] flink-kubernetes-operator - add -Dmaven.antrun.skip=true to mvn package

https://gerrit.wikimedia.org/r/877241

Change 877241 merged by Ottomata:

[operations/docker-images/production-images@master] flink-kubernetes-operator - add -Dmaven.antrun.skip=true to mvn package

https://gerrit.wikimedia.org/r/877241

Change 878178 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/docker-images/production-images@master] flink - include examples in image

https://gerrit.wikimedia.org/r/878178

Change 878178 merged by Ottomata:

[operations/docker-images/production-images@master] flink - include examples in image

https://gerrit.wikimedia.org/r/878178

Ottomata mentioned this in T326731: Deployment pipeline docker image of flink mediawiki stream enrichment pyhon. Jan 11 2023, 1:45 PM

Ottomata moved this task from In Review to Done on the Event-Platform (Sprint 07) board. Jan 11 2023, 1:59 PM

Change 879050 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/docker-images/production-images@master] flink - Add examples/wikimedia with simple table datagen -> print pipeline

https://gerrit.wikimedia.org/r/879050

Change 879050 merged by Ottomata:

[operations/docker-images/production-images@master] flink - Add examples/wikimedia with simple table datagen -> print pipeline

https://gerrit.wikimedia.org/r/879050

Hm, am confused by a production-images vs blubber user thing.

In operations/docker-images/production-images, we have known_uid_mappings (also in puppet), which I assumed would be the run user for the container in prod.

However, blubber seems to use 'somebody' as the build user and file owner, and 'runuser' as the USER the container runs processes as.

Should we change this? Should we set the runs.as to something different when building images based off of the production-images flink image with blubber?

Change 881011 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/docker-images/production-images@master] flink 1.16.0-wmf3

https://gerrit.wikimedia.org/r/881011

Change 881011 merged by Ottomata:

[operations/docker-images/production-images@master] flink 1.16.0-wmf3

https://gerrit.wikimedia.org/r/881011

In T316519#8532670, @Ottomata wrote:

Should we change this? Should we set the runs.as to something different when building images based off of the production-images flink image with blubber?

I think this is up to you all, and I don't know enough about flink to say. In general, if the effective runtime user needs access to things that are only user or group readable or writable by a different user that's already provided by the base image, that would be a case where overwriting runs.as, runs.uid and runs.gid would make sense. If that's not the case, I would just go with the default behavior which is the most restrictive in terms of effective runtime permissions of files/directories within the container.
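The rule of thumb above comes down to an ordinary POSIX permission check: does the effective runtime user fall into a class (owner, group, other) whose mode bits grant the access it needs? A minimal sketch follows; the uid/gid values in the usage example are hypothetical, and the check ignores ACLs and capabilities.

```python
import stat


def can_access(mode, owner_uid, owner_gid, uid, gids, want_write=False):
    """Check read (or write) access for uid/gids against a file's mode bits."""
    if uid == 0:
        # Simplification: root bypasses read/write permission checks.
        return True
    if uid == owner_uid:
        read_bit, write_bit = stat.S_IRUSR, stat.S_IWUSR
    elif owner_gid in gids:
        read_bit, write_bit = stat.S_IRGRP, stat.S_IWGRP
    else:
        read_bit, write_bit = stat.S_IROTH, stat.S_IWOTH
    needed = write_bit if want_write else read_bit
    return bool(mode & needed)


if __name__ == "__main__":
    # Hypothetical ids: a base-image user 1000 owns a directory with mode 0755;
    # the runtime user is 900 and is not in group 1000.
    print(can_access(0o755, 1000, 1000, 900, {900}))                   # read
    print(can_access(0o755, 1000, 1000, 900, {900}, want_write=True))  # write
```

If a check like this fails for a path the application must write to at runtime, that is the case where overriding runs.as, runs.uid and runs.gid (or changing ownership or mode in the base image) would be warranted.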

If that's not the case, I would just go with the default behavior which is the most restrictive in terms of effective runtime permissions of files/directories within the container.

@gmodena maybe we should set runs.as to flink for the build stages, but run the production (and test?) stages as the default runuser? I guess the problem is the log/ directory. Hm.

Change 883660 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] flink-app-example - set upgradeMode: stateless

https://gerrit.wikimedia.org/r/883660

Change 883660 merged by Ottomata:

[operations/deployment-charts@master] flink-app-example - set upgradeMode: stateless

https://gerrit.wikimedia.org/r/883660

JArguello-WMF closed this task as Resolved. Jan 27 2023, 8:13 PM

FYI, in order to make pyflink work with this image as well, we changed our installation method to pip install apache-flink, instead of downloading a Flink distro tarball. See T327494: Flink docker image should work with pyflink for more info.

Ottomata mentioned this in T333464: New Service Request: flink-kubernetes-operator. Mar 29 2023, 4:02 PM
