
[SPIKE] Deploy event driven stateless Flink service to DSE cluster
Closed, Resolved · Public

Description

NOTE: The deployed service is experimental and not considered production-ready
User Story
As a platform engineer, I need to experiment with deploying a stateless Flink application to the new DSE cluster
Why?
  • While we don’t currently plan to use the DSE cluster as our production environment, we would still like to experiment with deploying Flink apps in an environment that resembles production, so we can better understand the process, discover pain points, and find ways to make deployments easier
Done is:
  • Service is deployed to the DSE cluster and runs successfully for a few hours/days
  • Service is turned off
  • Steps to deploy are documented on our wiki pages so we can review them for pain points, ways to simplify, etc.
Dependencies tracking

Event Timeline

Here's a summary of discussions I had with folks currently involved with Flink and k8s.

I don't have access to DSE yet, but I managed to iterate a bit locally (minikube) with the flink-session-cluster chart and flink-rdf-streaming-updater docker image.

A prerequisite for moving forward with DSE will be updating the base docker image to 1.15. Spinning up a flink cluster locally was covered in https://www.mediawiki.org/wiki/Platform_Engineering_Team/Event_Platform_Value_Stream/Assess_what_is_required_for_the_enrichment_pipeline_to_run_on_k8s. I resumed from there.

With the current flink-session-cluster setup on wikikube, job submission and savepoint management logic is delegated to a Python utility available on deploy hosts (i.e. not managed by deployment-charts). For local (and likely DSE) experiments, once a cluster is up and running, we can submit jobs to it via the Flink CLI or its web interface.

If possible, however, I'd be keen on managing services within deployment-charts. For stateless applications we might look at simpler lifecycles. I iterated a bit on this idea, prototyping:

  1. A chart for a service that submits a job to a session cluster. The pod itself simply acts as a heartbeat listener (e.g. the same thing we do on stat machines within a tmux session); a rough sketch follows this list.
  2. A chart with an application-cluster deployment that bundles the stateless service jar as well as Flink itself.
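
To make option 1 concrete, here's a rough sketch of what such a chart might render. Everything here is hypothetical: the service name, the submitter image, the JobManager address, and the jar path; the submit-then-idle command stands in for whatever heartbeat mechanism we'd settle on.

```yaml
# Hypothetical manifest for option 1: a pod that submits a bundled jar to an
# existing session cluster, then idles so its liveness acts as a heartbeat.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stream-enrichment-submitter   # hypothetical service name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stream-enrichment-submitter
  template:
    metadata:
      labels:
        app: stream-enrichment-submitter
    spec:
      containers:
        - name: submitter
          # Hypothetical image bundling the Flink CLI and the job jar.
          image: docker-registry.wikimedia.org/flink-job-submitter:latest
          command: ["/bin/sh", "-c"]
          args:
            # Submit to the session cluster's JobManager, then stay alive
            # so the pod's liveness doubles as a submission heartbeat.
            - >-
              flink run -m flink-session-cluster-jobmanager:8081
              /opt/flink/usrlib/stream-enrichment.jar
              && exec sleep infinity
```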

Resource footprint for either deployment seems reasonable. A concern is that taking this path would lead to re-implementing a UX very similar to the k8s operator's (see https://phabricator.wikimedia.org/T315428). Both approaches require setting up docker images for job submission, a capability that would be orthogonal to either flink-session-cluster or the k8s operator. In fact, I ended up reusing docker images I put together when working on T315428.

I also want to acknowledge potential concerns about docker image / chart / job proliferation. This aspect, too, seems orthogonal to Flink's deployment strategy.

A third deployment option that was discussed is https://volcano.sh/en/docs/flink_on_volcano/. The likelihood of this being adopted on wikikube (our target prod env) is understandably low, so I did not pursue it further.

[...]

  1. A chart for a service that submits a job to a session cluster. The pod itself simply acts as a heartbeat listener (e.g. the same thing we do on stat machines within a tmux session).
  2. A chart with an application-cluster deployment that bundles the stateless service jar as well as Flink itself.

@JMeybohm would either of these approaches (if we manage to get them to work) help mitigate concerns about being able to observe which jobs are running inside a session cluster? Was option 1 something you already considered?

FWIW the Flink k8s operator supports this deployment model natively. It should let us:

  1. configure a session cluster
  2. configure job submission

Some docs: https://github.com/apache/flink-kubernetes-operator/tree/main/examples#adding-jobs
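
For illustration, the linked examples model this with two custom resources: a FlinkDeployment for the session cluster and a FlinkSessionJob that targets it. The sketch below is adapted from those examples; the names, image, and jarURI are placeholders rather than anything we've agreed on.

```yaml
# Session cluster managed by the operator.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: session-cluster   # placeholder name
spec:
  image: flink:1.16
  flinkVersion: v1_16
  serviceAccount: flink
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
---
# Job submitted to the session cluster above.
apiVersion: flink.apache.org/v1beta1
kind: FlinkSessionJob
metadata:
  name: stream-enrichment-poc   # placeholder name
spec:
  deploymentName: session-cluster   # must match the FlinkDeployment name
  job:
    # Placeholder artifact location; the operator fetches the jar for
    # session jobs from a remote URI.
    jarURI: https://example.org/artifacts/stream-enrichment.jar
    parallelism: 2
    upgradeMode: stateless
```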

IIRC we ditched the application-cluster deployment because of missing HA capabilities. We might have considered option 1 as well, but found that it was not easy to "get right" in an automated way (without adding quite some complexity), especially in cases where a job has to resume from a tombstone/savepoint. But of course, as long as the job artifact (the jar file) is bundled in a container that is deployed via helm charts, it is easy to recreate that exact state in the cluster (without knowing anything about Flink).

Now that a namespace has been created, I'm picking this back up and wanted to sync.

@bking did you maybe have a chance to experiment with deploying the k8s operator? That'd be my preferred path forward, if suitable for us.

@BTullis @elukey as a next step I'd like to spin up a session cluster and deploy a job on it from a third-party host (e.g. a deployment box). It's not ideal, but I'd like a point of departure before adding too many "experimental" charts to deployment-charts. How does this sound? I'd like the job to be based atop Flink 1.16 (we have a hard constraint of 1.15 as a lower bound). Would it be ok to publish a one-off docker image that's a version bump of flink-rdf-streaming-updater? FYI: @Ottomata is looking at providing a more generic image, but it's not ready yet.

Hi @gmodena - I believe that since the flink-rdf-streaming-updater service uses the Deployment Pipeline, the decision about whether or not you can publish a one-off docker image based on that ultimately lies with the owners of that repository. I don't have any strong feelings against it, but care is needed. Personally, I would seriously consider trying to rationalize the names of things first and de-couple this requirement from the rdf-streaming-updater somehow.

The naming of things is very confusing right now because the chart in use is called flink-session-cluster but the service that is deployed is called rdf-streaming-updater.

Within the helmfile.d/services/rdf-streaming-updater/helmfile.yaml file, it references chart: wmf-stable/flink-session-cluster (link)

This chart has a default image of wikimedia/wikidata-query-flink-rdf-streaming-updater but both the image and the tag can be overridden in the values.yaml file of the helmfile service.

So if you do decide to go this way, you might be able to use a new and experimental service, which references the existing flink-session-cluster chart and overrides the container image with a new version of the wikimedia/wikidata-query-flink-rdf-streaming-updater image that contains flink version 1.16.
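
For example, assuming the chart follows the usual deployment-charts scaffolding for image coordinates (the main_app keys are my assumption, and the tag below is a placeholder, not a published image), the experimental service's values could override just the image:

```yaml
# Hypothetical values.yaml snippet for a new experimental service reusing the
# flink-session-cluster chart. The version tag is a placeholder for a
# Flink 1.16 build of the image.
docker:
  registry: docker-registry.wikimedia.org
main_app:
  image: wikimedia/wikidata-query-flink-rdf-streaming-updater
  version: flink-1.16-experimental   # placeholder tag
```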

However, it strikes me as a bit messy. Personally, I would probably look at spending more time on something a bit more 'green-field', either with the operator or by working with the Search team to make the flink-session-cluster chart a bit less specific to the rdf-streaming-updater service.

Sorry, this is probably not what you really wanted to hear. :-)

Hey @BTullis

Hi @gmodena - I believe that since the flink-rdf-streaming-updater service uses the Deployment Pipeline, the decision about whether or not you can publish a one-off docker image based on that ultimately lies with the owners of that repository. I don't have any strong feelings against it, but care is needed. Personally, I would seriously consider trying to rationalize the names of things first and de-couple this requirement from the rdf-streaming-updater somehow.

Indeed, the image should be renamed (it's on their TODO list). We also have a task for creating production Flink images, but work on that hasn't started yet.
Thanks for the pointers re templating / values overriding in helmfile. It's a viable approach for experimenting.

However,

However, it strikes me as a bit messy. Personally, I would probably look at spending more time on something a bit more 'green-field', either with the operator or by working with the Search team to make the flink-session-cluster chart a bit less specific to the rdf-streaming-updater service.

I guess we need to figure out some sequencing and dependencies. I was hoping we could target DSE with experimental / learning code, but I'm not keen on messing around too much on the master branch of deployment-charts. If the k8s operator turns out to be a viable path, that could simplify refactoring efforts. I'll discuss options with the team.

I was hoping we could target DSE with experimental / learning code, but I'm not keen on messing around too much on the master branch of deployment-charts

I understand your concern around this point, and in fact I face the same issue myself. I'm trying to do an experimental deploy of the spark-operator to the dse-k8s cluster, but it's difficult to test something that I know to be incomplete when it shares a repository and a configuration mechanism with production workloads. The approach I'm taking is to ensure that a) the code is sufficiently isolated from the production systems and b) the workload is safe, as far as I can realistically do so. But I'm still at the point of having to seek reviews for a service/component that is at an early stage of its development cycle, which is a bit awkward.

gmodena updated the task description.

Just a heads-up as I'm re-engaging. I plan to use the stream-enrichment-poc namespace within the next couple of weeks. Ping me here or on the Slack thread if this is going to be a problem and/or if you have any advice on this.

@bking something I noticed about the upstream helm charts that I think we'd want to change is the default log4j configuration for the FlinkDeployments. I see no reason to write to the /opt/flink/log directory, and I am intentionally not creating it in the flink docker image. IIUC, the ConsoleAppender is all we need for logs to be captured and made available to k8s. Also, as @dcausse noted here, we want an ECS logging format, so I _think_ we'll want to use the ConsoleAppender with the ECS format. This should be the defaultConfiguration in our flink-operator helm chart.
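
Sketching what that might look like in the operator chart's values, assuming the log4j2 EcsLayout from the ecs-logging-java library is on the Flink image's classpath (the serviceName is a placeholder):

```yaml
# Sketch of a console-only, ECS-formatted defaultConfiguration. Assumes the
# ecs-logging-java log4j2 layout jar is available on the Flink classpath.
defaultConfiguration:
  create: true
  append: false   # replace the upstream logging defaults instead of appending
  log4j-console.properties: |+
    rootLogger.level = INFO
    rootLogger.appenderRef.console.ref = ConsoleAppender
    appender.console.name = ConsoleAppender
    appender.console.type = CONSOLE
    appender.console.layout.type = EcsLayout
    # Placeholder service name for the ECS "service.name" field.
    appender.console.layout.serviceName = flink-app
```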

Ottomata updated the task description.