
Flink metrics should use a consistent label for each job
Closed, Resolved · Public

Description

Search would like to deploy multiple flink-apps per k8s namespace using helm releases.

However, this will complicate dashboarding and monitoring.

Only the job-related metrics have the Flink job_name label in them. Some important task-related ones do not.

Example: https://grafana.wikimedia.org/goto/he98JcQVk?orgId=1

In the table of results at the bottom, you can see that neither flink_jobmanager_numRegisteredTaskManagers nor flink_taskmanager_Status_JVM_Memory_Heap_Used has the job_name label.

In the dashboard I've been working on, I've been using kubernetes_namespace to select the job. If we deploy multiple jobs per namespace, we'll need to use something else, and we can't use job_name.

All metrics will have the helm release label in them. We could use that.

If we are going to do this, we need to adjust dashboards to use release as the canonical 'job' name. To do this, we should adopt a convention for all flink-app deployments, and ensure that helm release matches job_name.

Alternatively, perhaps it is possible to configure all scoped flink metrics to include job_name?

Done is

  • Adjust the Flink Dashboard to be able to select on release and job_name, and use them in queries where appropriate.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Ah, I wanted to make job_name == release name the default behavior in the flink-app chart, but that's not possible, as job_name is set by the application config or code, not by flink.

Hm, there might be a way to configure Flink to include the job_name in all metric scopes (https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#system-scope)? Investigating...

Ottomata renamed this task from "Flink-app helmfile deployments should ensure release and Flink job_name are equivalent" to "Flink metrics should use a consistent label for each job". · May 25 2023, 4:07 PM
Ottomata updated the task description.

Nope, I don't think that works. Makes sense too. It is possible for a Flink JobManager (even in app deployment mode) to run multiple jobs, so we can't get per-job stats on e.g. JobManager memory usage.

I think making release name == job_name is going to be the way to go.

Okay, after working on the dashboard a bit, I think I'm going to resolve this task.

We have release, which does uniquely identify a FlinkDeployment. I've made release and job_name variables on the dashboard, and filtered on them in appropriate queries. It'd be nice if release == job_name, but that isn't going to be enforceable everywhere. We can decide if we want to do that for enrichment apps specifically; otherwise, the default of 'main' isn't so bad.
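
For anyone adapting other dashboards, here is a rough sketch of the filtering idea, not the actual dashboard code. It assumes the Prometheus label names are literally release and job_name, that $release and $job_name are the Grafana variable names, and that flink_jobmanager_job_uptime is a job-scoped metric carrying job_name. The helper build_selector is hypothetical.

```python
from typing import Optional


def build_selector(metric: str, release: str, job_name: Optional[str] = None) -> str:
    """Build a PromQL selector that always filters on release.

    job_name is added only when given, since (per this task) only
    job-scoped metrics carry that label. Label names are assumptions.
    """
    labels = {"release": release}
    if job_name is not None:
        labels["job_name"] = job_name
    body = ", ".join(f'{key}="{value}"' for key, value in labels.items())
    return f"{metric}{{{body}}}"


# JobManager/TaskManager-level metric: no job_name label, so filter on release only.
jm_query = build_selector("flink_jobmanager_numRegisteredTaskManagers", "$release")

# Job-scoped metric (metric name assumed for illustration): filter on both.
job_query = build_selector("flink_jobmanager_job_uptime", "$release", "$job_name")

print(jm_query)
print(job_query)
```

Filtering on release everywhere keeps a query scoped to one FlinkDeployment even when job_name is left at the default of 'main' in several releases.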

Ottomata claimed this task.