Flink metrics should use a consistent label for each job
Closed, Resolved · Public

Assigned To

Authored By

	Ottomata
	May 25 2023, 3:54 PM

Description

Search would like to deploy multiple flink-apps per k8s namespace using helm releases.

However, this will complicate dashboarding and monitoring.

Only the job-related metrics have the Flink job_name label in them. Some important task-related ones do not.

Example: https://grafana.wikimedia.org/goto/he98JcQVk?orgId=1

In the table of results at the bottom, you can see that neither flink_jobmanager_numRegisteredTaskManagers nor flink_taskmanager_Status_JVM_Memory_Heap_Used have the job_name label.

In the dashboard I've been working on, I've been using kubernetes_namespace to select the job. If we deploy multiple jobs per namespace, we'll need to use something else, and we can't use job_name.

All metrics will have the helm release label in them. We could use that.

If we are going to do this, we need to adjust dashboards to use release as the canonical 'job' name. To do this, we should adopt a convention for all flink-app deployments, and ensure that helm release matches job_name.
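A minimal sketch of how the convention could be checked, assuming we can get a mapping of helm release to Flink job_name for a namespace. The function name and the example release/job names are hypothetical, not part of the flink-app chart.

```python
def find_mismatched_releases(deployments):
    """Return (release, job_name) pairs where the two names differ.

    `deployments` is a hypothetical mapping of helm release -> Flink job_name.
    """
    return [
        (release, job_name)
        for release, job_name in sorted(deployments.items())
        if release != job_name
    ]


# Hypothetical example data for one k8s namespace.
deployments = {
    "release_a": "release_a",
    "release_b": "other_job_name",
}

for release, job_name in find_mismatched_releases(deployments):
    print(f"release {release!r} does not match job_name {job_name!r}")
```

An empty result would mean every deployment in the mapping follows the convention.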

~~Alternatively, perhaps it is possible to configure all scoped flink metrics to include job_name?~~

Done is

Adjust the Flink dashboard so that release and job_name can be selected and used in queries where appropriate.

Related Objects

Status	Assigned	Task
Resolved	gmodena	T307959 [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content
Resolved	Ottomata	T325303 Deploy mediawiki-page-content-change-enrichment to wikikube k8s
Resolved	Ottomata	T325305 Deploy mediawiki-event-enrichment flink app to DSE k8s
Resolved	Ottomata	T328925 Flink App Deployment monitoring
Resolved	Ottomata	T337496 Flink metrics should use a consistent label for each job

Event Timeline

Ottomata created this task. May 25 2023, 3:54 PM

Restricted Application added a project: Data-Engineering. May 25 2023, 3:54 PM

Restricted Application added a subscriber: Aklapper.

Ottomata added a parent task: T328925: Flink App Deployment monitoring. May 25 2023, 3:54 PM

Ottomata updated the task description.

Ah, I wanted to make job_name == release name the default behavior in the flink-app chart, but that's not possible, as job_name is set by the application config or code, not by flink.

Hm, there might be a way to configure flink to include the job_name in [[ https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#system-scope | all metric scopes ]]? Investigating...

Ottomata renamed this task from "Flink-app helmfile deployments should ensure release and Flink job_name are equivalent" to "Flink metrics should use a consistent label for each job". May 25 2023, 4:07 PM

Ottomata updated the task description.

Nope, I don't think that works. Makes sense too. It is possible for a Flink JobManager (even in app deployment mode) to run multiple jobs. So we can't get per job stats on e.g. JobManager memory usage.

I think making release name == job_name is going to be the way to go.

Ottomata updated the task description. May 25 2023, 7:26 PM

Okay, after working on the dashboard a bit, I think I'm going to resolve this task.

We have release which does uniquely identify a FlinkDeployment. I've made release and job_name variables on the dashboard, and filtered on them in appropriate queries. It'd be nice if release == job_name, but that isn't going to be enforceable everywhere. We can decide if we want to do that for enrichment apps specifically, otherwise, the default of 'main' isn't so bad.
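As a rough illustration of that filtering, here is a sketch that builds a Prometheus selector which always filters on release and only adds job_name when a value is passed. The helper name, the example release and job names, and `some_job_scoped_metric` are hypothetical, and label values are not escaped.

```python
def build_selector(metric, release, job_name=None):
    """Build a PromQL selector filtered on release, plus job_name if given.

    Assumes label values contain no characters that need escaping.
    """
    labels = {"release": release}
    if job_name is not None:
        # Only job-scoped metrics carry the job_name label.
        labels["job_name"] = job_name
    body = ",".join(f'{key}="{value}"' for key, value in labels.items())
    return f"{metric}{{{body}}}"


# Task/JVM level metric: no job_name label, so select by release only.
print(build_selector("flink_taskmanager_Status_JVM_Memory_Heap_Used", "my_release"))

# Hypothetical job-scoped metric: filter on both labels.
print(build_selector("some_job_scoped_metric", "my_release", job_name="my_job"))
```

Keeping job_name optional means the same release variable can drive every panel, with job_name applied only where the metric actually has that label.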

Ottomata closed this task as Resolved. May 25 2023, 7:28 PM

Ottomata claimed this task.
