
Flink App Deployment monitoring
Closed, Resolved, Public

Description

We'd like to have automated monitoring and alerts configured for flink-app deployments.

  • Adapt and extend the existing Flink Cluster Grafana dashboard into a shared dashboard for Flink apps, including PyFlink enrichment jobs, that both Search and Event Platform can use. It should include metrics for latency, lag, throughput, memory usage, etc.
  • If needed, add missing metrics to the enrichment Flink apps.

A separate task will cover defining alerts (aliveness, latency, lag, throughput, etc.) for the mediawiki-page-content-change-enrichment job. If we can do this more generically for any enrichment job, we should, but how easy that is remains TBD.

Event Timeline

Starting a WIP grafana dashboard.

It seems Flink Kafka sources emit KafkaConsumer metrics, but Flink Kafka sinks do not emit KafkaProducer metrics? Hm. Tricky.

@bking I think this will be relevant for rdf-streaming-updater and other work, if you all plan to use the newer Kafka Source and Sinks, instead of the old deprecated Kafka connectors.

Oh, I got a response on the flink mailing list about the Kafka producer metrics. It should work?

otto opened https://gitlab.wikimedia.org/repos/data-engineering/eventutilities-python/-/merge_requests/31

Emit event counts, invocation time, and python process memory usage for process function

gmodena merged https://gitlab.wikimedia.org/repos/data-engineering/eventutilities-python/-/merge_requests/31

Emit event counts, invocation time, and python process memory usage for process function
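For illustration only, here is a minimal sketch of what collecting these three kinds of values around a process function could look like. It is not the code from the merge request: the `instrument` helper, the metric names, and the plain dict used as a metric store are all hypothetical, and it assumes a Unix-like system where the standard `resource` module is available.

```python
import resource
import time
from typing import Any, Callable, Dict


def instrument(func: Callable[[Any], Any], metrics: Dict[str, float]) -> Callable[[Any], Any]:
    """Wrap func so every call updates simple metrics in the given dict.

    Hypothetical helper: a real Flink job would register these values
    with Flink's metric system instead of keeping them in a dict.
    """
    metrics.setdefault("events_total", 0)
    metrics.setdefault("errors_total", 0)

    def wrapper(event: Any) -> Any:
        start = time.perf_counter()
        try:
            return func(event)
        except Exception:
            metrics["errors_total"] += 1
            raise
        finally:
            # Count every invocation, successful or not.
            metrics["events_total"] += 1
            # Wall-clock duration of this invocation in milliseconds.
            metrics["last_invocation_ms"] = (time.perf_counter() - start) * 1000.0
            # Peak resident set size of this Python process
            # (kilobytes on Linux, bytes on macOS).
            metrics["max_rss"] = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    return wrapper


metrics: Dict[str, float] = {}
enrich = instrument(lambda event: {**event, "enriched": True}, metrics)
result = enrich({"page_id": 1})
```

After the call, `metrics` holds the event count, the duration of the last invocation, and the peak memory of the Python process, which are the kinds of values a dashboard panel would then plot.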

Change 905295 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] admin_ng/flink-operator - fix prometheus reporting configuration

https://gerrit.wikimedia.org/r/905295

Change 905295 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng/flink-operator - fix prometheus reporting configuration

https://gerrit.wikimedia.org/r/905295

Ottomata renamed this task from Flink Enrichment monitoring to Flink App Deployment monitoring. May 25 2023, 7:28 PM
Ottomata updated the task description.

@dcausse @bking @gmodena @tchin I'm feeling good about this Flink App Dashboard (previously named Flink Otto WIP).

@dcausse I copied your RocksDB stat panels from Flink Cluster DCausse, but I think everything else you had there and in Flink Cluster is covered.

Flink App does not work with the existing flink-session-cluster deployment, but will work with all FlinkDeployments deployed via the flink-app chart.

Please take a look and let me know what you think, and if there is anything you think is missing!

I'd like to make a couple more tweaks, but I'm getting close to calling this task done.

The dashboard looks great!
Thanks for bringing in the RocksDB metrics, but sadly the metric names depend on the app (e.g. flink_taskmanager_job_task_operator_lastSeenRev_rocksdb_estimate_live_data_size), so they won't work well in a generic dashboard. I'll remove them.
For flink-session-cluster it should be no big deal; I hope we can migrate over to the k8s operator soon.

flink_taskmanager_job_task_operator_lastSeenRev_rocksdb_estimate_live_data_size

Oh I see, yeah, that's too bad. I have been able to parameterize stuff in the metric name, but yeah, we'd have to make some standardized operator/metric interface to do that right.

It's also really a shame that the Flink custom metrics interface doesn't let you define labels. I wonder how the Prometheus integration even does it? There is talk of metric 'variables', but they all seem predefined by Flink. Hm.

@dcausse also, the Kafka-specific panels assume the Kafka Source and Sinks, not the older Source/SinkFunctions, so I think not all of the panels there work with rdf-streaming-updater.

Yes, I've seen that. It should be no big deal for us in the short term, but once we address T326914 we should have them :)

flink_taskmanager_job_task_operator_lastSeenRev_rocksdb_estimate_live_data_size

@dcausse, maybe you know this, but TIL for custom metrics, if you add them into a group you create like addGroup(key, value), instead of just addGroup(key), you'll get labels in prometheus with key=value. I dunno where your metric is being created, but you could probably do something like:

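// Assumes this runs inside a RichFunction (e.g. in open()) and that stateName = "lastSeenRev".
// Needs: import org.apache.flink.metrics.Gauge; import org.apache.flink.metrics.MetricGroup;
MetricGroup rocksdbMetricGroup = getRuntimeContext().getMetricGroup().addGroup("rocksdb");
// addGroup(key, value) defines a key/value pair, which should show up as a Prometheus label.
MetricGroup stateMetricGroup = rocksdbMetricGroup.addGroup("state", stateName);
// liveDataSize is a placeholder for however the value is actually obtained.
stateMetricGroup.gauge("live_data_size", (Gauge<Long>) () -> liveDataSize);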

And I think you'd end up with metrics in Prometheus like

flink_taskmanager_job_task_operator_rocksdb_state_live_data_size{state="lastSeenRev", ... }

Keeping the dimensions out of the metric names.
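To make the difference concrete, here is a toy model of the naming behaviour described above. It is not Flink's implementation and calls no Flink APIs: `ToyMetricGroup`, `add_group`, and `series` are hypothetical names, and the assumed rule is that a plain group adds a segment to the metric name, while a key/value group adds the key to the name and exposes the value as a label.

```python
from typing import Dict, List, Optional


class ToyMetricGroup:
    """Toy stand-in for a metric group; NOT Flink's implementation."""

    def __init__(self, name_parts: List[str], labels: Dict[str, str]):
        self.name_parts = name_parts
        self.labels = labels

    def add_group(self, key: str, value: Optional[str] = None) -> "ToyMetricGroup":
        labels = dict(self.labels)
        if value is not None:
            # Key/value form: the key stays in the name, the value becomes a label.
            labels[key] = value
        return ToyMetricGroup(self.name_parts + [key], labels)

    def series(self, metric: str) -> str:
        """Render the Prometheus-style series name for a metric in this group."""
        name = "_".join(self.name_parts + [metric])
        if not self.labels:
            return name
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(self.labels.items()))
        return f"{name}{{{label_str}}}"


root = ToyMetricGroup(["flink", "taskmanager", "job", "task", "operator"], {})

# State name baked into the metric name: one metric name per state.
per_name = root.add_group("lastSeenRev").add_group("rocksdb").series("estimate_live_data_size")

# State name as a label: one metric name shared by all states.
labelled = root.add_group("rocksdb").add_group("state", "lastSeenRev").series("live_data_size")

print(per_name)
print(labelled)
```

With the second form, a generic dashboard panel can query a single metric name and filter or group by the `state` label, instead of needing a different metric name for each app.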

Change 923687 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] mw-page-content-change-enrich - bump to image version 1.18.0

https://gerrit.wikimedia.org/r/923687

Change 923687 merged by Ottomata:

[operations/deployment-charts@master] mw-page-content-change-enrich - bump to image version 1.18.0

https://gerrit.wikimedia.org/r/923687

Change 923699 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] mw-page-content-change-enrich - bump to image version 1.19.0

https://gerrit.wikimedia.org/r/923699

Change 923699 merged by Ottomata:

[operations/deployment-charts@master] mw-page-content-change-enrich - bump to image version 1.19.0

https://gerrit.wikimedia.org/r/923699