
Determine flink metrics configuration and backend when running from k8s
Open, High, Public

Description

As a streaming updater maintainer, I want to set up the Flink metrics system so that I have access to dashboards to visualize and check the health of the pipeline.

The pipeline dashboard is available at https://grafana.wikimedia.org/d/_kZ1VGRGk/wdqs-pipeline?orgId=1&refresh=1m and was created as part of T248450.

When running in Hadoop, we use this configuration:

metrics.reporters: graphite
#
metrics.reporter.graphite.class: org.apache.flink.metrics.graphite.GraphiteReporter
metrics.reporter.graphite.host: graphite-in.eqiad.wmnet
metrics.reporter.graphite.port: 2003
metrics.reporter.graphite.protocol: TCP

metrics.scope.tm: flink.taskmanager
metrics.scope.tm.job: flink.taskmanager.<job_name>
metrics.scope.task: flink.taskmanager.<job_name>.<task_name>
metrics.scope.operator: flink.taskmanager.<job_name>.<task_name>.<operator_name>.<subtask_index>

metrics.scope.jm: flink.jobmanager
metrics.scope.jm.job: flink.jobmanager.<job_name>

But this is now failing with:

2020-10-22 09:21:38,846 WARN  org.apache.flink.runtime.metrics.MetricRegistryImpl          [] - Error while registering metric: numBytesIn.
java.lang.IllegalArgumentException: A metric named flink.taskmanager.WDQS Streaming Updater POC.DecideMutationOperation -> RouteIgnoredMutationToSideOutput.numBytesIn already exists
	at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.0.2.jar:3.0.2]
	at org.apache.flink.dropwizard.ScheduledDropwizardReporter.notifyOfAddedMetric(ScheduledDropwizardReporter.java:131) ~[?:?]
	at org.apache.flink.runtime.metrics.MetricRegistryImpl.register(MetricRegistryImpl.java:344) ~[flink-dist_2.11-1.11.2.jar:1.11.2]
	at org.apache.flink.runtime.metrics.groups.AbstractMetricGroup.addMetric(AbstractMetricGroup.java:426) ~[flink-dist_2.11-1.11.2.jar:1.11.2]
	at org.apache.flink.runtime.metrics.groups.AbstractMetricGroup.counter(AbstractMetricGroup.java:359) ~[flink-dist_2.11-1.11.2.jar:1.11.2]
	at org.apache.flink.runtime.metrics.groups.AbstractMetricGroup.counter(AbstractMetricGroup.java:349) ~[flink-dist_2.11-1.11.2.jar:1.11.2]
	at org.apache.flink.runtime.metrics.groups.ProxyMetricGroup.counter(ProxyMetricGroup.java:52) ~[flink-dist_2.11-1.11.2.jar:1.11.2]
	at org.apache.flink.runtime.metrics.groups.TaskIOMetricGroup.<init>(TaskIOMetricGroup.java:53) ~[flink-dist_2.11-1.11.2.jar:1.11.2]
[...]

and the dashboard is no longer functioning.
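
One possible reading of the error, not confirmed in this task: the name in the exception matches the metrics.scope.task format (flink.taskmanager.<job_name>.<task_name>), which contains no <subtask_index>. If two subtasks of the same task register metrics in the same TaskManager, both expand to the same name. The Python sketch below only illustrates that expansion. The parallelism of 2, the helper names, and the alternative format with <subtask_index> are assumptions for illustration, not a tested fix.

```python
from collections import Counter

# Task scope format copied from the configuration in this task.
TASK_SCOPE = "flink.taskmanager.<job_name>.<task_name>"
# Hypothetical alternative that also includes the subtask index
# (an assumption for illustration, not a verified fix).
TASK_SCOPE_WITH_INDEX = "flink.taskmanager.<job_name>.<task_name>.<subtask_index>"


def expand(scope_format, variables):
    """Replace each <variable> placeholder in the format with its value."""
    name = scope_format
    for key, value in variables.items():
        name = name.replace("<" + key + ">", str(value))
    return name


def duplicate_names(scope_format, metric, parallelism):
    """Return metric names that are generated more than once across subtasks."""
    names = [
        expand(scope_format, {
            # Job and task names taken from the exception message.
            "job_name": "WDQS Streaming Updater POC",
            "task_name": "DecideMutationOperation -> RouteIgnoredMutationToSideOutput",
            "subtask_index": index,
        }) + "." + metric
        for index in range(parallelism)  # parallelism is an assumed value
    ]
    return sorted(name for name, count in Counter(names).items() if count > 1)


if __name__ == "__main__":
    # With the current format, both subtasks produce the same name.
    print(duplicate_names(TASK_SCOPE, "numBytesIn", parallelism=2))
    # With the subtask index in the format, no name repeats.
    print(duplicate_names(TASK_SCOPE_WITH_INDEX, "numBytesIn", parallelism=2))
```

If this reading is right, any scope format that does not distinguish subtasks would collide in the same way whenever more than one subtask of a task runs in a single TaskManager.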

AC:

Event Timeline

dcausse created this task. Oct 23 2020, 10:16 AM
Restricted Application added a project: Wikidata. Oct 23 2020, 10:16 AM
Restricted Application added a subscriber: Aklapper.
CBogen triaged this task as High priority. Nov 2 2020, 6:24 PM
CBogen moved this task from Current work to Scaling on the Wikidata-Query-Service board.