
Push Gobblin import metrics to Prometheus and add alerts on some critical imports
Closed, Resolved · Public

Description

Should be able to use Prometheus Push Gateway for this.
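
For reference, a minimal sketch of the Pushgateway flow. This uses the Python prometheus_client library purely for illustration; the Gobblin side would push from the JVM, and the gateway address and metric names below are placeholders, not the values used in our setup.

```
# Illustrative only: push a single gauge to a Prometheus Pushgateway.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
last_run = Gauge(
    'gobblin_last_run_unixtime',
    'Unix timestamp of the last Gobblin run (illustrative)',
    registry=registry,
)
last_run.set_to_current_time()

# Metrics pushed under the same job (and grouping key) replace the
# previous push, so the gateway always exposes the latest value.
push_to_gateway('pushgateway.example.org:9091', job='gobblin_test', registry=registry)
```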

Event Timeline

Gobblin mentions emitting metrics via Kafka here: https://gobblin.apache.org/docs/metrics/Metrics-for-Gobblin-ETL/

Is there native support for Prometheus?

No, Joseph was going to have to add it.

Change 724825 had a related patch set uploaded (by Joal; author: Joal):

[analytics/gobblin@wmf] Make SimpleStringWriter instrumented for metrics

https://gerrit.wikimedia.org/r/724825

I have added some findings about how Gobblin generates and defines metrics here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin#Publishing_metrics
I'd be happy to discuss the details with anyone interested, so we can decide which metrics to publish and how.

Summary of yesterday's discussion with @Ottomata and @fgiunchedi -- thanks folks :)

  • The usual way to represent new successful instances of a job is to send an end-of-job timestamp as a metric. It would be a counter (always increasing), and it allows monitoring that its difference from the current time stays within expected boundaries (see the sketch after this list). Note: there is a special case for when data is missing, as it would be when the push-gateway gets restarted.
  • The number of metrics generated per job-type * tasks is manageable, allowing us to send metrics for ingested rows and written rows. We will need to define a task tag so that per-task metrics don't overwrite each other. There is no real value in knowing the task value, but since Gobblin naturally doesn't aggregate them and the cardinality is not high, we'll go with that representation. Max estimated metrics: 4 job-types * 128 (max) tasks * 5 metrics.
  • We wish to be able to monitor Kafka pulled data for data quality (how many rows are pulled per topic-partition per job instance). Gobblin doesn't have those metrics as is, so we'll need to add them (should be feasible). For those metrics we won't add the task tag, since pulling from a topic-partition is done within a single task (but one task can pull from multiple topic-partitions). Not adding the task tag divides the potential number of generated metrics by 128, so it's worth it (it would actually be too many metrics if we added it). Max expected metrics: 4 metrics * 10000 topic-partitions (overestimated).
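
To make the above concrete, here is a hedged sketch of the two metric shapes discussed: the end-of-job timestamp pushed per job, and per-task row counts pushed with a task grouping key so tasks don't overwrite each other. Python's prometheus_client stands in for whatever the Gobblin/JVM side ends up doing; all metric names, labels, and the gateway address are illustrative assumptions.

```
# Illustrative sketch only: metric names, labels and the gateway
# address are assumptions, not the values we will necessarily ship.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

GATEWAY = 'pushgateway.example.org:9091'  # placeholder address


def push_job_success(job_name: str) -> None:
    """Push the end-of-job timestamp for one successful job instance."""
    registry = CollectorRegistry()
    last_success = Gauge(
        'gobblin_last_successful_run_unixtime',
        'Unix time of the last successful run of this Gobblin job',
        registry=registry,
    )
    last_success.set_to_current_time()
    # Alerting idea from the discussion: fire when
    #   time() - gobblin_last_successful_run_unixtime
    # exceeds the expected run interval, with some care for the
    # missing-data case after a push-gateway restart.
    push_to_gateway(GATEWAY, job=job_name, registry=registry)


def push_task_rows(job_name: str, task_id: str,
                   rows_ingested: int, rows_written: int) -> None:
    """Push per-task row counts; the task id goes in the grouping key
    so tasks of the same job don't overwrite each other's push."""
    registry = CollectorRegistry()
    ingested = Gauge('gobblin_task_rows_ingested',
                     'Rows ingested by this task', registry=registry)
    written = Gauge('gobblin_task_rows_written',
                    'Rows written by this task', registry=registry)
    ingested.set(rows_ingested)
    written.set(rows_written)
    push_to_gateway(GATEWAY, job=job_name, registry=registry,
                    grouping_key={'task': task_id})
```

Dropping the task grouping key for the per-topic-partition metrics, as decided above, is what keeps their series count around 4 * 10000 instead of 128 times that.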

Our plan is to split the work into two tasks: first sending the existing metrics and the end-of-job timestamp, then updating Gobblin to add the new Kafka metrics and send them.

Thank you for the summary @JAllemandou, looks great! A few points in no particular order:

I think that's it for now, feel free to add me to reviews and/or more tasks too. Thanks for trying out pushgateway and reaching out!

Change 772829 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/alerts@master] Add 2 new alerts for data-engineering gobblin

https://gerrit.wikimedia.org/r/772829

Change 772829 merged by Ottomata:

[operations/alerts@master] Add 2 new alerts for data-engineering gobblin

https://gerrit.wikimedia.org/r/772829