
Push Gobblin import metrics to Prometheus and add alerts on some critical imports
Closed, Resolved · Public

Description

Should be able to use Prometheus Push Gateway for this.
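
For reference, a minimal sketch of the Pushgateway flow. This uses the Python prometheus_client library purely for illustration; the Gobblin side would push from the JVM, and the gateway address and metric names below are placeholders, not the values used in our setup.

```
# Illustrative only: push a single gauge to a Prometheus Pushgateway.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
last_run = Gauge(
    'gobblin_last_run_unixtime',
    'Unix timestamp of the last Gobblin run (illustrative)',
    registry=registry,
)
last_run.set_to_current_time()

# Metrics pushed under the same job (and grouping key) replace the
# previous push, so the gateway always exposes the latest value.
push_to_gateway('pushgateway.example.org:9091', job='gobblin_test', registry=registry)
```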

Event Timeline

Gobblin mentions emitting metrics via Kafka here: https://gobblin.apache.org/docs/metrics/Metrics-for-Gobblin-ETL/

Is there native support for Prometheus?

No, Joseph was going to have to add it.

Change 724825 had a related patch set uploaded (by Joal; author: Joal):

[analytics/gobblin@wmf] Make SimpleStringWriter instrumented for metrics

https://gerrit.wikimedia.org/r/724825

I have added some findings about how Gobblin generates and defines metrics here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin#Publishing_metrics
I'd be happy to discuss the details with anyone interested, so we can decide which metrics to publish and how.

Summary of yesterday's discussion with @Ottomata and @fgiunchedi -- thanks folks :)

  • The usual way to represent new successful instances of a job is to send an end-of-job timestamp as a metric. It would be a counter (always increasing), and it allows monitoring that its difference from the current time stays within expected boundaries (see the sketch after this list). Note: there is a special case for when data is missing, as it would be when the push-gateway gets restarted.
  • The number of metrics generated per job-type * tasks is manageable, allowing us to send metrics for ingested rows and written rows. We will need to define a task tag so that per-task metrics don't overwrite each other. There is no real value in knowing the task value, but since Gobblin naturally doesn't aggregate them and the cardinality is not high, we'll go with that representation. Max estimated metrics: 4 job-types * 128 (max) tasks * 5 metrics.
  • We wish to be able to monitor Kafka pulled data for data quality (how many rows are pulled per topic-partition per job instance). Gobblin doesn't have those metrics as is, so we'll need to add them (should be feasible). For those metrics we won't add the task tag, since pulling from a topic-partition is done within a single task (but one task can pull from multiple topic-partitions). Not adding the task tag divides the potential number of generated metrics by 128, so it's worth it (it would actually be too many metrics if we added it). Max expected metrics: 4 metrics * 10000 topic-partitions (overestimated).
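
To make the above concrete, here is a hedged sketch of the two metric shapes discussed: the end-of-job timestamp pushed per job, and per-task row counts pushed with a task grouping key so tasks don't overwrite each other. Python's prometheus_client stands in for whatever the Gobblin/JVM side ends up doing; all metric names, labels, and the gateway address are illustrative assumptions.

```
# Illustrative sketch only: metric names, labels and the gateway
# address are assumptions, not the values we will necessarily ship.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

GATEWAY = 'pushgateway.example.org:9091'  # placeholder address


def push_job_success(job_name: str) -> None:
    """Push the end-of-job timestamp for one successful job instance."""
    registry = CollectorRegistry()
    last_success = Gauge(
        'gobblin_last_successful_run_unixtime',
        'Unix time of the last successful run of this Gobblin job',
        registry=registry,
    )
    last_success.set_to_current_time()
    # Alerting idea from the discussion: fire when
    #   time() - gobblin_last_successful_run_unixtime
    # exceeds the expected run interval, with some care for the
    # missing-data case after a push-gateway restart.
    push_to_gateway(GATEWAY, job=job_name, registry=registry)


def push_task_rows(job_name: str, task_id: str,
                   rows_ingested: int, rows_written: int) -> None:
    """Push per-task row counts; the task id goes in the grouping key
    so tasks of the same job don't overwrite each other's push."""
    registry = CollectorRegistry()
    ingested = Gauge('gobblin_task_rows_ingested',
                     'Rows ingested by this task', registry=registry)
    written = Gauge('gobblin_task_rows_written',
                    'Rows written by this task', registry=registry)
    ingested.set(rows_ingested)
    written.set(rows_written)
    push_to_gateway(GATEWAY, job=job_name, registry=registry,
                    grouping_key={'task': task_id})
```

Dropping the task grouping key for the per-topic-partition metrics, as decided above, is what keeps their series count around 4 * 10000 instead of 128 times that.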

Our plan is to split the work into two tasks: first sending the existing metrics and the end-of-job timestamp, then updating Gobblin to add the new Kafka metrics and send them.

Thank you for the summary @JAllemandou, looks great! A few points in no particular order:

I think that's it for now, feel free to add me to reviews and/or more tasks too. Thanks for trying out pushgateway and reaching out!

Change 772829 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/alerts@master] Add 2 new alerts for data-engineering gobblin

https://gerrit.wikimedia.org/r/772829

Change 772829 merged by Ottomata:

[operations/alerts@master] Add 2 new alerts for data-engineering gobblin

https://gerrit.wikimedia.org/r/772829