Disable Data Platform Engineering generated graphite metrics and dashboards
Closed, Resolved | Public | 5 Estimated Story Points

Description

Disable existing Airflow DAGs exporting metrics to Graphite.

Some Airflow DAGs seem to be configured to send metrics to Graphite.
e.g. https://gitlab.wikimedia.org/search?search=graphite&nav_source=navbar&project_id=93&group_id=189&search_code=true&repository_ref=main

Prometheus does not support producing metrics with historical timestamps, so we cannot migrate these pipelines to produce to Prometheus.

  • Review all in-use Airflow DAGs that emit metrics to Graphite
  • Contact teams/owners of these DAGs and work with them to use different means of viewing these metrics
  • Disable and remove all Graphite-generating Airflow DAGs
  • Where possible, delete relevant Grafana dashboards that use the relevant Graphite metrics.

Resources:

Details

Other Assignee
mforns

Event Timeline

Ottomata renamed this task from migrate airflow metrics from graphite to prometheus to migrate Data Platform Engineering maintained metrics from graphite to prometheus. Aug 20 2024, 4:13 PM
Ottomata added a project: Test Kitchen.
Ottomata subscribed.

I retitled to make it clear that the metrics are not about Airflow, but are about pipeline jobs that are scheduled by Airflow.

@lmata is there a timeline for this sunsetting project? That will help us figure out when to focus on this work. Also is it tied to an OKR?

Hi @VirginiaPoundstone, the proposed timeline is to go read-only (Graphite) by the end of Q3-FY24/25; this is part of WE5.1.2. More info here: https://wikitech.wikimedia.org/wiki/Graphite/Deprecation_Roadmap. General announcements will follow in the next few days. Thanks!

Great. @Milimetric let's include this in the temp accounts work on pipelines.

cjming triaged this task as High priority. Nov 21 2024, 3:40 PM
cjming set the point value for this task to 5.
cjming moved this task from Incoming to Pipelines Backlog on the Test Kitchen board.
cjming added a project: Data Pipelines.

Hi all!

Just FYI, in case you aren't aware: it will likely be difficult or impossible to migrate Airflow-generated metrics from Graphite to Prometheus.

Prometheus does not support producing metrics for historical timestamps. Any metrics produced to it are given the current timestamp.

Graphite did support historical metrics; this is why it was used instead of Prometheus.

If the output of the job does not use historical metrics, then Prometheus will work just fine! Usually though, Airflow-scheduled jobs are expected to complete for a specific time period, and can be delayed and rerun later for periods in the past.
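
To illustrate the difference described here: Graphite's plaintext protocol carries an explicit Unix timestamp on every line, so a job rerun weeks later can still write to the day it covers. A minimal sketch, assuming a hypothetical metric path and helper name (neither is taken from the actual DAGs):

```python
from datetime import datetime, timezone


def graphite_line(path: str, value: float, when: datetime) -> str:
    """Format one sample for Graphite's plaintext protocol: '<path> <value> <unix_ts>'."""
    # The sample is stamped with the period the job covers, not the time it ran.
    return f"{path} {value} {int(when.timestamp())}\n"


# Hypothetical metric path; a job rerun in March can still target 1 January.
period = datetime(2025, 1, 1, tzinfo=timezone.utc)
line = graphite_line("daily.example.metric", 42, period)
```

A Prometheus client normally exposes only the current value and the server stamps it at scrape time, which is why a delayed rerun would land on the wrong date.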


It may be possible, but not easy, to do this.
https://dasl.cc/2024/07/07/setting-custom-timestamps-for-prometheus-metrics/

We'd have to work with SRE observability to see if setting custom timestamps and exporting historical metrics would be feasible.
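
For context on what a custom timestamp looks like: the Prometheus text exposition format allows an optional trailing timestamp, in milliseconds since the Unix epoch, after the sample value. The sketch below only renders such a line; the metric name, label, and helper are hypothetical, label values are not escaped, and whether the server would accept samples that old is exactly the open question for SRE observability.

```python
from datetime import datetime, timezone


def exposition_line(name: str, labels: dict, value: float, when: datetime) -> str:
    """Render one sample in the Prometheus text exposition format with an explicit timestamp."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    # The optional trailing field is milliseconds since the Unix epoch.
    ts_ms = int(when.timestamp() * 1000)
    return f"{name}{{{label_str}}} {value} {ts_ms}"


sample = exposition_line(
    "example_daily_metric",        # hypothetical metric name
    {"wiki": "wikidatawiki"},      # hypothetical label
    42,
    datetime(2025, 1, 1, tzinfo=timezone.utc),
)
```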

I believe @otto's point above means we have to turn off these metrics for now, until we can find a place to send them that will work with historical metrics. There are really only three options:

  1. convince Prometheus to accept historical metrics
  2. do not retry missed DAG runs for Prometheus-bound pipelines and accept the resulting gaps in data
  3. maintain a different data store for historical metrics (preferably something we already have like an Iceberg table)

Druid is similar to Prometheus; it may be possible to push these metrics there.

However, I believe the reason they are in Graphite is to enable public dashboarding with Grafana, so Druid + Superset might not be sufficient.

  3. maintain a different data store for historical metrics (preferably something we already have like an Iceberg table)

+1
I tried this some years ago with the anomaly_detection table, but there were problems with Hive partitioning affecting Superset's querying performance.
Now, with Iceberg, that should be cool and useful!
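
A rough sketch of why a table keyed on the metric's own date handles reruns well. It uses an in-memory SQLite table purely as a stand-in for an Iceberg table; the table name, column names, and helper are assumptions for illustration, not the schema of any existing dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE daily_metrics (
        metric TEXT NOT NULL,
        dt     TEXT NOT NULL,   -- the day the value describes, not the run time
        value  REAL NOT NULL,
        PRIMARY KEY (metric, dt)
    )
    """
)


def write_metric(metric: str, dt: str, value: float) -> None:
    """Insert or overwrite the value for (metric, dt) so reruns are idempotent."""
    conn.execute(
        "INSERT OR REPLACE INTO daily_metrics (metric, dt, value) VALUES (?, ?, ?)",
        (metric, dt, value),
    )


write_metric("example_metric", "2025-01-01", 10)  # original run
write_metric("example_metric", "2025-01-02", 12)
write_metric("example_metric", "2025-01-01", 11)  # late rerun corrects 1 January

rows = conn.execute("SELECT dt, value FROM daily_metrics ORDER BY dt").fetchall()
```

Because the date is a column rather than an ingestion timestamp, a delayed or repeated run overwrites the row for its own period instead of creating a misdated sample.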

@lmata, what is the timeline for sunsetting Graphite?

@VirginiaPoundstone The plan is to turn off metric intake by the end of this quarter, assuming a migration target of 90% is met by then. Otherwise, I've heard it's possible to extend it to Q4, but that is not the preferred outcome.

Ottomata renamed this task from migrate Data Platform Engineering maintained metrics from graphite to prometheus to Disable Data Platform Engineering generated graphite metrics and dashboards. Feb 20 2025, 5:28 PM
Ottomata updated the task description.

The API metrics have been disabled. The Wikidata metrics are still pending.

Moving this ticket to blocked until towards the end of March so that we can give the owners of the Wikidata metrics time to convert the metrics as discussed in this ticket.

@AndrewTavis_WMDE can we work with 28th March as the date to finally disable the Wikidata metrics job in Airflow?

Thanks for checking in on this, @Snwachukwu! I've communicated the above to stakeholders over here: @Ifrahkhanyaree_WMDE and @karapayneWMDE. I'm not sure that I'll be able to work on this this week, but if not I'd then have two weeks until the 28th, so basically one week and a few days to implement and a few days grace to make sure that all's working. This is certainly doable, but I'll need to have other things de-prioritized :) I'll confirm the decision on our end in the coming days!

Hey @Snwachukwu 👋 We're able to confirm March 28th as a goal for you all to disable the Wikidata metrics DAG. I'll try to get the individual DAGs migrated a few days before, and will be in touch on Slack if something comes up :) Will make new tasks as sub-tasks of T377352 today and subscribe you to them so you can follow the progress 😊

Hi @AndrewTavis_WMDE . Thank you for the confirmation. Please feel free to reach out if you need any form of support.

Cross posting from https://phabricator.wikimedia.org/T377352#10679786:

GitLab:data-engineering/airflow-dags#1185 adds five new DAGs for the following processes within wikidata_metrics_to_graphite_daily_dag that will populate tables in the Data Lake:

Tests have passed except for the approval job. The plan is to merge and auto-deploy tomorrow and then run the metrics from the 1st of January on. We'll then be good to disable wikidata_metrics_to_graphite_daily_dag by the March 28th deadline.

Cross posting from https://phabricator.wikimedia.org/T377352#10683430

GitLab:data-engineering/airflow-dags#1185 is merged, all five DAGs are deployed and we're all green for daily job success from January 1st 2025 on 💚 We're thus ready to turn off wikidata_metrics_to_graphite_daily_dag :)

@AndrewTavis_WMDE This is good news. Thank you so much for the effort put into this.

Happy to help, @Snwachukwu, and thanks for your coordination here! 😊