Tech debt: sunsetting of Graphite
Open, Medium, Public
Assigned To

None

Authored By

	fgiunchedi
	Jul 18 2019, 8:36 AM

Description

This task tracks the Graphite deprecation.

Sunsetting Graphite moves us onto a supported, scalable metrics platform that is better suited to long-term storage and multidimensional analysis of metrics.

Wikitech: Graphite deprecation roadmap

Context: The SRE Observability team has been using Prometheus as its preferred metrics storage in production for several years. Prometheus offers key benefits over Graphite and a more modern ecosystem. The Prometheus stack provides more robust data labeling, storage, and query capabilities. This effort facilitates the improvement of our production metrics infrastructure and the deprecation of older systems.
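To illustrate the labeling difference: a Graphite metric encodes every dimension in a single dotted path, whereas Prometheus attaches dimensions as key/value labels. The sketch below renders the same hypothetical measurement both ways. The metric name, label names, and values are invented for illustration and are not taken from any production metric.

```python
def graphite_path(parts):
    """Join hierarchy components into a dotted Graphite metric path."""
    return ".".join(parts)


def escape_label_value(value):
    """Escape backslash, double quote, and newline as the Prometheus text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def prometheus_sample(name, labels, value):
    """Render one sample line in the Prometheus text exposition format."""
    if not labels:
        return f"{name} {value}"
    # Sort labels only to make the output deterministic for this example.
    label_str = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{label_str}}} {value}"


# Hypothetical measurement: request count per datacenter and HTTP method.
as_graphite = graphite_path(["example", "dc1", "GET", "requests"])
as_prometheus = prometheus_sample(
    "example_requests_total", {"site": "dc1", "method": "GET"}, 42
)
```

With labels, a query can aggregate or filter on `site` or `method` without knowing their position in a path, which is the practical meaning of "multidimensional" here.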

The thought process behind the deprecation is outlined in T249164: RFC: Better interface for generating metrics in MediaWiki.

In this context we distinguish two uses of Graphite: Graphite as fed by statsd (i.e. metrics are emitted via statsd over UDP and then turned into Graphite writes), which is tracked by T205870, and direct use of the Graphite protocol (i.e. the application natively speaks the Graphite protocol, as opposed to statsd).
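The two paths differ on the wire. As a minimal sketch, the helpers below format one StatsD datagram payload (`<name>:<value>|<type>`) and one line of the Graphite plaintext protocol (`<path> <value> <timestamp>`). The metric names are hypothetical, and nothing here reflects the actual hosts, ports, or client libraries used in production.

```python
import time
from typing import Optional


def statsd_line(name: str, value: float, metric_type: str = "c") -> bytes:
    """Format a StatsD payload; common types are 'c' (counter), 'g' (gauge), 'ms' (timer)."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def graphite_plaintext_line(
    path: str, value: float, timestamp: Optional[int] = None
) -> bytes:
    """Format one newline-terminated line of the Graphite plaintext protocol."""
    if timestamp is None:
        # Graphite expects seconds since the Unix epoch.
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n".encode("ascii")


# Hypothetical examples of each format.
statsd_payload = statsd_line("example.requests", 1)
graphite_payload = graphite_plaintext_line("example.requests.count", 42, 1563438960)
```

In the statsd case the application sends raw events and an aggregator computes the values that are written to Graphite. In the direct case the application supplies the final value and timestamp itself.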

Migrate MediaWiki off Graphite

Migrate other graphite protocol users

wikidata.rc hierarchy, via statistics::wmde::graphite on stat hosts
librenms hierarchy, from the software of the same name, tracked in T372457: Remove librenms -> graphite integration, replace with gnmi
T372855: migrate Data Platform Engineering maintained metrics from graphite to prometheus
T233089: Export zuul metrics to Prometheus

Graphite Technical Deprecation

Related Objects

Status	Assigned	Task
Open	None	T228380 Tech debt: sunsetting of Graphite
Open	None	T205870 Fully migrate producers off statsd
Resolved	colewhite	T233089 Export zuul metrics to Prometheus
Resolved	ACraze	T233448 Review prometheus ORES rules for completeness
Declined	colewhite	T239833 StatsD Exporter drops relayed metrics
Resolved	colewhite	T240685 MediaWiki Prometheus support
Resolved	colewhite	T249164 RFC: Better interface for generating metrics in MediaWiki
Resolved	Krinkle	T292311 Create project tag for MediaWiki-libs-Metrics
Resolved	Krinkle	T292269 Decouple Profiler class from WebRequest and RequestContext
Resolved	Krinkle	T344748 MediaWiki Core - Review and merge StatsLib patch
Resolved	herron	T344751 Decide on default histogram buckets for MediaWiki timers
Open	None	T240995 AQS is not OpenAPI 3 compliant
Resolved	Pchelolo	T241176 Review and release service-runner 2.8.0
Resolved	colewhite	T247820 Decide on `service-runner` aggregated prometheus metrics and use of `service` label
Resolved	Jgiannelos	T277857 Proton metrics broken
Open	None	T175087 Create a navtiming processor for Prometheus
Declined	None	T190936 navtiming.py: When processing metrics, include effectiveConnectionType as a factor
Resolved	Peter	T323124 Replace navtiming Platform tag ("site") with mw_skin
Resolved	Krinkle	T323129 Simulate client dispatch in a single scrape
Open	None	T321398 Move performance metrics from Graphite to Prometheus
Open	None	T325282 Update Grafana alerts to use metrics from Prometheus
Open	Peter	T325283 Update navtiming dashboards to use Prometheus metrics
Open	None	T325284 Update documentation to use Prometheus instead of Graphite
Open	None	T336764 Simplify navtiming multi-dc logic
Open	None	T293761 statsd and gunicorn metrics for superset
Resolved	fgiunchedi	T233956 Deploy Thanos (long-term storage) stateless components: sidecar and query
Open	None	T366292 Determine and implement steps needed to facilitate read-only graphite in production
Open	None	T372457 Remove librenms -> graphite integration, replace with gnmi
Open	None	T372855 migrate Data Platform Engineering maintained metrics from graphite to prometheus
Open	None	T372856 Configure graphite to be read only
Open	None	T379156 Change/fix real user performance alert to only use Prometheus

Event Timeline

fgiunchedi created this task. Jul 18 2019, 8:36 AM

Restricted Application added a subscriber: Aklapper. Jul 18 2019, 8:36 AM

fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. Jul 18 2019, 10:45 AM

fgiunchedi moved this task from Inbox to In progress on the observability board. Jul 22 2019, 3:14 PM

MoritzMuehlenhoff subscribed. Jul 22 2019, 4:18 PM

fgiunchedi mentioned this in T89857: scale statsd reporting/aggregation (plan). Aug 13 2019, 1:07 PM

fgiunchedi mentioned this in T85451: scale graphite deployment (tracking).

fgiunchedi mentioned this in T99125: Add role for StatsD and Graphite.

fgiunchedi added a subtask: T205870: Fully migrate producers off statsd. Aug 13 2019, 1:12 PM

Other Graphite producers found while auditing metrics changed in the last 7d

cassandra (from maps)
analytics.mw_api.varnish_requests (from analytics refinery job)
coal
librenms
daily from https://github.com/wikimedia/analytics-wmde-scripts and possibly others
"reportupdater-queries" https://github.com/wikimedia/analytics-reportupdater-queries

fgiunchedi moved this task from Doing to Up next on the User-fgiunchedi board. Oct 18 2019, 1:40 PM

colewhite mentioned this in T240685: MediaWiki Prometheus support. Dec 13 2019, 3:40 PM

lmata subscribed. Jun 18 2020, 3:11 PM

fgiunchedi closed subtask T233956: Deploy Thanos (long-term storage) stateless components: sidecar and query as Resolved. Jul 8 2020, 8:44 AM

fgiunchedi moved this task from In progress to Backlog on the observability board. Jan 20 2021, 3:44 PM

lmata edited projects, added SRE Observability; removed observability. Jul 12 2021, 2:21 AM

lmata moved this task from Inbox to Backlog on the SRE Observability board. Jul 15 2021, 4:09 AM

lmata renamed this task from Tech debt: sunsetting of Graphite (part 1) (Q1 goal FY19-20) to Tech debt: sunsetting of Graphite (part 1). Aug 9 2021, 1:14 AM

lmata edited projects, added SRE Observability (FY2021/2022-Q1); removed SRE Observability, Goal.

fgiunchedi moved this task from Up next to Backlog on the User-fgiunchedi board. Sep 13 2021, 12:19 PM

colewhite edited projects, added SRE Observability (FY2021/2022-Q2); removed SRE Observability (FY2021/2022-Q1). Oct 1 2021, 12:20 AM

lmata triaged this task as Medium priority. Nov 16 2021, 4:44 PM

lmata moved this task from FY2021/2022-Q2 to FY2021/2022-Q3 on the SRE Observability board. Jan 13 2022, 2:02 PM

lmata edited projects, added SRE Observability (FY2021/2022-Q3); removed SRE Observability (FY2021/2022-Q2).

lmata edited projects, added SRE Observability (FY2021/2022-Q4); removed SRE Observability (FY2021/2022-Q3). Apr 11 2022, 1:15 PM

fgiunchedi edited projects, added SRE Observability (FY2022/2023-Q1); removed SRE Observability (FY2021/2022-Q4). Jul 1 2022, 8:17 AM

lmata edited projects, added Observability-Metrics; removed SRE Observability (FY2022/2023-Q1). Sep 13 2022, 1:29 AM

lmata moved this task from Inbox to Prioritized on the Observability-Metrics board.

fgiunchedi removed a project: User-fgiunchedi. Nov 25 2022, 8:46 AM

In T228380#5519657, @fgiunchedi wrote:

Graphite producers found while auditing […]:

coal

…

T335242: Decommission 'coal' and 'coal-web' services

lmata mentioned this in T343020: Converting MediaWiki Metrics to StatsLib. Jul 28 2023, 5:45 PM

lmata updated the task description. Aug 2 2023, 2:52 PM

MSantos subscribed. Apr 10 2024, 5:15 PM

colewhite mentioned this in T363753: Only select o11y-owned datasources on the Grafana Datasource utilization dashboard. Apr 29 2024, 7:30 PM

fgiunchedi renamed this task from Tech debt: sunsetting of Graphite (part 1) to Tech debt: sunsetting of Graphite. May 20 2024, 1:31 PM

fgiunchedi updated the task description.

lmata updated the task description. May 20 2024, 3:54 PM

lmata updated the task description. May 21 2024, 3:33 PM

lmata updated the task description. May 30 2024, 3:00 PM

lmata updated the task description. May 30 2024, 3:04 PM

lmata added a project: SRE Observability (FY2024/2025-Q3). May 30 2024, 3:08 PM

lmata moved this task from Inbox to Up next on the SRE Observability (FY2024/2025-Q3) board.

Michael subscribed. Jul 11 2024, 1:43 PM

Lucas_Werkmeister_WMDE mentioned this in T371520: graph for hits to Linked Data Endpoint (Special:EntityData) is broken. Jul 31 2024, 4:10 PM

Ottomata updated the task description. Aug 2 2024, 5:36 PM

Ottomata subscribed.

lmata updated the task description. Aug 7 2024, 2:23 PM

lmata updated the task description.

Peter subscribed. Aug 12 2024, 8:38 AM

Hi @lmata and @fgiunchedi, I wanted to check where we are with the sunsetting of Graphite: is the plan early 2025 or late 2025? I'm thinking about the performance data that we still have in Graphite. I wanted to check how much of a blocker that part is and how it can be fixed so you can do the move.

Before the performance team was closed down we moved many metrics to Prometheus, but they are not as battle-tested and iterated on as the metrics in Graphite, and I guess we don't have a team that is responsible for them.

fgiunchedi updated the task description. Aug 14 2024, 8:40 AM

andrea.denisse subscribed. Aug 14 2024, 4:27 PM

Hi @Peter, I'll reach out. We're in the final stages of getting the notices ready and posted, so I'll provide a preview so you don't have to wait.

Aklapper added a project: Technical-Debt. Aug 14 2024, 8:34 PM

lmata updated the task description. Aug 20 2024, 3:08 AM

lmata updated the task description. Aug 20 2024, 3:23 AM

lmata updated the task description. Aug 20 2024, 4:23 AM

Peter mentioned this in T373169: Iterate and update Web Team Site Performance dashboard. Aug 23 2024, 8:41 AM

Peter mentioned this in T375189: Document web performance alerts. Sep 19 2024, 12:56 PM

lmata mentioned this in T296295: Revise and improve Graphite backfill procedure. Oct 30 2024, 6:50 PM

lmata mentioned this in T152637: Document Graphite annotation (string events) usage.

Krinkle updated the task description. Thu, Dec 19, 10:03 PM