Tech debt: sunsetting of Graphite
Open, Medium, Public

Description

This task tracks the Graphite deprecation.

Sunsetting Graphite ensures we stay ahead with a supported, scalable metrics platform for more effective long-term, multidimensional metrics analysis and storage.

Wikitech: Graphite deprecation roadmap

Context: The SRE Observability team has been using Prometheus as its preferred metrics storage in production for several years. Prometheus offers key benefits over Graphite and a more modern ecosystem. The Prometheus stack provides more robust data labeling, storage, and query capabilities. This effort facilitates the improvement of our production metrics infrastructure and the deprecation of older systems.

The thought process behind the deprecation is outlined in T249164: RFC: Better interface for generating metrics in MediaWiki.

Read Only Date: Apr 30th 2025 (5pm UTC)

In this context we distinguish Graphite as used by statsd (i.e. metrics are emitted via statsd over UDP and then turned into Graphite writes), which is tracked by T205870, from use of the Graphite protocol directly (i.e. the application natively speaks the Graphite protocol, as opposed to statsd).
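
To make the distinction concrete, here is a minimal sketch of what the two wire formats look like. The metric names are made up for illustration; the ports are the ones referenced by the Puppet changes on this task (8125 for statsd, 2003 for Graphite). Nothing is sent over the network, the functions only format the payloads.

```python
import time

STATSD_PORT = 8125    # statsd listener (UDP)
GRAPHITE_PORT = 2003  # Graphite plaintext listener (TCP/UDP)


def statsd_line(name: str, value: float, kind: str = "c") -> bytes:
    """Format one statsd datagram: <name>:<value>|<type>."""
    return f"{name}:{value}|{kind}".encode("ascii")


def graphite_line(path: str, value: float, timestamp: int) -> bytes:
    """Format one Graphite plaintext line: <path> <value> <timestamp>, newline-terminated."""
    return f"{path} {value} {timestamp}\n".encode("ascii")


if __name__ == "__main__":
    # Hypothetical metric names, for illustration only.
    print(statsd_line("MediaWiki.example.requests", 1))
    print(graphite_line("example.app.queue_depth", 42, int(time.time())))
```

In the statsd case the sender supplies no timestamp and the aggregating daemon later writes to Graphite; in the direct case the application supplies the full path, value and timestamp itself.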

Migrate MediaWiki off Graphite
Migrate other graphite protocol users
Graphite Technical Deprecation
Phase out

Event Timeline

Change #1135088 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Stats: stop sending legacy metrics towards statsd

https://gerrit.wikimedia.org/r/1135088

Change #1135491 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: transform err field usage to hash

https://gerrit.wikimedia.org/r/1135491

The MW Prometheus Migration Dashboard is relevant again for this week.
https://grafana.wikimedia.org/d/nCxX65cSk/mediawiki-prometheus-migration?orgId=1&from=1744269513669&to=1744329600000

After the latest train, the difference in metric volume (metrics emitted ops/m) between Graphite and Prometheus is statistically insignificant. This slight difference indicates that it should be safe to set Graphite to read-only mode on the expected dates. Additionally, the remaining long-tail, unmigrated metrics may not pose a significant risk of future data loss.

Screenshot 2025-04-10 at 7.22.17 PM.png (1×2 px, 469 KB)

lmata updated the task description.

The MW Prometheus Migration Dashboard is relevant again for this week.

[…] Additionally, the remaining long-tail, unmigrated metrics may not pose a significant risk of future data loss.

I'd like to note for the record that your comment refers specifically to the availability of stats in Prometheus from MediaWiki PHP. That figure is now around 98%, which is pretty good indeed!

However, there are two other important data points:

  1. MediaWiki JS Progress: ~6%

This progress has only just begun as of January 2025, because migration was blocked for a year on T355837 until that was recently prioritized/resourced.

  2. Grafana dashboard progress: ~65%

The availability of the data in Prometheus does not mean that it is suitable and ready for use, or that the Graphite version of that same data isn't still relied upon. Because of the performance issues around Prometheus (T371102), even though the data is being fed to Prometheus and Graphite side-by-side and a majority of dashboards have been semi-automatically converted to use Prometheus, ~35% of the dashboards that we use regularly still use Graphite. This isn't because they haven't been converted yet, but because they've been reverted back to Graphite due to these issues.

I’m afraid we need a bit more time from the WMDE side. We’ve been working hard on migrating most of the remaining stuff over the past weeks, but T389344: analytics/wmde/scripts Graphite to Prometheus migration is not done yet (and probably needs some WMF assistance – we need access to Prometheus Pushgateway configured, if I understand correctly), and if Graphite is made read-only today (in 25 minutes, apparently), then some alerts will immediately start to fire (because some of the metrics written by those scripts are used for alerts like “maxlag too high” or “edit rate too low”). Also, some of the Wikibase changes haven’t been merged yet (nor ridden the train).

@Lucas_Werkmeister_WMDE What is the timeline to complete this work from your end? Would moving the date by a week be good enough?

I think that would work for us, yes. It’s a heck of a lot more than *checks watch* five minutes.

To be more specific – I already submitted the Puppet change we need for the Puppet window (35 minutes from now). For the changes to the script itself, I think we should be able to get to a mergeable state tomorrow or Thursday, and then that would be deployed via Puppet quickly (we’re not tied to the train there).

The Wikibase changes will take a bit more work, but I think we can also find a mergeable form by Friday, and then backport them to the deployed train branches next Monday or Tuesday.

I also can’t say if a week is enough to migrate all the JavaScript metrics that are apparently missing according to @Krinkle in T228380#10741554.

Actually, @karapayneWMDE deftly reminded me that those days I mentioned include Good Friday and Easter Monday (idk about the US but both of those are holidays in Germany), so if you could extend it to Thursday that would be even better 😅

Hi!

Thank you for sharing your concerns; please know they’ve been heard and thoughtfully considered. Since the JS/mw.track() metrics path was unblocked in January, we’ve observed limited outreach from impacted JavaScript teams and metrics owners.

However, recognizing we’ve recently received another request for additional time, we’ve discussed it internally and agreed to extend the migration deadline another two weeks until the end of April (30th of April). This extension will give all teams more time and support to migrate fully before the environment switches to read-only.

However, it is worth noting that we have been sending quarterly reminders and weekly updates communicating that the Graphite RO (read-only) date was approaching. The read-only date is set one year before the date on which the underlying hardware is scheduled for EoL. The details have been documented in a decision brief outlining the rationale for the dates and the read-only configuration of Graphite as part of the overall deprecation effort: https://docs.google.com/document/d/1bzNTJyzFCmPhpYMu2wywSxM6tSo6AxA5mx5l3zvDqX8/edit?tab=t.0.

If you or any team requires assistance or has concerns about your application’s metrics, please contact SRE Observability promptly so we can help you meet this extended timeline.

Hey @lmata ! Wikidata EM here. Thanks for extending the deadline! It is very much appreciated on our end. :)

As we were aware that y'all were sunsetting Graphite in Spring 2025, we began working on the migration of our stuff back in January. However, we only learned about the shut-off date of April 15 on March 15 with the announcement made on the wikitech mailing list. While we did drop our other prios to meet the original deadline, our spike missed some key parts which we now will use the extra time to complete.

Also, as my team and I have Wikimedia Germany accounts, we cannot access internal documents or Slack channels unless granted permissions. I've reviewed what communication I have received, and I do not believe the specific date was communicated to us in an accessible way before March 14 at 21:00 CET. So I ask that you please keep in mind the need to reach out to WMDE specifically about topics impacting Wikidata and Wikibase.

There is an older announcement on wikitech-l from 14 Nov 2024 saying it'll be set to RO in April 2025: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/KLUV4IOLRYXPQFWD6WKKJUHMWE77BMSZ/

Disable/remove any unused metrics and dashboards first, then follow the migration process outlined in the task to *migrate all “in-use” metrics before the end of Q3 FY 2024/2025 (March 2025)*. *After this date, Graphite will be read-only, and no new data will be ingested.*

Hi all, quick question as we address this Graphite migration work:
Can someone confirm whether the following metric families have already been instrumented and are available in Prometheus?

PagePreviewsApiResponse.*
PagePreviewsPreviewShow.*
These power several Web Team dashboard widgets (e.g., API response time, TTP, preview count, etc.), and we're trying to determine if this data is already flowing into Prometheus, or if we need to reinstrument it ourselves.

Appreciate any guidance—thanks so much!

cc: @ssingh @Jdrewniak

Hi @KSarabia-WMF!

From a cursory glance at codesearch, the Popups extension does not look migrated to the Prometheus-compatible interface.

Change #1139411 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Fastnetmon: permanently disable graphite

https://gerrit.wikimedia.org/r/1139411

Change #1139411 merged by Ayounsi:

[operations/puppet@production] Fastnetmon: permanently disable graphite

https://gerrit.wikimedia.org/r/1139411

Cross-posting from T388540#10774449: I have a question about what to do with gauges in MediaWiki-on-Prometheus.

While Prometheus counters are pretty straight-forward to aggregate, I'm not sure what to do with gauges.

https://grafana.wikimedia.org/d/BvWJlaDWk/startup-manifest-size

Screenshot 2025-04-28 at 21.35.47.png (770×2 px, 112 KB)

[…] the "old" data is still newly scraped every 30 seconds.

[When] a metric has any infrastructure-level labels unrelated to the MediaWiki application, that may alternate or otherwise change over time (i.e. data center, k8s pod template), then we're going to see echoes for a while of stale data.

Is there a best practice for how to query these correctly such that when multiple are found, the correct/most recent is returned for any given interval point?

[For example] apply max() as a tie-breaker. This is fine when aggregating/zooming out over multiple valid data points (e.g. zoom out from 5m to 1h and pick the max from that period), however for the above problem it just means data from days or weeks ago effectively overwrites recent data if it happens to be higher.

IIUC, you're referring to the tendency of once-reported metrics in a statsd-exporter instance lingering while the reporting function moves around the infrastructure?

If so, possibly pushgateway is appropriate if these are batch jobs? Or to switch to use an aggregation-compatible metric type?

@fgiunchedi, any ideas?

Yes, in Pushgateway you have a "grouping key" (for example job=foo), and you can then replace all metrics and their labels pushed under that grouping key.

Would the statsd-exporter TTL help in this case to avoid metrics lingering around?
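
As a rough illustration of the grouping-key idea, the sketch below builds the Pushgateway push URL and a text-format body using only the standard library. The host name, job name, label and metric name are hypothetical placeholders, and nothing is actually sent. With the Pushgateway HTTP API, a PUT to such a URL replaces every metric previously pushed under the same grouping key, while a POST replaces only metrics with the same names.

```python
from urllib.parse import quote


def pushgateway_url(base: str, job: str, grouping: dict[str, str]) -> str:
    """Build the push URL; the job plus any extra labels form the grouping key."""
    parts = [f"{base.rstrip('/')}/metrics/job/{quote(job, safe='')}"]
    for label, value in sorted(grouping.items()):
        parts.append(f"{quote(label, safe='')}/{quote(value, safe='')}")
    return "/".join(parts)


def exposition_body(samples: dict[str, float]) -> str:
    """Render gauge samples in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(samples.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical endpoint and metric, for illustration only.
    print(pushgateway_url("http://pushgateway.example:9091", "foo", {"instance": "batch1"}))
    print(exposition_body({"example_maxlag_seconds": 1.5}), end="")
```

Because the whole group is replaced on each PUT, a batch job that moves between hosts does not leave an old copy of the gauge behind, provided the grouping key itself does not include the host.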

Change #1135076 merged by Cwhite:

[operations/puppet@production] statsd: remove ferm rule for statsd port 8125

https://gerrit.wikimedia.org/r/1135076

We have the TTL set to 30d across all instances. Related discussion: T359497: StatsD Exporter: gracefully handle metric signature changes

Change #1140721 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/extensions/Wikibase@master] Remove dependency on IBufferingStatsdDataFactory

https://gerrit.wikimedia.org/r/1140721

Change #1140721 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Remove dependency on IBufferingStatsdDataFactory

https://gerrit.wikimedia.org/r/1140721

Change #1135081 merged by jenkins-bot:

[mediawiki/core@master] MediaWikiEntryPoint: stop emitting legacy statsd metrics

https://gerrit.wikimedia.org/r/1135081

Change #1144553 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] zuul: disable statsd_exporter relaying to graphite

https://gerrit.wikimedia.org/r/1144553

Change #1144554 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] airflow: disable statsd_exporter relaying to graphite

https://gerrit.wikimedia.org/r/1144554

Change #1144555 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: remove access to port 2003 tcp/udp

https://gerrit.wikimedia.org/r/1144555

Another example of dashboard and set of metrics that appears to have no way to reliably plot results from Prometheus:

Dashboard: ResourceLoader Bundle size

Code change tracked in T355960: Migrate MediaWiki.resourceloader* metrics to statslib.

Before (Graphite)
MediaWiki.resourceloader_module_transfersize_bytes.$wiki.$component.$module
After (Prometheus)
sum (
    mediawiki_resourceloader_module_transfersize_bytes{component=~"$Component",wiki="$Wiki"}
)
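
For readers unfamiliar with the two naming schemes, the following sketch shows how the positional segments of a dotted Graphite path correspond to named Prometheus labels. The helper function and the example values (enwiki, core, startup) are hypothetical, and the label order is taken from the Graphite path shown above; this is not how statslib or statsd-exporter perform the mapping.

```python
def graphite_to_prometheus(path: str, label_names: list[str]) -> str:
    """Turn '<Prefix>.<name>.<v1>.<v2>...' into '<prefix>_<name>{l1="v1",...}'."""
    prefix, name, *values = path.split(".")
    if len(values) != len(label_names):
        raise ValueError("number of path segments does not match label names")
    labels = ",".join(f'{k}="{v}"' for k, v in zip(label_names, values))
    return f"{prefix.lower()}_{name}{{{labels}}}"


if __name__ == "__main__":
    # Hypothetical wiki/component/module values, for illustration only.
    print(graphite_to_prometheus(
        "MediaWiki.resourceloader_module_transfersize_bytes.enwiki.core.startup",
        ["wiki", "component", "module"],
    ))
```

Note that this naive split assumes label values contain no dots, which is one reason positional Graphite paths are more fragile than explicit labels.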

The use of sum() was suggested in the Prometheus draft by @andrea.denisse, but this means that with every relevant statsd-exporter that traffic flows to, the values get multiplied. There doesn't appear to be a reliable way to plot these, since the multiple ingestion pathways will each continue to be re-crawled. While the number of duplicates is somewhat low for mwmaint (i.e. codfw and eqiad), it does change over time (new hostname), and after the k8s-mw-cron migration, multiplication will take off even further, both at the same time and over time.

I considered changing this to avg(), which would produce a more real-looking timeseries (the values _look_ realistic and are in the right order of magnitude) but remains "fake" and not useful, since they would continue to invisibly incorporate random days/weeks old data (I say random because it isn't consistently biased in any particular direction or otherwise related to or controlled by MediaWiki).

Example:

Screenshot 2025-05-12 at 20.38.03.png (1×2 px, 180 KB)

The value went up in this case, but it's not clear which one is "correct". Much less how to e.g. reliably alert on gauges like these.
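
The aggregation problem described in this comment can be reproduced with a small simulation. The numbers and instance names are invented; each tuple models one exporter instance that still exposes the same gauge. The "age" column is known only to the simulation: a real scrape stamps every value with the current time, which is why the stale copy cannot be told apart from the fresh one at query time.

```python
from statistics import mean

# (instance name, last value received, hours since that value was written)
series = [
    ("exporter-a", 120_000, 0.1),   # fresh value
    ("exporter-b", 150_000, 72.0),  # stale value, last written three days ago
]


def aggregate(series):
    """Apply the aggregations discussed above across all exposed copies."""
    values = [value for _, value, _ in series]
    return {"sum": sum(values), "avg": mean(values), "max": max(values)}


def freshest(series):
    """The value one would actually want: the most recently written one."""
    return min(series, key=lambda s: s[2])[1]


if __name__ == "__main__":
    print(aggregate(series))
    print(freshest(series))
```

In this example none of sum, avg or max returns the fresh value of 120000: sum doubles up, avg blends in the stale copy, and max picks the stale copy because it happens to be higher.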

Screenshot 2025-05-12 at 20.38.59.png (1×2 px, 246 KB)

Change #1144553 merged by Filippo Giunchedi:

[operations/puppet@production] zuul: disable statsd_exporter relaying to graphite

https://gerrit.wikimedia.org/r/1144553

Change #1144554 merged by Filippo Giunchedi:

[operations/puppet@production] airflow: disable statsd_exporter relaying to graphite

https://gerrit.wikimedia.org/r/1144554

Change #1144555 merged by Cwhite:

[operations/puppet@production] graphite: remove access to port 2003 tcp/udp

https://gerrit.wikimedia.org/r/1144555

@colewhite
Picking this thread back up.

From a cursory glance at codesearch, the Popups extension does not look migrated to the Prometheus-compatible interface.

Is there an example I can follow of how to do this? The documentation has an example for PHP but not for client side.
I'm wondering if this is all I have to do or if I am missing something else.

Client-side statsv metrics use mw.track(). The mw.track interface docs can be viewed here: https://www.mediawiki.org/wiki/ResourceLoader/Core_modules#mw.track

The link you provided here to statsv.js in the Popups extension looks correct to me, but I don't see the calls to mw.track. If they result in calls to mw.track, they look right.

@colewhite Yeah, it ultimately flows to mw.track. I map it out here. Thank you!

I was using these graphs to monitor AbuseFilter performance—specifically, average execution time over time and conditions used (there is a hard limit of 2,000—previously 1,000—that is subject to being hit). Is there a replacement for this dashboard?

  • Hardware deprecation and sunsetting (ETA June 2026)

Would it be possible to get a placeholder task for that? (Or rather for actually turning off Graphite, I don't really care about the hardware itself.) I would like to reference that in our own tasks about cleaning up the (now-hidden) queries for Graphite in our Grafana Dashboard panels.

I was using these graphs to monitor AbuseFilter performance—specifically, average execution time over time and conditions used (there is a hard limit of 2,000—previously 1,000—that is subject to being hit). Is there a replacement for this dashboard?

Unfortunately, not yet. The task to migrate the metrics is here: T359359: Migrate AbuseFilter Extension to statslib

Change #1135088 merged by jenkins-bot:

[mediawiki/core@master] Stats: stop sending legacy metrics towards statsd

https://gerrit.wikimedia.org/r/1135088

Change #1184071 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):

[operations/puppet@production] monitoring services: add migration task T228380 to instances

https://gerrit.wikimedia.org/r/1184071

Change #1184071 merged by Tiziano Fogli:

[operations/puppet@production] monitoring services: add migration task T228380 to instances

https://gerrit.wikimedia.org/r/1184071

Hi @VIGNERON 👋 I'm expecting that that same Grafana dashboard will be used in the future, as the metrics should still be sent, to Prometheus rather than Graphite. Prioritization-wise, getting all the Grafana panels back has needed to take a back seat to some yearly planning and other pressing tasks, but I'm hoping to be able to focus back on it come December and finalize it by the end of the year.

Change #1204621 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/core@master] stats: Remove deprecate no-op calls to `copyToStatsdAt()`

https://gerrit.wikimedia.org/r/1204621

Change #1208427 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] ParserOutputAccess/ParserCache: remove deprecated ::copyToStatsdAt()

https://gerrit.wikimedia.org/r/1208427

Change #1208427 merged by jenkins-bot:

[mediawiki/core@master] ParserOutputAccess/ParserCache: remove deprecated ::copyToStatsdAt()

https://gerrit.wikimedia.org/r/1208427

Change #1204621 merged by jenkins-bot:

[mediawiki/core@master] stats: Remove deprecate no-op calls to `copyToStatsdAt()`

https://gerrit.wikimedia.org/r/1204621

I'm expecting that that same Grafana dashboard will be used in the future, as the metrics should still be sent, to Prometheus rather than Graphite. Prioritization-wise, getting all the Grafana panels back has needed to take a back seat to some yearly planning and other pressing tasks, but I'm hoping to be able to focus back on it come December and finalize it by the end of the year.

Hi @AndrewTavis_WMDE and happy new year! any update on this?

Happy new year, @VIGNERON! I'll check with Product on our end for priorities. I expect that we'll start again on the Grafana migration at the end of the month once an initial set of new year tasks are finished :)