Fully migrate producers off statsd
Open, Medium, Public

Description

This task tracks porting statsd metrics traffic to Prometheus, either by porting the applications to native Prometheus metrics or by deploying statsd_exporter to expose Prometheus metrics derived from statsd traffic. The latter approach has been tested successfully for Thumbor in T145867: Test making thumbor statsd metrics available from Prometheus.

An audit of statsd producers can be generated with:

timeout 10m ngrep -q -W byline . udp dst port 8125  | grep -v -e '^U ' -e '^$' | cut -f1,2,3 -d.  | pigz -9c > statsd_users_10m.gz

zcat statsd_users_10m.gz | cut -d. -f1 | sort | uniq -dc | sort -rn > top_users_10m
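For readers less familiar with the shell pipeline, the aggregation step can be sketched in Python roughly as follows. This is an illustrative approximation rather than the tooling actually used: the function name and the sample lines are made up, and unlike `cut -d. -f1` it strips the `:value|type` suffix before taking the first dotted component.

```python
from collections import Counter


def top_producers(lines):
    """Count statsd samples per top-level metric prefix, most frequent first."""
    counts = Counter()
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and ngrep packet headers ("U src -> dst").
        if not line or line.startswith("U "):
            continue
        # A statsd sample looks like "name:value|type"; keep only the name.
        name = line.split(":", 1)[0]
        # The producer is taken to be the first dot-separated component.
        counts[name.split(".", 1)[0]] += 1
    # Mirror `uniq -dc`: report only prefixes seen more than once.
    return [(prefix, n) for prefix, n in counts.most_common() if n > 1]


# Made-up sample input, not real traffic.
sample = [
    "U 10.0.0.1:5000 -> 10.0.0.2:8125",
    "MediaWiki.edit.count:1|c",
    "MediaWiki.api.time:35|ms",
    "restbase.req:1|c",
    "",
    "restbase.req:1|c",
    "MediaWiki.jobs.run:1|c",
    "scap.deploy:1|c",
]
print(top_producers(sample))
```

Note that, as with `uniq -dc`, a producer seen only once in the sampling window (here `scap`) does not appear in the result.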

Looking at the mtime of graphite whisper files, a few infrequent statsd producers came up:

deploy
scap
gunicorn

There are a few other producers that go by way of statsd -> statsite (on localhost) -> graphite protocol to graphite1004:

thumbor
swift
zuul

Annotated list of producers above, with plan of action:

statsv-produced metrics, see also T180105

Metrics are in Prometheus; Grafana dashboards still need to be migrated.

  • mw.js.deprecate (generated client-side from mediawiki/extensions/WikimediaEvents)
  • mw.performance
  • browsertime (from WebPageReplay)
  • ve
  • Some metrics in MediaWiki hierarchy, e.g. minerva.WebClientError
  • pagepreviews (top level, PagePreviewsApiFailure/ PagePreviewsApiResponse/ PagePreviewsPreviewShow/)
  • media.thumbnail.client
  • webpagetest (generated by wpt-reporter from Jenkins)
  • wikibase.queryService.ui

navtiming-produced metrics, see also T175087

Prometheus counters for navtiming: https://gerrit.wikimedia.org/r/c/performance/navtiming/+/534771

  • frontend
  • mw.performance.save*
  • eventlogging.client_errors.navigation/paitingtiming
  • performance.survey

TODO

Use global aggregation / percentiles

See also https://wikitech.wikimedia.org/wiki/Prometheus/statsd_k8s, an introduction for service owners to writing their statsd_exporter mappings (written for k8s, but the guidelines are generic). Some of the services below use service-runner, for which some statsd metrics will need reconsideration (cf. T222795: Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats).
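As a rough illustration of what such a mapping does, the sketch below turns a dotted statsd name into a Prometheus metric name plus labels, using glob-style matching where each `*` captures one dot-separated component and `$1`, `$2`, ... refer to the captures. It is a simplified stand-in written for this note, not statsd_exporter's implementation, and the metric and label names are hypothetical.

```python
import re


def apply_mapping(statsd_name, match, name, labels):
    """Return (prometheus_name, labels) if statsd_name fits the glob, else None."""
    # Each "*" captures exactly one dot-separated component.
    pattern = "^" + r"\.".join(
        "([^.]+)" if part == "*" else re.escape(part)
        for part in match.split(".")
    ) + "$"
    m = re.match(pattern, statsd_name)
    if m is None:
        return None
    # Substitute $1, $2, ... in label values with the captured components.
    resolved = {
        key: re.sub(r"\$(\d+)", lambda ref: m.group(int(ref.group(1))), value)
        for key, value in labels.items()
    }
    return name, resolved


# Hypothetical statsd name and mapping, for illustration only.
print(apply_mapping(
    "citoid.GET.200.request",
    match="citoid.*.*.request",
    name="citoid_requests_total",
    labels={"method": "$1", "status": "$2"},
))
```

The point of the exercise is that information encoded positionally in the statsd name (method, status) becomes explicit labels, which is what makes aggregation across them possible on the Prometheus side.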

Dependent on Service-Runner:

  • aqs PR
  • changeprop/cpjobqueue - scheduled to be moved to k8s PR
  • eventstreams - scheduled to be moved to k8s PR
  • eventgate
  • kartotherian, tilerator, tileratorui -- PR
  • mobileapps - some parts are moving to k8s PR
  • service-template-node PR
  • proton PR
  • recommendation-api PR
  • restbase - (parts?) moving to k8s PR
  • hyperswitch PR
  • citoid -- PR
  • mathoid -- PR

Details

Repo  Branch  Lines +/-
operations/deployment-charts  master  +2 -4
mediawiki/services/recommendation-api  master  +3 K -8 K
operations/deployment-charts  master  +1 -1
operations/deployment-charts  master  +2 -7
operations/deployment-charts  master  +29 -5
mediawiki/services/recommendation-api  master  +4 -3
mediawiki/services/recommendation-api  master  +1 -1
mediawiki/services/recommendation-api  master  +15 -11
operations/puppet  production  +37 -0
operations/deployment-charts  master  +8 -83
mediawiki/services/citoid  master  +85 -22
mediawiki/services/citoid  master  +24 -16
mediawiki/core  master  +60 -0
mediawiki/core  master  +22 -0
mediawiki/core  master  +12 -2
mediawiki/core  master  +1 K -0
mediawiki/core  master  +996 -7
mediawiki/services/mathoid  master  +17 -11
operations/deployment-charts  master  +1 -1
mediawiki/services/chromium-render  master  +85 -26
mediawiki/services/mobileapps  master  +27 -17
operations/puppet  production  +8 -0
operations/puppet  production  +3 -1
mediawiki/services/eventstreams  master  +46 -219
operations/puppet  production  +16 -16
operations/puppet  production  +26 -18
operations/puppet  production  +13 -9
operations/puppet  production  +211 -83
operations/puppet  production  +59 -0
analytics/aqs  master  +2 -2
operations/puppet  production  +79 -0
operations/puppet  production  +2 -0
operations/puppet  production  +0 -0
operations/puppet  production  +2 -0
operations/puppet  production  +16 -1
operations/puppet  production  +3 -1
operations/puppet  production  +4 -2
operations/puppet  production  +29 -29
operations/puppet  production  +7 -2
operations/puppet  production  +6 -2
operations/puppet  production  +3 -0
operations/puppet  production  +2 -2
operations/puppet  production  +64 -64
operations/puppet  production  +89 -94
operations/puppet  production  +3 -0
operations/puppet  production  +9 -9
operations/puppet  production  +22 -18
operations/deployment-charts  master  +20 -0
operations/puppet  production  +915 -0
operations/puppet  production  +164 -0
operations/puppet  production  +11 -0
operations/puppet  production  +204 -0
operations/puppet  production  +3 -2
operations/puppet  production  +6 -6
operations/puppet  production  +3 -0
operations/puppet  production  +3 -0
operations/puppet  production  +42 -0
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +3 -1
operations/puppet  production  +17 -10
operations/puppet  production  +187 -2
operations/puppet  production  +28 -28
operations/puppet  production  +6 -2
operations/puppet  production  +114 -2
operations/puppet  production  +2 -0
operations/puppet  production  +14 -1
operations/puppet  production  +18 -1
operations/puppet  production  +9 -0
operations/puppet  production  +3 -0
operations/puppet  production  +96 -0
operations/puppet  production  +1 -0
operations/puppet  production  +8 -0
operations/puppet  production  +1 -1
operations/puppet  production  +76 -0
operations/debs/prometheus-statsd-exporter  master  +75 -0

Related Objects

Status  Assigned
Open  None
Open  None
Resolved  colewhite
Resolved  ACraze
Declined  colewhite
Resolved  colewhite
Resolved  colewhite
Resolved  Krinkle
Resolved  Krinkle
Resolved  Krinkle
Resolved  herron
Open  None
Resolved  Pchelolo
Resolved  colewhite
Resolved  Jgiannelos
Open  None
Declined  None
Resolved  Peter
Resolved  Krinkle
Open  None
Open  None
Open  Peter
Open  None
Open  None
Open  None

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 693429 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[mediawiki/services/mathoid@master] Switch to native prometheus latency histograms

https://gerrit.wikimedia.org/r/693429

Change 693429 merged by jenkins-bot:

[mediawiki/services/mathoid@master] Switch to native prometheus latency histograms

https://gerrit.wikimedia.org/r/693429

Change 717115 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] mathoid: Bump deployed version

https://gerrit.wikimedia.org/r/717115

Change 717115 merged by jenkins-bot:

[operations/deployment-charts@master] mathoid: Bump deployed version

https://gerrit.wikimedia.org/r/717115

Change 721626 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: Implement statsd-exporter compatible Metrics interface

https://gerrit.wikimedia.org/r/721626

Change 721627 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: Add metrics configuration options

https://gerrit.wikimedia.org/r/721627

Change 721628 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: Wire up MetricsFactory into ServiceWiring

https://gerrit.wikimedia.org/r/721628

Change 721629 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: Perform MetricsFactory->flush() in emitBufferedStatsdData()

https://gerrit.wikimedia.org/r/721629

Change 721630 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: send MetricsFactory to emit step

https://gerrit.wikimedia.org/r/721630

Change 556420 merged by jenkins-bot:

[mediawiki/services/citoid@master] Update to service-template-node 0.10.0.

https://gerrit.wikimedia.org/r/556420

Change 724129 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] pass MetricsFactory instance to emitBufferedStatsdData in MWLBFactory

https://gerrit.wikimedia.org/r/724129

Change 721626 merged by jenkins-bot:

[mediawiki/core@master] Metrics: Implement statsd-exporter compatible Metrics interface

https://gerrit.wikimedia.org/r/721626

Change 585032 abandoned by Cwhite:

[mediawiki/core@master] Metrics: Implement and enable statsd-exporter compatible Metrics interface

Reason:

in favor of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/721626/

https://gerrit.wikimedia.org/r/585032

Change 721629 abandoned by Cwhite:

[mediawiki/core@master] Metrics: Perform MetricsFactory->flush() in emitBufferedStatsdData()

Reason:

in favor of I46f0a09f4dab38fa4c9495aa2da9ecab60376ca7

https://gerrit.wikimedia.org/r/721629

Change 721628 abandoned by Cwhite:

[mediawiki/core@master] Metrics: Wire up MetricsFactory into ServiceWiring

Reason:

in favor of I46f0a09f4dab38fa4c9495aa2da9ecab60376ca7

https://gerrit.wikimedia.org/r/721628

Change 721627 merged by jenkins-bot:

[mediawiki/core@master] Metrics: Wire up MetricsFactory into ServiceWiring and emit steps

https://gerrit.wikimedia.org/r/721627

Change 767180 had a related patch set uploaded (by Mvolz; author: Mvolz):

[mediawiki/services/citoid@master] Move remaining metrics to prometheus

https://gerrit.wikimedia.org/r/767180

Change 767180 merged by jenkins-bot:

[mediawiki/services/citoid@master] Move remaining metrics to prometheus

https://gerrit.wikimedia.org/r/767180

Change 776233 had a related patch set uploaded (by Mvolz; author: PipelineBot):

[operations/deployment-charts@master] citoid: switch to native prometheus metrics

https://gerrit.wikimedia.org/r/776233

Change 776233 merged by jenkins-bot:

[operations/deployment-charts@master] citoid: switch to native prometheus metrics

https://gerrit.wikimedia.org/r/776233

This is now deployed for citoid.

I have updated grafana for the most part; however, this broke a few (minor) metrics which relied on the service-runner native ones: quantiles by status and method are broken, but overall quantiles are still working. I'm not sure how to fix them, but I'm also not sure how essential they are, since it's the same info just broken down a bit further.

Garbage collection metrics have been broken for a while, as have memory pod metrics, and that's not related to this change.

https://grafana-rw.wikimedia.org/d/NJkCVermz/citoid

This is great to see! Thanks for your help @Mvolz

ack, please feel free to contact us (SRE o11y) for assistance with the missing/broken metrics if needed

I looked into it a while ago and didn't make any progress, and I won't be able to look at it in the next two weeks either because I'll be away, so if you'd like to have a look, be my guest! GC is broken for mathoid too, and a bunch of zotero metrics also don't work (though I'm not sure they ever did, as it's not really tooled very well).

GC metrics were removed in service-runner 2.9.0

I took a stab at fixing citoid quantiles by (method, endpoint, status) as well as total memory, top 5 pods memory, and traffic by http status. Please have a look to see if they're fixed in a way you would expect. If something seems amiss, please feel free to make any modification you deem appropriate.
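For context on how per-label quantiles can be derived once latencies are stored as native histograms, here is a minimal sketch of estimating a quantile from cumulative bucket counts by linear interpolation within the bucket that contains the target rank. It is an approximation written for illustration, not the query used on the dashboard, and the bucket bounds, counts, and label values are made up.

```python
import math


def bucket_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, ending with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Cannot interpolate into +Inf; fall back to the last finite bound.
                return lower_bound
            # Interpolate linearly inside the bucket that contains the rank.
            share = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * share
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Made-up cumulative buckets (seconds) keyed by hypothetical (method, status) labels.
by_labels = {
    ("GET", "200"): [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)],
    ("POST", "500"): [(0.1, 0), (0.5, 2), (1.0, 4), (math.inf, 4)],
}
for labels, buckets in by_labels.items():
    print(labels, bucket_quantile(0.95, buckets))
```

Because each label combination keeps its own buckets, quantiles broken down by method and status fall out of the same data as the overall quantiles, at the cost of accuracy bounded by the bucket widths.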

A quick update on "high frequency" statsd producers sampled over 10 minutes on graphite1004. The list is getting shorter and shorter and that's great to see!

546061489 MediaWiki
35414078 restbase
1333791 aqs
 832382 frontend
 182788 kartotherian
   7935 ve
   4908 wikibase
   3380 service_checker
   3185 restbase-dev
   3013 mw
   2061 tilerator
   1915 performance
   1620 Vector
   1554 growthExperiments
    297 tileratorui
    120 gunicorn
     59 Wikidata
     16 eventlogging
     14 cloudvps

Change 484586 abandoned by Cwhite:

[operations/puppet@production] scb: enable statsd_exporter and add matching rules

Reason:

https://gerrit.wikimedia.org/r/484586

Change 558184 merged by jenkins-bot:

[mediawiki/services/recommendation-api@master] update service-runner to 2.8.0 and implement new metrics api

https://gerrit.wikimedia.org/r/558184

Hi @colewhite! I worked with James to port the Recommendation-api to nodejs 18, and one of the patches that we merged is:

https://gerrit.wikimedia.org/r/c/mediawiki/services/recommendation-api/+/558184

When we deploy the new code, I see this error (and no metrics reported):

{"name":"recommendation-api","hostname":"recommendation-api-production-5459988bb6-g4n7q","pid":17,"level":"ERROR","levelPath":"error/metrics","msg":"endTiming() unsupported for metric type Gauge","time":"2023-12-05T15:18:14.563Z","v":0}

We rolled back, but I am wondering if you have more info (I have limited knowledge about what service-runner does behind the scenes). If we solve this problem we'll be able to deploy anytime :)
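One way to read the error above: a gauge keeps a single current value, so there is nothing meaningful for a timing call to record into, whereas a histogram accumulates every observation into buckets. The toy classes below illustrate that difference only; they are written for this note and are not service-runner's or any Prometheus client's API.

```python
import bisect


class Gauge:
    """Holds a single current value; each set() overwrites the previous one."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations per bucket, with inclusive upper bounds."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # final slot stands for +Inf
        self.sum = 0.0

    def observe(self, value):
        # bisect_left finds the first bound >= value, i.e. the matching bucket.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.sum += value


latencies = [0.05, 0.2, 0.2, 1.5]  # made-up request durations in seconds
gauge, histogram = Gauge(), Histogram([0.1, 0.5, 1.0])
for seconds in latencies:
    gauge.set(seconds)
    histogram.observe(seconds)

print(gauge.value)       # only the last duration survives
print(histogram.counts)  # per-bucket counts for <=0.1, <=0.5, <=1.0, +Inf
```

With a gauge, three of the four durations are lost, which is why a latency metric is usually better modelled as a histogram.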

Change 981645 had a related patch set uploaded (by Elukey; author: Elukey):

[mediawiki/services/recommendation-api@master] Use "set" instead of "endTiming" in makeMetric

https://gerrit.wikimedia.org/r/981645

Change 981645 merged by Elukey:

[mediawiki/services/recommendation-api@master] Use "set" instead of "endTiming" in makeMetric

https://gerrit.wikimedia.org/r/981645

Change 982047 had a related patch set uploaded (by Elukey; author: Elukey):

[mediawiki/services/recommendation-api@master] Move the only metric produced from Gauge to Histogram

https://gerrit.wikimedia.org/r/982047

@colewhite hi again! I added some context to https://gerrit.wikimedia.org/r/c/mediawiki/services/recommendation-api/+/982047, now I have a better idea about what's happening. Lemme know what's best and if I am missing something!

Change 982047 merged by jenkins-bot:

[mediawiki/services/recommendation-api@master] Move the only metric produced from Gauge to Histogram

https://gerrit.wikimedia.org/r/982047

Change 983403 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] recommendation-api: update statsd configuration

https://gerrit.wikimedia.org/r/983403

Change 983403 merged by jenkins-bot:

[operations/deployment-charts@master] recommendation-api: update monitoring config

https://gerrit.wikimedia.org/r/983403

Change 983694 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: update Docker image and settings for Recommendation API

https://gerrit.wikimedia.org/r/983694

Change 983694 merged by Elukey:

[operations/deployment-charts@master] services: update Docker image and settings for Recommendation API

https://gerrit.wikimedia.org/r/983694

Tried to deploy rec-api without the statsd exporter: all good, but the metrics are still not 100% OK. From a quick look it seems that we define the new metric as "seconds", but in reality the value it carries is in ms.

This was resolved in service-runner 3.1.0. Will recommendation-api work with that version?
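To make the unit mismatch concrete, the sketch below shows what happens when a millisecond value is observed against buckets defined in seconds: nearly every observation lands in the +Inf bucket. The bucket bounds and the helper name are assumptions chosen for illustration, not the values used by service-runner or recommendation-api.

```python
import math

# Hypothetical bucket upper bounds, in seconds.
SECONDS_BUCKETS = (0.05, 0.1, 0.5, 1.0, 5.0)


def bucket_for(value_seconds, bounds=SECONDS_BUCKETS):
    """Return the upper bound of the first bucket that contains the value."""
    for bound in bounds:
        if value_seconds <= bound:
            return bound
    return math.inf


latency_ms = 250  # made-up request duration measured in milliseconds

print(bucket_for(latency_ms))           # treated as seconds: falls into +Inf
print(bucket_for(latency_ms / 1000.0))  # converted first: falls into the 0.5 s bucket
```

Converting to the metric's declared unit before observing keeps the bucket layout meaningful.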

Change 984103 had a related patch set uploaded (by Elukey; author: Elukey):

[mediawiki/services/recommendation-api@master] Upgrade to service-runner 3.1.0

https://gerrit.wikimedia.org/r/984103

Change 984103 merged by jenkins-bot:

[mediawiki/services/recommendation-api@master] Upgrade to service-runner 3.1.0

https://gerrit.wikimedia.org/r/984103

Change 984131 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: update rec-api's staging Docker image

https://gerrit.wikimedia.org/r/984131

Change 984131 merged by Elukey:

[operations/deployment-charts@master] services: update rec-api's staging Docker image

https://gerrit.wikimedia.org/r/984131

Done! It is running in staging :)

Change #1018717 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: update the rec-api's Docker image

https://gerrit.wikimedia.org/r/1018717

Change #1018717 merged by Elukey:

[operations/deployment-charts@master] services: update the rec-api's Docker image

https://gerrit.wikimedia.org/r/1018717

Mentioned in SAL (#wikimedia-operations) [2024-06-10T13:36:40Z] <elukey> move recommendation-api on wikikube to prometheus metrics (offboarded from statsd) - T205870

@colewhite o/ I finally deployed recommendation-api, and this time it looks good. I also updated its dashboard:

https://grafana.wikimedia.org/d/Y5wk80oGk/recommendation-api?orgId=1

I see some differences between the old and new metrics, but I believe they are due to the finer granularity of the Prometheus metrics.

This is the snapshot before/after the deployment: https://grafana.wikimedia.org/d/Y5wk80oGk/recommendation-api?orgId=1&from=1718024904966&to=1718029628948

We should be good; if so, we can tick off rec-api :)

Aklapper added a subscriber: colewhite.

@colewhite: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assignee on October 11th.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!