Fully migrate producers off statsd
Open, Medium, Public

Description

This task tracks porting statsd metrics traffic to Prometheus, either by porting the applications to use native Prometheus metrics or by deploying statsd_exporter to expose Prometheus metrics derived from statsd traffic. The latter approach has been tested successfully for Thumbor in T145867: Test making thumbor statsd metrics available from Prometheus.
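
As a rough illustration of the first option, the sketch below shows what "native" metrics amount to: the application keeps labelled counters in memory and serves them in the Prometheus text exposition format when scraped, instead of pushing `name:value|c` datagrams to statsd. The class `NativeCounter` and the metric name `thumbnail_requests_total` are hypothetical and only for illustration; a real service would use a Prometheus client library rather than hand-rolled rendering, and label-value escaping is left out here.

```python
class NativeCounter:
    """Hypothetical stand-in for a client-library counter (illustration only)."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = {}  # maps a tuple of label values to the running total

    def inc(self, amount=1, **labels):
        """Increase the counter for one combination of label values."""
        key = tuple(str(labels[name]) for name in self.label_names)
        self._values[key] = self._values.get(key, 0) + amount

    def render(self):
        """Return the counter in Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key in sorted(self._values):
            pairs = ",".join(
                f'{name}="{value}"' for name, value in zip(self.label_names, key)
            )
            lines.append(f"{self.name}{{{pairs}}} {self._values[key]}")
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    requests = NativeCounter(
        "thumbnail_requests_total", "Thumbnail requests served.", ["status"]
    )
    requests.inc(status="200")
    requests.inc(status="500")
    print(requests.render(), end="")
```

The second option keeps the application unchanged: it still emits statsd datagrams, and statsd_exporter does the translation into this same exposition format.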

This is an audit of statsd traffic received on the graphite host, sorted by top-level metric name. The list also contains some garbage/invalid names, which can be ignored.

658503192 MediaWiki
104822039 restbase
 10536074 ores
  8342166 cpjobqueue
  6600347 changeprop
  2131440 aqs
  1575793 frontend
  1485833 logstash
  1335847 eventbus
   734305 mobileapps
   632162 parsoid
   430924 tilerator
   406920 kartotherian
   297334 eventstreams
    64529 graphoid
    29800 proton
    21729 service_checker
    21382 recommendation-api
    12616 restbase-dev
    10251 ve
     9951 mw
      555 webpagetest
      549 eventlogging
      464 tileratorui
      307 browsertime
      272 performance
      247 wikibase
      186 media
       57 parsoid-tests
       42 cloudvps

Generated with

timeout 10m ngrep -q -W byline . udp dst port 8125  | grep -v -e '^U ' -e '^$' | cut -f1,2,3 -d.  | pigz -9c > statsd_users_10m.gz

zcat statsd_users_10m.gz | cut -d. -f1 | sort | uniq -dc | sort -rn > top_users_10m
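
For readers less familiar with the shell pipeline, the counting step can be sketched in Python as follows. This is only an approximation of the second command, not a replacement for it: the function name `top_level_counts` is made up for illustration, and ties are ordered by name here, which may differ from the order `sort -rn` produces.

```python
from collections import Counter


def top_level_counts(lines, min_count=2):
    """Count statsd metric names by their first dot-separated component.

    Roughly mirrors `cut -d. -f1 | sort | uniq -dc | sort -rn`. Because
    `uniq -d` only prints entries that occur more than once, prefixes seen
    a single time are dropped, hence the default min_count of 2.
    """
    counts = Counter()
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Like `cut -d. -f1`: a line without any dot is kept whole.
        counts[line.split(".", 1)[0]] += 1
    ranked = [(n, name) for name, n in counts.items() if n >= min_count]
    # Highest count first; ties broken by name so the result is deterministic.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return ranked


if __name__ == "__main__":
    sample = [
        "MediaWiki.edit.count",
        "MediaWiki.api.hits",
        "MediaWiki.resourceloader.cache",
        "restbase.external.v1",
        "restbase.internal.v1",
        "oneoff.metric.name",
    ]
    for n, name in top_level_counts(sample):
        print(f"{n:>9} {name}")
```

To run it against the capture file, the lines could be read with `gzip.open("statsd_users_10m.gz", "rt", errors="replace")` and passed straight to the function.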

As it turns out, this isn't the whole story: checking the mtime of graphite whisper files turned up a few infrequent statsd producers:

deploy
scap
gunicorn
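
The mtime check could look something like the sketch below, which walks a whisper tree and reports the top-level directories that contain at least one recently modified `.wsp` file. It assumes the usual whisper layout, where a metric `a.b.c` is stored as `a/b/c.wsp` under the storage root; the function name and the choice of time window are illustrative, and the exact procedure used for this audit is not recorded in the task.

```python
import os
import time


def recently_updated_prefixes(whisper_root, max_age_seconds, now=None):
    """Return top-level metric directories with a recently modified .wsp file."""
    now = time.time() if now is None else now
    recent = set()
    for dirpath, _dirnames, filenames in os.walk(whisper_root):
        rel = os.path.relpath(dirpath, whisper_root)
        if rel == ".":
            continue  # files directly under the root have no top-level directory
        top = rel.split(os.sep, 1)[0]
        if top in recent:
            continue  # already known to be active, no need to stat more files
        for filename in filenames:
            if not filename.endswith(".wsp"):
                continue
            mtime = os.path.getmtime(os.path.join(dirpath, filename))
            if now - mtime <= max_age_seconds:
                recent.add(top)
                break
    return sorted(recent)
```

Subtracting the names already seen in the 10-minute capture from this result leaves the producers that write too rarely to show up in a short traffic sample.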

In addition to the statsd traffic hitting statsd.eqiad.wmnet listed above, a few other producers go by way of statsd -> statsite (on localhost) -> graphite protocol to graphite1004:

thumbor
swift
zuul

Annotated list of producers above, with plan of action:

statsv-produced metrics, see also T180105

Metrics are in Prometheus; Grafana dashboards still need to be migrated.

  • mw.js.deprecate (generated client-side from mediawiki/extensions/WikimediaEvents)
  • mw.performance
  • browsertime (from WebPageReplay)
  • ve
  • Some metrics in MediaWiki hierarchy, e.g. minerva.WebClientError
  • pagepreviews (top level, PagePreviewsApiFailure/ PagePreviewsApiResponse/ PagePreviewsPreviewShow/)
  • media.thumbnail.client
  • webpagetest (generated by wpt-reporter from Jenkins)
  • wikibase.queryService.ui

navtiming-produced metrics, see also T175087

Prometheus counters for navtiming: https://gerrit.wikimedia.org/r/c/performance/navtiming/+/534771

  • frontend
  • mw.performance.save*
  • eventlogging.client_errors.navigation/painttiming
  • performance.survey

TODO

Use global aggregation / percentiles

See also https://wikitech.wikimedia.org/wiki/Prometheus/statsd_k8s for an introduction for service owners on how to write their statsd_exporter mappings (written for k8s, but the guidelines are generic). Some of the services below use service-runner, for which some statsd metrics will need reconsideration (cf. T222795: Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats).
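
To give a feel for what such a mapping does, the sketch below imitates the general idea in Python: a dotted statsd name is matched against a glob pattern, and the wildcard components become Prometheus labels. This is not statsd_exporter's implementation or its configuration format (the real mappings are written as configuration, as described on the wiki page above); the pattern, the metric name `myservice_request_total`, and the label names are all hypothetical.

```python
import re


def compile_glob(pattern):
    """Compile a dotted glob where each '*' captures exactly one component."""
    parts = [r"([^.]*)" if part == "*" else re.escape(part)
             for part in pattern.split(".")]
    return re.compile(r"\.".join(parts))


def map_metric(statsd_name, mappings):
    """Return (prometheus_name, labels) for the first matching rule, else None.

    Each rule is (glob_pattern, prometheus_name, label_templates), where a
    template may reference the N-th wildcard as $N.
    """
    for pattern, prom_name, label_templates in mappings:
        match = compile_glob(pattern).fullmatch(statsd_name)
        if match is None:
            continue
        labels = {
            label: re.sub(r"\$(\d+)",
                          lambda ref: match.group(int(ref.group(1))),
                          template)
            for label, template in label_templates.items()
        }
        return prom_name, labels
    return None


# Hypothetical rule: myservice.request.<method>.<status> becomes one metric
# with two labels instead of one statsd name per combination.
MAPPINGS = [
    ("myservice.request.*.*", "myservice_request_total",
     {"method": "$1", "status": "$2"}),
]

if __name__ == "__main__":
    print(map_metric("myservice.request.GET.200", MAPPINGS))
```

The point of writing rules this way is that variable parts of the statsd hierarchy (method, status, endpoint) end up as labels on a single metric, which is what makes aggregation in Prometheus queries practical.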

Dependent on Service-Runner:

  • aqs PR
  • changeprop/cpjobqueue - scheduled to be moved to k8s PR
  • eventstreams - scheduled to be moved to k8s PR
  • eventgate
  • kartotherian, tilerator, tileratorui -- PR
  • mobileapps - some parts are moving to k8s PR
  • service-template-node PR
  • proton PR
  • recommendation-api PR
  • restbase - (parts?) moving to k8s PR
  • hyperswitch PR
  • citoid -- PR
  • mathoid -- PR

Details

Project                                     Branch      Lines +/-
mediawiki/services/recommendation-api       master      +15 -11
operations/deployment-charts                master      +8 -83
mediawiki/services/citoid                   master      +85 -22
mediawiki/services/citoid                   master      +24 -16
mediawiki/core                              master      +60 -0
mediawiki/core                              master      +22 -0
mediawiki/core                              master      +12 -2
mediawiki/core                              master      +1 K -0
mediawiki/core                              master      +996 -7
mediawiki/services/mathoid                  master      +17 -11
operations/deployment-charts                master      +1 -1
mediawiki/services/chromium-render          master      +85 -26
mediawiki/services/mobileapps               master      +27 -17
operations/puppet                           production  +8 -0
operations/puppet                           production  +3 -1
mediawiki/services/eventstreams             master      +46 -219
operations/puppet                           production  +16 -16
operations/puppet                           production  +26 -18
operations/puppet                           production  +13 -9
operations/puppet                           production  +211 -83
operations/puppet                           production  +59 -0
analytics/aqs                               master      +2 -2
operations/puppet                           production  +79 -0
operations/puppet                           production  +2 -0
operations/puppet                           production  +0 -0
operations/puppet                           production  +2 -0
operations/puppet                           production  +16 -1
operations/puppet                           production  +3 -1
operations/puppet                           production  +4 -2
operations/puppet                           production  +29 -29
operations/puppet                           production  +7 -2
operations/puppet                           production  +6 -2
operations/puppet                           production  +3 -0
operations/puppet                           production  +2 -2
operations/puppet                           production  +64 -64
operations/puppet                           production  +89 -94
operations/puppet                           production  +3 -0
operations/puppet                           production  +9 -9
operations/puppet                           production  +22 -18
operations/deployment-charts                master      +20 -0
operations/puppet                           production  +37 -0
operations/puppet                           production  +915 -0
operations/puppet                           production  +164 -0
operations/puppet                           production  +11 -0
operations/puppet                           production  +204 -0
operations/puppet                           production  +3 -2
operations/puppet                           production  +6 -6
operations/puppet                           production  +3 -0
operations/puppet                           production  +3 -0
operations/puppet                           production  +42 -0
operations/puppet                           production  +2 -0
operations/puppet                           production  +2 -0
operations/puppet                           production  +3 -1
operations/puppet                           production  +17 -10
operations/puppet                           production  +187 -2
operations/puppet                           production  +28 -28
operations/puppet                           production  +6 -2
operations/puppet                           production  +114 -2
operations/puppet                           production  +2 -0
operations/puppet                           production  +14 -1
operations/puppet                           production  +18 -1
operations/puppet                           production  +9 -0
operations/puppet                           production  +3 -0
operations/puppet                           production  +96 -0
operations/puppet                           production  +1 -0
operations/puppet                           production  +8 -0
operations/puppet                           production  +1 -1
operations/puppet                           production  +76 -0
operations/debs/prometheus-statsd-exporter  master      +75 -0

Related Objects

Status    Assigned
Open      None
Open      colewhite
Open      None
Resolved  ACraze
Declined  colewhite
Open      colewhite
Open      None
Resolved  Krinkle
Resolved  Krinkle
Open      None
Resolved  Pchelolo
Resolved  colewhite
Resolved  Jgiannelos
Open      Peter
Declined  None
Resolved  Peter
Resolved  Krinkle
Open      Peter
Open      None
Open      None
Open      None
Open      None

Event Timeline

Change 558184 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[mediawiki/services/recommendation-api@master] update service-runner to 2.8.0 and implement new metrics api

https://gerrit.wikimedia.org/r/558184

Change 558213 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[mediawiki/services/chromium-render@master] update service-runner dependency to 2.8.0 and implement new metrics api

https://gerrit.wikimedia.org/r/558213

Change 480259 abandoned by Cwhite:
proton: enable statsd_exporter and add matching rules to profile::proton

Reason:
in favor of https://gerrit.wikimedia.org/r/c/mediawiki/services/chromium-render/+/558213

https://gerrit.wikimedia.org/r/480259

Change 558696 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[analytics/aqs@master] update service-runner to 2.8.0 and hyperswitch to 0.14.0

https://gerrit.wikimedia.org/r/558696

Change 558732 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] scb: add graphoid matching rules and deploy statsd exporter to scb cluster

https://gerrit.wikimedia.org/r/558732

Change 559568 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[mediawiki/services/eventstreams@master] update service-runner to 2.8.0 and implement new metrics api

https://gerrit.wikimedia.org/r/559568

Change 558732 abandoned by Cwhite:
scb: add graphoid matching rules and deploy statsd exporter to scb cluster

Reason:
per https://phabricator.wikimedia.org/T211881#5509001

https://gerrit.wikimedia.org/r/558732

Change 542472 merged by Cwhite:
[operations/puppet@production] lvs, prometheus, profile: add blackbox job helper and enable openapi scrapes

https://gerrit.wikimedia.org/r/542472

Change 563283 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] lvs, monitoring, prometheus: bugfix openapi exports

https://gerrit.wikimedia.org/r/563283

Change 563283 merged by Cwhite:
[operations/puppet@production] lvs, monitoring, prometheus: bugfix openapi exports

https://gerrit.wikimedia.org/r/563283

Change 563301 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] monitoring, profile, prometheus: bugfix, prometheus params values

https://gerrit.wikimedia.org/r/563301

Change 563301 merged by Cwhite:
[operations/puppet@production] monitoring, profile, prometheus: bugfix, prometheus params values

https://gerrit.wikimedia.org/r/563301

Change 563306 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] lvs, monitoring: prometheus expects string[] type as value of params

https://gerrit.wikimedia.org/r/563306

Change 563306 merged by Cwhite:
[operations/puppet@production] lvs, monitoring: prometheus expects params value as string[] type

https://gerrit.wikimedia.org/r/563306

Change 559568 merged by Ottomata:
[mediawiki/services/eventstreams@master] Use new service-runner metrics for built in prometheus metrics

https://gerrit.wikimedia.org/r/559568

FYI, EventStreams is fully migrated to k8s and is using Cole's service-runner prometheus exporter code.

Dashboard here: https://grafana.wikimedia.org/d/znIuUcsWz/eventstreams

@Pchelolo I think we should merge Cole's code into service-runner master and migrate more services to use it.

Change 585032 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[mediawiki/core@master] Metrics: Implement and enable statsd-exporter compatible Metrics interface

https://gerrit.wikimedia.org/r/585032

Change 618388 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] profile: disable statsd_exporter relay for ores

https://gerrit.wikimedia.org/r/618388

Change 618388 merged by Cwhite:
[operations/puppet@production] profile: disable statsd_exporter relay for ores

https://gerrit.wikimedia.org/r/618388

Change 480943 abandoned by Filippo Giunchedi:
[operations/puppet@production] logstash: output webrequest 5xx metrics

Reason:
Logstash statsd output isn't a thing anymore

https://gerrit.wikimedia.org/r/480943

colewhite updated the task description.

Change 558213 merged by jenkins-bot:
[mediawiki/services/chromium-render@master] update service-runner dependency to 2.8.0 and implement new metrics api

https://gerrit.wikimedia.org/r/558213

Change 693429 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[mediawiki/services/mathoid@master] Switch to native prometheus latency histograms

https://gerrit.wikimedia.org/r/693429

Change 693429 merged by jenkins-bot:

[mediawiki/services/mathoid@master] Switch to native prometheus latency histograms

https://gerrit.wikimedia.org/r/693429

Change 717115 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] mathoid: Bump deployed version

https://gerrit.wikimedia.org/r/717115

Change 717115 merged by jenkins-bot:

[operations/deployment-charts@master] mathoid: Bump deployed version

https://gerrit.wikimedia.org/r/717115

Change 721626 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: Implement statsd-exporter compatible Metrics interface

https://gerrit.wikimedia.org/r/721626

Change 721627 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: Add metrics configuration options

https://gerrit.wikimedia.org/r/721627

Change 721628 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: Wire up MetricsFactory into ServiceWiring

https://gerrit.wikimedia.org/r/721628

Change 721629 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: Perform MetricsFactory->flush() in emitBufferedStatsdData()

https://gerrit.wikimedia.org/r/721629

Change 721630 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: send MetricsFactory to emit step

https://gerrit.wikimedia.org/r/721630

Change 556420 merged by jenkins-bot:

[mediawiki/services/citoid@master] Update to service-template-node 0.10.0.

https://gerrit.wikimedia.org/r/556420

Change 724129 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] pass MetricsFactory instance to emitBufferedStatsdData in MWLBFactory

https://gerrit.wikimedia.org/r/724129

Change 721626 merged by jenkins-bot:

[mediawiki/core@master] Metrics: Implement statsd-exporter compatible Metrics interface

https://gerrit.wikimedia.org/r/721626

Change 585032 abandoned by Cwhite:

[mediawiki/core@master] Metrics: Implement and enable statsd-exporter compatible Metrics interface

Reason:

in favor of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/721626/

https://gerrit.wikimedia.org/r/585032

Change 721629 abandoned by Cwhite:

[mediawiki/core@master] Metrics: Perform MetricsFactory->flush() in emitBufferedStatsdData()

Reason:

in favor of I46f0a09f4dab38fa4c9495aa2da9ecab60376ca7

https://gerrit.wikimedia.org/r/721629

Change 721628 abandoned by Cwhite:

[mediawiki/core@master] Metrics: Wire up MetricsFactory into ServiceWiring

Reason:

in favor of I46f0a09f4dab38fa4c9495aa2da9ecab60376ca7

https://gerrit.wikimedia.org/r/721628

Change 721627 merged by jenkins-bot:

[mediawiki/core@master] Metrics: Wire up MetricsFactory into ServiceWiring and emit steps

https://gerrit.wikimedia.org/r/721627

Change 767180 had a related patch set uploaded (by Mvolz; author: Mvolz):

[mediawiki/services/citoid@master] Move remaining metrics to prometheus

https://gerrit.wikimedia.org/r/767180

Change 767180 merged by jenkins-bot:

[mediawiki/services/citoid@master] Move remaining metrics to prometheus

https://gerrit.wikimedia.org/r/767180

Change 776233 had a related patch set uploaded (by Mvolz; author: PipelineBot):

[operations/deployment-charts@master] citoid: switch to native prometheus metrics

https://gerrit.wikimedia.org/r/776233

Change 776233 merged by jenkins-bot:

[operations/deployment-charts@master] citoid: switch to native prometheus metrics

https://gerrit.wikimedia.org/r/776233

This is now deployed for citoid.

I have updated grafana for the most part, however there are a few (minor) metrics this broke which relied on the service-runner native ones; quantiles by status and method are broken, but quantiles overall are still working. I'm not sure how to fix them, but I'm also not sure how essential they are, since it's the same info just broken down a bit further.

Garbage collection metrics have been broken for a while, as have memory pod metrics, and that's not related to this change.

https://grafana-rw.wikimedia.org/d/NJkCVermz/citoid

This is great to see! Thanks for your help @Mvolz

ack, please feel free to contact us (SRE o11y) for assistance with the missing/broken metrics if needed

I looked into it a while ago and didn't make any progress, and I won't be able to look at it in the next two weeks either as I'll be away, so if you'd like to have a look, be my guest! GC is broken for mathoid too, and a bunch of zotero metrics also don't work (though I'm not sure they ever did, as it's not really tooled very well).

GC metrics were removed in service-runner 2.9.0

I took a stab at fixing citoid quantiles by (method, endpoint, status) as well as total memory, top 5 pods memory, and traffic by http status. Please have a look to see if they're fixed in a way you would expect. If something seems amiss, please feel free to make any modification you deem appropriate.

A quick update on "high frequency" statsd producers sampled over 10 minutes on graphite1004. The list is getting shorter and shorter and that's great to see!

546061489 MediaWiki
35414078 restbase
1333791 aqs
 832382 frontend
 182788 kartotherian
   7935 ve
   4908 wikibase
   3380 service_checker
   3185 restbase-dev
   3013 mw
   2061 tilerator
   1915 performance
   1620 Vector
   1554 growthExperiments
    297 tileratorui
    120 gunicorn
     59 Wikidata
     16 eventlogging
     14 cloudvps