
Fully migrate producers off statsd
Open, Medium, Public

Description

This task tracks porting statsd metrics traffic to Prometheus, either by porting applications to native Prometheus metrics or by deploying statsd_exporter to expose Prometheus metrics derived from statsd traffic. The latter approach has been tested successfully for Thumbor in T145867: Test making thumbor statsd metrics available from Prometheus.
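
As a rough picture of what the second approach involves: statsd clients send lines of the form `name:value|type`, and an exporter accumulates them into values Prometheus can scrape. The sketch below handles only counters (`c`) and gauges (`g`) and ignores sample rates and timers. It is a simplified illustration, not how statsd_exporter is implemented, and the class name is hypothetical.

```python
class TinyStatsdAggregator:
    """Accumulate statsd counter and gauge lines into current values.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        self.values = {}

    def ingest(self, line):
        """Parse one 'name:value|type' line; return False if unsupported."""
        try:
            name, rest = line.strip().split(":", 1)
            value, metric_type = rest.split("|")[:2]
            number = float(value)
        except ValueError:
            # Malformed line: no ':' or '|' separator, or a non-numeric value.
            return False
        key = name.replace(".", "_")  # Prometheus names cannot contain dots
        if metric_type == "c":
            # Counters accumulate.
            self.values[key] = self.values.get(key, 0.0) + number
        elif metric_type == "g":
            # Gauges keep the last value seen.
            self.values[key] = number
        else:
            return False
        return True

    def render(self):
        """Render values as 'name value' lines, one per metric."""
        return "\n".join(
            f"{name} {value}" for name, value in sorted(self.values.items())
        )
```

Feeding it `thumbor.response.count:1|c` twice would leave a single `thumbor_response_count` entry holding the sum, which is the general shape of what gets scraped.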

This is an audit of statsd traffic received on the graphite host, grouped by top-level metric name and sorted by count. The list also contains some garbage/invalid names, which can be ignored.

658503192 MediaWiki
104822039 restbase
10536074 ores
8342166 cpjobqueue
6600347 changeprop
2131440 aqs
1575793 frontend
1485833 logstash
1335847 eventbus
734305 mobileapps
632162 parsoid
430924 tilerator
406920 kartotherian
297334 eventstreams
64529 graphoid
29800 proton
21729 service_checker
21382 recommendation-api
12616 restbase-dev
10251 ve
9951 mw
555 webpagetest
549 eventlogging
464 tileratorui
307 browsertime
272 performance
247 wikibase
186 media
57 parsoid-tests
42 cloudvps

Generated with

timeout 10m ngrep -q -W byline . udp dst port 8125  | grep -v -e '^U ' -e '^$' | cut -f1,2,3 -d.  | pigz -9c > statsd_users_10m.gz

zcat statsd_users_10m.gz | cut -d. -f1 | sort | uniq -dc | sort -rn > top_users_10m
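
The same tally can be sketched in Python for anyone who wants to post-process the capture without the shell pipeline. This is a minimal sketch, assuming the input is one metric name per line, as produced by the capture above. The function name `top_level_counts` is made up for illustration. Like `uniq -dc`, it drops names seen only once.

```python
from collections import Counter


def top_level_counts(lines):
    """Count statsd metric names by their first dot-separated component.

    Mirrors `cut -d. -f1 | sort | uniq -dc | sort -rn`: names that occur
    only once are dropped, and the result is sorted by descending count.
    """
    counts = Counter(
        line.strip().split(".", 1)[0] for line in lines if line.strip()
    )
    repeated = [(count, name) for name, count in counts.items() if count > 1]
    return sorted(repeated, reverse=True)


if __name__ == "__main__":
    # Made-up sample names, not real capture data.
    sample = [
        "MediaWiki.edit.count",
        "MediaWiki.api.latency",
        "restbase.request.GET",
        "restbase.request.POST",
        "MediaWiki.jobqueue.pop",
        "garbage",
    ]
    for count, name in top_level_counts(sample):
        print(count, name)
```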

As it turns out, that isn't the whole story: checking the mtime of graphite whisper files turned up a few infrequent statsd producers:

deploy
scap
gunicorn
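
A minimal sketch of that mtime check, assuming whisper files live under a directory tree whose first path component is the top-level metric name (the usual Graphite layout). The root path passed in and the helper name `recently_written` are placeholders, not the actual host configuration.

```python
import os
import time


def recently_written(root, max_age_seconds, now=None):
    """Return top-level directories under `root` that hold a .wsp file
    modified within the last `max_age_seconds`."""
    now = time.time() if now is None else now
    active = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(".wsp"):
                continue
            path = os.path.join(dirpath, filename)
            if now - os.path.getmtime(path) <= max_age_seconds:
                # The first path component below root is the top-level name.
                rel = os.path.relpath(path, root)
                active.add(rel.split(os.sep, 1)[0])
    return sorted(active)
```

Comparing the result for a long window (say, a week) against the 10-minute capture would surface producers that write too rarely to show up in a short packet capture.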

In addition to the statsd traffic hitting statsd.eqiad.wmnet listed above, a few other producers go by way of statsd -> statsite (on localhost) -> graphite protocol to graphite1004:

thumbor
swift
zuul

Annotated list of producers above, with plan of action:

statsv-produced metrics, see also T180105

  • mw.js.deprecate (generated client-side from mediawiki/extensions/WikimediaEvents)
  • mw.performance
  • browsertime (from WebPageReplay)
  • ve
  • Some metrics in MediaWiki hierarchy, e.g. minerva.WebClientError
  • pagepreviews (top level, PagePreviewsApiFailure/ PagePreviewsApiResponse/ PagePreviewsPreviewShow/)
  • media.thumbnail.client
  • webpagetest (generated by wpt-reporter from Jenkins)
  • wikibase.queryService.ui

navtiming-produced metrics, see also T175087

Prometheus counters for navtiming: https://gerrit.wikimedia.org/r/c/performance/navtiming/+/534771

  • frontend
  • mw.performance.save*
  • eventlogging.client_errors.navigation/painttiming
  • performance.survey

TODO

Use global aggregation / percentiles

See also https://wikitech.wikimedia.org/wiki/Prometheus/statsd_k8s for an introduction for service owners on how to write their statsd_exporter mappings (written for k8s, but the guidelines are generic). Some of the services below use service-runner, for which some statsd metrics will need reconsideration (cf. T222795: Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats).
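
To make the idea of a mapping concrete: a statsd_exporter mapping turns a dotted statsd name into a Prometheus metric name plus labels, with `*` globs capturing name components that can be referenced as `$1`, `$2`, and so on. The Python below only imitates that glob behaviour for illustration. It is not statsd_exporter's code, and the `restbase.*.*` pattern, metric name, and label names in the usage note are invented examples rather than the mappings used in production.

```python
import re


def compile_glob(pattern):
    """Turn a dotted glob such as 'a.*.b' into a regex where each '*'
    captures exactly one dot-free component."""
    parts = [
        r"([^.]+)" if part == "*" else re.escape(part)
        for part in pattern.split(".")
    ]
    return re.compile(r"\.".join(parts) + r"\Z")


def apply_mapping(statsd_name, pattern, prom_name, labels):
    """Return (prom_name, resolved_labels) if statsd_name matches, else None.

    `labels` maps label names to templates such as '$1'.
    """
    match = compile_glob(pattern).match(statsd_name)
    if match is None:
        return None
    resolved = {
        key: re.sub(r"\$(\d+)", lambda m: match.group(int(m.group(1))), template)
        for key, template in labels.items()
    }
    return prom_name, resolved
```

For example, `apply_mapping("restbase.GET.200", "restbase.*.*", "restbase_requests_total", {"method": "$1", "status": "$2"})` yields the metric name together with `method="GET"` and `status="200"` labels, which is the kind of reshaping a service owner describes when writing mappings.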

Dependent on Service-Runner:

  • aqs PR
  • changeprop/cpjobqueue - scheduled to be moved to k8s PR
  • eventstreams - scheduled to be moved to k8s PR
  • kartotherian, tilerator, tileratorui -- PR
  • mobileapps - some parts are moving to k8s PR
  • service-template-node PR
  • proton PR
  • recommendation-api PR
  • restbase - (parts?) moving to k8s PR
  • hyperswitch PR
  • citoid -- PR
  • service-template-node -- PR

Details

Related Gerrit Patches:
[operations/puppet@production] logstash: output webrequest 5xx metrics
[mediawiki/services/eventstreams@master] Use new service-runner metrics for built in prometheus metrics
[operations/puppet@production] lvs, monitoring: prometheus expects params value as string[] type
[operations/puppet@production] monitoring, profile, prometheus: bugfix, prometheus params values
[operations/puppet@production] lvs, monitoring, prometheus: bugfix openapi exports
[operations/puppet@production] lvs, prometheus, profile: add blackbox job helper and enable openapi scrapes
[operations/puppet@production] scb: add graphoid matching rules and deploy statsd exporter to scb cluster
[mediawiki/services/recommendation-api@master] update service-runner to 2.8.0 and implement new metrics api
[mediawiki/services/chromium-render@master] update service-runner dependency to 2.8.0 and implement new metrics api
[mediawiki/services/mobileapps@master] update service-runner to 2.8.0 and implement new metrics api
[analytics/aqs@master] update service-runner to 2.8.0 and hyperswitch to 0.14.0
[operations/puppet@production] proton: enable statsd_exporter and add matching rules to profile::proton
[mediawiki/services/citoid@master] update service-runner to 2.8.0 and implement new metrics api
[operations/puppet@production] hiera: update ores to pass statsd through the statsd exporter
[operations/puppet@production] swift: stop relaying to statsd/statsite
[operations/puppet@production] hiera: update ores to pass statsd through statsd_exporter
[operations/puppet@production] profile, prometheus, role: install swagger exporter on prometheus nodes
[operations/puppet@production] logstash: stop relaying to central statsd
[operations/puppet@production] hiera: disable statsd_exporter::relay_address on logstash nodes
[operations/puppet@production] profile: use prometheus for logstash alerting
[operations/puppet@production] prometheus: make statsd.relay-address toggle-able
[operations/puppet@production] thumbor: stop relaying to statsd/statsite
[operations/puppet@production] swift: remove statsite
[operations/puppet@production] swift: stop relaying to statsd/statsite
[operations/puppet@production] swift: port alerts to Prometheus
[operations/puppet@production] grafana: use Prometheus swift metrics for dashboard
[operations/puppet@production] prometheus: collect swift account/container stats globally
[operations/puppet@production] hiera: fix statsd rules
[operations/puppet@production] logstash: update statsd exporter mappings and use exporter
[operations/deployment-charts@master] add statsd_exporter config to mathoid
[operations/puppet@production] scb: enable statsd_exporter and add matching rules
[operations/puppet@production] mediawiki: enable statsd_exporter and add matching rules to appserver
[operations/puppet@production] varnish: enable statsd_exporter and add matching rules
[operations/puppet@production] profile: enable statsd_exporter and add matching rules to logstash::collector
[operations/puppet@production] profile: enable statsd_exporter and add matching rules to ores::worker
[operations/puppet@production] ci: use statsite for localhost statsd aggregation
[operations/puppet@production] statsite: move from role to profile
[operations/puppet@production] hieradata: send periodic swift stats to localhost
[operations/puppet@production] hieradata: switch all swift statsd traffic to statsd_exporter
[operations/puppet@production] swift: add statsd mappings for periodic metrics
[operations/puppet@production] swift: turn on statsd_exporter in eqiad
[operations/puppet@production] swift: turn on statsd_exporter in codfw
[operations/puppet@production] thumbor: relay statsd_exporter metrics to localhost
[operations/puppet@production] swift: add statsd_port parameter
[operations/puppet@production] hieradata: add statsd_exporter mappings for swift::storage
[operations/puppet@production] hieradata: rename swift proxy statsd_exporter mapping
[operations/puppet@production] swift: set statsd_exporter to relay to local statsd
[operations/puppet@production] hieradata: add statsd_exporter mappings for swift-proxy
[operations/puppet@production] swift: enable statsd_exporter
[operations/puppet@production] prometheus: set defaults for statsd_exporter
[operations/puppet@production] prometheus: add jobs for statsd_exporter
[operations/puppet@production] thumbor: add missing statsd_exporter mappings
[operations/puppet@production] thumbor: use statsd_exporter
[operations/puppet@production] thumbor: add prometheus-statsd-exporter
[operations/puppet@production] thumbor: set name for all statsd_exporter metrics
[operations/puppet@production] thumbor: fix missing statsd_exporter mappings
[operations/puppet@production] statsd_exporter: fix commandline flags
[operations/puppet@production] New class: prometheus::statsd_exporter
[operations/debs/prometheus-statsd-exporter@master] debian: add patch for inline udp usage

Event Timeline

fgiunchedi updated the task description. Sep 5 2019, 9:29 AM
fgiunchedi updated the task description. Sep 6 2019, 9:30 AM
fgiunchedi updated the task description. Sep 6 2019, 10:14 AM
fgiunchedi updated the task description. Sep 6 2019, 10:19 AM
fgiunchedi updated the task description. Sep 6 2019, 10:40 AM
fgiunchedi updated the task description. Sep 6 2019, 12:46 PM
fgiunchedi updated the task description. Sep 6 2019, 1:07 PM

Change 535148 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] logstash: stop relaying to central statsd

https://gerrit.wikimedia.org/r/535148

Change 535149 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: collect swift account/container stats globally

https://gerrit.wikimedia.org/r/535149

Change 535149 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: collect swift account/container stats globally

https://gerrit.wikimedia.org/r/535149

Change 535180 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] grafana: use Prometheus swift metrics for dashboard

https://gerrit.wikimedia.org/r/535180

Change 535182 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: port object diff alerts to Prometheus

https://gerrit.wikimedia.org/r/535182

Change 535180 merged by Filippo Giunchedi:
[operations/puppet@production] grafana: use Prometheus swift metrics for dashboard

https://gerrit.wikimedia.org/r/535180

Change 535515 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: stop relaying to statsd/statsite

https://gerrit.wikimedia.org/r/535515

Change 535182 merged by Filippo Giunchedi:
[operations/puppet@production] swift: port alerts to Prometheus

https://gerrit.wikimedia.org/r/535182

Change 535591 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thumbor: stop relaying to statsd/statsite

https://gerrit.wikimedia.org/r/535591

Change 535515 merged by Filippo Giunchedi:
[operations/puppet@production] swift: stop relaying to statsd/statsite

https://gerrit.wikimedia.org/r/535515

Change 536146 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: remove statsite

https://gerrit.wikimedia.org/r/536146

fgiunchedi updated the task description. Sep 12 2019, 12:32 PM
fgiunchedi updated the task description. Sep 12 2019, 12:38 PM

Change 536358 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] profile: use prometheus for logstash alerting

https://gerrit.wikimedia.org/r/536358

Change 536365 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] prometheus: make statsd.relay-address toggle-able

https://gerrit.wikimedia.org/r/536365

colewhite updated the task description. Sep 12 2019, 10:42 PM
fgiunchedi updated the task description. Sep 13 2019, 2:41 PM

Change 536146 merged by Filippo Giunchedi:
[operations/puppet@production] swift: remove statsite

https://gerrit.wikimedia.org/r/536146

Change 535591 merged by Filippo Giunchedi:
[operations/puppet@production] thumbor: stop relaying to statsd/statsite

https://gerrit.wikimedia.org/r/535591

Mentioned in SAL (#wikimedia-operations) [2019-09-16T12:44:03Z] <godog> stop thumbor traffic to statsd/graphite, use Prometheus only and replace Thumbor dashboard - T205870

fgiunchedi updated the task description. Sep 16 2019, 12:45 PM

Change 536365 merged by Cwhite:
[operations/puppet@production] prometheus: make statsd.relay-address toggle-able

https://gerrit.wikimedia.org/r/536365

Change 537561 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: disable statsd relay_address on logstash nodes

https://gerrit.wikimedia.org/r/537561

Change 536358 merged by Cwhite:
[operations/puppet@production] profile: use prometheus for logstash alerting

https://gerrit.wikimedia.org/r/536358

colewhite updated the task description. Sep 18 2019, 9:33 PM

Change 479139 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] ci: define statsd prometheus exporter mappings

https://gerrit.wikimedia.org/r/479139

Change 537561 merged by Cwhite:
[operations/puppet@production] hiera: disable statsd_exporter::relay_address on logstash nodes

https://gerrit.wikimedia.org/r/537561

colewhite updated the task description. Sep 19 2019, 10:41 PM
colewhite updated the task description.

Change 535148 abandoned by Filippo Giunchedi:
logstash: stop relaying to central statsd

Reason:
Obsoleted by I82d4f7be5

https://gerrit.wikimedia.org/r/535148

fgiunchedi updated the task description. Sep 24 2019, 12:46 PM
fgiunchedi updated the task description. Sep 24 2019, 12:50 PM
fgiunchedi updated the task description. Sep 24 2019, 2:32 PM

Change 538976 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: update ores to pass statsd through statsd_exporter

https://gerrit.wikimedia.org/r/538976

Krinkle removed a subscriber: Krinkle. Sep 25 2019, 8:21 PM

Change 541619 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] profile, prometheus: install swagger exporter on icinga

https://gerrit.wikimedia.org/r/541619

Change 542472 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] profile: added swagger exporter jobs at svc endpoints

https://gerrit.wikimedia.org/r/542472

Restricted Application added a subscriber: Masumrezarock100. Oct 11 2019, 5:27 PM

Change 541619 merged by Cwhite:
[operations/puppet@production] profile, prometheus, role: install swagger exporter on prometheus nodes

https://gerrit.wikimedia.org/r/541619

Change 538976 merged by Alexandros Kosiaris:
[operations/puppet@production] hiera: update ores to pass statsd through statsd_exporter

https://gerrit.wikimedia.org/r/538976

colewhite renamed this task from "Fully migrate >= 30% of producers off statsd" to "Fully migrate producers off statsd". Nov 6 2019, 4:37 PM
colewhite updated the task description. Nov 21 2019, 12:37 AM

Change 535188 merged by Filippo Giunchedi:
[operations/puppet@production] swift: stop relaying to statsd/statsite

https://gerrit.wikimedia.org/r/535188

Change 556052 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: update ores to pass statsd through the statsd exporter

https://gerrit.wikimedia.org/r/556052

Change 556052 merged by Cwhite:
[operations/puppet@production] hiera: update ores to pass statsd through the statsd exporter

https://gerrit.wikimedia.org/r/556052

colewhite updated the task description. Dec 10 2019, 9:42 PM
colewhite updated the task description. Dec 11 2019, 4:49 PM
colewhite updated the task description. Dec 11 2019, 5:03 PM
colewhite updated the task description. Dec 12 2019, 11:13 PM

Change 556834 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[mediawiki/services/mobileapps@master] update service-runner to 2.8.0 and implement new metrics api

https://gerrit.wikimedia.org/r/556834

Change 556420 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[mediawiki/services/citoid@master] update service-runner to 2.8.0 and implement new metrics api

https://gerrit.wikimedia.org/r/556420

colewhite updated the task description. Dec 12 2019, 11:53 PM

Change 558184 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[mediawiki/services/recommendation-api@master] update service-runner to 2.8.0 and implement new metrics api

https://gerrit.wikimedia.org/r/558184

colewhite updated the task description. Dec 16 2019, 7:52 PM

Change 558213 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[mediawiki/services/chromium-render@master] update service-runner dependency to 2.8.0 and implement new metrics api

https://gerrit.wikimedia.org/r/558213

colewhite updated the task description. Dec 16 2019, 9:02 PM
colewhite updated the task description. Dec 16 2019, 10:17 PM
colewhite updated the task description. Dec 17 2019, 12:26 AM

Change 480259 abandoned by Cwhite:
proton: enable statsd_exporter and add matching rules to profile::proton

Reason:
in favor of https://gerrit.wikimedia.org/r/c/mediawiki/services/chromium-render/+/558213

https://gerrit.wikimedia.org/r/480259

Change 558696 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[analytics/aqs@master] update service-runner to 2.8.0 and hyperswitch to 0.14.0

https://gerrit.wikimedia.org/r/558696

colewhite updated the task description. Dec 17 2019, 10:13 PM

Change 558732 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] scb: add graphoid matching rules and deploy statsd exporter to scb cluster

https://gerrit.wikimedia.org/r/558732

colewhite updated the task description. Dec 17 2019, 11:44 PM

Change 559568 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[mediawiki/services/eventstreams@master] update service-runner to 2.8.0 and implement new metrics api

https://gerrit.wikimedia.org/r/559568

colewhite updated the task description. Dec 19 2019, 7:22 PM

Change 558732 abandoned by Cwhite:
scb: add graphoid matching rules and deploy statsd exporter to scb cluster

Reason:
per https://phabricator.wikimedia.org/T211881#5509001

https://gerrit.wikimedia.org/r/558732

colewhite updated the task description. Dec 20 2019, 8:49 PM

Change 542472 merged by Cwhite:
[operations/puppet@production] lvs, prometheus, profile: add blackbox job helper and enable openapi scrapes

https://gerrit.wikimedia.org/r/542472

Change 563283 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] lvs, monitoring, prometheus: bugfix openapi exports

https://gerrit.wikimedia.org/r/563283

Change 563283 merged by Cwhite:
[operations/puppet@production] lvs, monitoring, prometheus: bugfix openapi exports

https://gerrit.wikimedia.org/r/563283

Change 563301 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] monitoring, profile, prometheus: bugfix, prometheus params values

https://gerrit.wikimedia.org/r/563301

Change 563301 merged by Cwhite:
[operations/puppet@production] monitoring, profile, prometheus: bugfix, prometheus params values

https://gerrit.wikimedia.org/r/563301

Change 563306 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] lvs, monitoring: prometheus expects string[] type as value of params

https://gerrit.wikimedia.org/r/563306

Change 563306 merged by Cwhite:
[operations/puppet@production] lvs, monitoring: prometheus expects params value as string[] type

https://gerrit.wikimedia.org/r/563306

Change 559568 merged by Ottomata:
[mediawiki/services/eventstreams@master] Use new service-runner metrics for built in prometheus metrics

https://gerrit.wikimedia.org/r/559568

colewhite updated the task description. Mon, Mar 2, 10:14 PM

FYI, EventStreams is fully migrated to k8s and is using Cole's service-runner prometheus exporter code.

Dashboard here: https://grafana.wikimedia.org/d/znIuUcsWz/eventstreams

@Pchelolo I think we should merge Cole's code into service-runner master and migrate more services to use it.