
Fully migrate producers off statsd
Open, Medium, Public

Description

This task tracks porting statsd metrics traffic to Prometheus, either by porting applications to native Prometheus metrics or by deploying statsd_exporter to expose Prometheus metrics derived from statsd traffic. The latter approach has been tested successfully for Thumbor in T145867: Test making thumbor statsd metrics available from Prometheus.

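This is an audit of statsd traffic received on the graphite host during a 10 minute capture, grouped by the first component of the metric name ("top level") and sorted by count. The capture also contains some garbage/invalid names, which can be ignored.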

658503192 MediaWiki
104822039 restbase
10536074 ores
8342166 cpjobqueue
6600347 changeprop
2131440 aqs
1575793 frontend
1485833 logstash
1335847 eventbus
734305 mobileapps
632162 parsoid
430924 tilerator
406920 kartotherian
297334 eventstreams
64529 graphoid
29800 proton
21729 service_checker
21382 recommendation-api
12616 restbase-dev
10251 ve
9951 mw
555 webpagetest
549 eventlogging
464 tileratorui
307 browsertime
272 performance
247 wikibase
186 media
57 parsoid-tests
42 cloudvps

Generated with

timeout 10m ngrep -q -W byline . udp dst port 8125  | grep -v -e '^U ' -e '^$' | cut -f1,2,3 -d.  | pigz -9c > statsd_users_10m.gz

zcat statsd_users_10m.gz | cut -d. -f1 | sort | uniq -dc | sort -rn > top_users_10m
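
To illustrate what the aggregation step does, here is a minimal sketch that runs the same cut/sort/uniq chain on a handful of made-up metric names. The sample names and the helper functions `sample_names` and `top_level_counts` are hypothetical and only stand in for the contents of statsd_users_10m.gz. The original command also passes -d to uniq, which with GNU uniq limits output to names seen more than once; the sketch leaves it out so that single-occurrence names stay visible.

```shell
# Hypothetical sample of captured metric names (first three dotted
# components), standing in for the contents of statsd_users_10m.gz.
sample_names() {
  cat <<'EOF'
MediaWiki.edit.success
MediaWiki.api.query
restbase.external.v1
MediaWiki.edit.failure
restbase.internal.sys
proton.request.count
EOF
}

# Keep only the top-level component, count occurrences, and sort by count
# in descending order. The trailing awk only normalises uniq's padding.
top_level_counts() {
  cut -d. -f1 | sort | uniq -c | sort -rn | awk '{print $1, $2}'
}

sample_names | top_level_counts
```

Appending `awk '$1 > 1'` to the chain would approximate the effect of -d, at the cost of hiding very infrequent producers.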

As it turns out, that isn't the whole story: looking at the mtime of graphite whisper files, a few infrequent statsd producers came up:

deploy
scap
gunicorn
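
The mtime check could be done along these lines. This is a sketch, not the command that was actually used: the helper name `recent_whisper_producers` is hypothetical, and the whisper root directory is installation-specific, so it is passed in as an argument. It relies on whisper storing metric a.b.c as a/b/c.wsp, so the first path component is the top-level name.

```shell
# Print the top-level metric hierarchies that have at least one whisper
# file modified within the last N days.
#   $1: whisper root directory (assumption: differs per installation)
#   $2: age threshold in days
recent_whisper_producers() (
  root="$1"; days="$2"
  cd "$root" || exit 1
  # Paths look like ./deploy/sync/count.wsp, so field 2 is the top level.
  find . -type f -name '*.wsp' -mtime "-$days" | cut -d/ -f2 | sort -u
)

# Demo on a throwaway directory: one fresh file, one stale file.
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/deploy/sync" "$demo_root/stale/old"
touch "$demo_root/deploy/sync/count.wsp"
touch -t 200001010000 "$demo_root/stale/old/count.wsp"
recent_whisper_producers "$demo_root" 30
```

Comparing this output against the top_users_10m list would surface producers that write too rarely to show up in a 10 minute capture.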

In addition to the above list of statsd traffic hitting statsd.eqiad.wmnet, there are a few other producers that go by way of statsd -> statsite (on localhost) -> graphite protocol to graphite1004:

thumbor
swift
zuul

Annotated list of producers above, with plan of action:

statsv-produced metrics, see also T180105

  • mw.js.deprecate (generated client-side from mediawiki/extensions/WikimediaEvents)
  • mw.performance
  • browsertime (from WebPageReplay)
  • ve
  • Some metrics in MediaWiki hierarchy, e.g. minerva.WebClientError
  • pagepreviews (top level, PagePreviewsApiFailure/ PagePreviewsApiResponse/ PagePreviewsPreviewShow/)
  • media.thumbnail.client
  • webpagetest (generated by wpt-reporter from Jenkins)
  • wikibase.queryService.ui

navtiming-produced metrics, see also T175087

Prometheus counters for navtiming: https://gerrit.wikimedia.org/r/c/performance/navtiming/+/534771

  • frontend
  • mw.performance.save*
  • eventlogging.client_errors.navigation/paitingtiming
  • performance.survey

TODO

Use global aggregation / percentiles

See also https://wikitech.wikimedia.org/wiki/Prometheus/statsd_k8s for an introduction for service owners on how to write their statsd_exporter mappings (in k8s, but guidelines are generic). Some of the services below use service-runner, for which some statsd metrics will need reconsideration (cfr T222795: Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats)
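
For readers unfamiliar with statsd_exporter mappings, here is a minimal sketch of what one looks like. The statsd metric name `thumbor.response.status.*` and the resulting Prometheus name are hypothetical examples, not the mappings deployed in production, and the helper `write_example_mapping` exists only for this illustration. The mapping turns the last dotted component into a label instead of a separate metric name.

```shell
# Write a hypothetical statsd_exporter mapping file to the path in $1.
# The quoted heredoc keeps "$1" inside the YAML literal: there it refers
# to the first glob capture, not to a shell variable.
write_example_mapping() {
  cat > "$1" <<'EOF'
mappings:
  # thumbor.response.status.200 -> thumbor_response_status_total{status="200"}
  - match: "thumbor.response.status.*"
    name: "thumbor_response_status_total"
    labels:
      status: "$1"
EOF
}

mapping_file="$(mktemp)"
write_example_mapping "$mapping_file"
cat "$mapping_file"
```

The exporter would then be started with --statsd.mapping-config pointing at this file. Metrics that match no rule are still exported, under a name derived mechanically from the dotted statsd name.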

Dependent on Service-Runner:

  • aqs
  • changeprop - scheduled to be moved to k8s
  • cpjobqueue - scheduled to be moved to k8s
  • eventstreams - scheduled to be moved to k8s
  • graphoid - under code stewardship review
  • kartotherian, tilerator, tileratorui -- PR
  • mobileapps - some parts are moving to k8s PR
  • service-template-node PR
  • proton
  • recommendation-api
  • restbase - (parts?) moving to k8s PR
  • hyperswitch PR
  • citoid -- PR
  • service-template-node -- PR

Details

Related Gerrit Patches:
[mediawiki/services/mobileapps@master] update service-runner to 2.8.0 and implement new metrics api
[mediawiki/services/citoid@master] update service-runner to 2.8.0 and implement new metrics api
[operations/puppet@production] lvs, prometheus, profile: add blackbox job helper and enable openapi scrapes
[operations/puppet@production] hiera: update ores to pass statsd through the statsd exporter
[operations/puppet@production] swift: stop relaying to statsd/statsite
[operations/puppet@production] hiera: update ores to pass statsd through statsd_exporter
[operations/puppet@production] profile, prometheus, role: install swagger exporter on prometheus nodes
[operations/puppet@production] logstash: stop relaying to central statsd
[operations/puppet@production] hiera: disable statsd_exporter::relay_address on logstash nodes
[operations/puppet@production] profile: use prometheus for logstash alerting
[operations/puppet@production] prometheus: make statsd.relay-address toggle-able
[operations/puppet@production] thumbor: stop relaying to statsd/statsite
[operations/puppet@production] swift: remove statsite
[operations/puppet@production] swift: stop relaying to statsd/statsite
[operations/puppet@production] swift: port alerts to Prometheus
[operations/puppet@production] grafana: use Prometheus swift metrics for dashboard
[operations/puppet@production] prometheus: collect swift account/container stats globally
[operations/puppet@production] hiera: fix statsd rules
[operations/puppet@production] logstash: update statsd exporter mappings and use exporter
[operations/deployment-charts@master] add statsd_exporter config to mathoid
[operations/puppet@production] logstash: output webrequest 5xx metrics
[operations/puppet@production] scb: enable statsd_exporter and add matching rules
[operations/puppet@production] mediawiki: enable statsd_exporter and add matching rules to appserver
[operations/puppet@production] varnish: enable statsd_exporter and add matching rules
[operations/puppet@production] proton: enable statsd_exporter and add matching rules to profile::proton
[operations/puppet@production] profile: enable statsd_exporter and add matching rules to logstash::collector
[operations/puppet@production] profile: enable statsd_exporter and add matching rules to ores::worker
[operations/puppet@production] ci: use statsite for localhost statsd aggregation
[operations/puppet@production] statsite: move from role to profile
[operations/puppet@production] hieradata: send periodic swift stats to localhost
[operations/puppet@production] hieradata: switch all swift statsd traffic to statsd_exporter
[operations/puppet@production] swift: add statsd mappings for periodic metrics
[operations/puppet@production] swift: turn on statsd_exporter in eqiad
[operations/puppet@production] swift: turn on statsd_exporter in codfw
[operations/puppet@production] thumbor: relay statsd_exporter metrics to localhost
[operations/puppet@production] swift: add statsd_port parameter
[operations/puppet@production] hieradata: add statsd_exporter mappings for swift::storage
[operations/puppet@production] hieradata: rename swift proxy statsd_exporter mapping
[operations/puppet@production] swift: set statsd_exporter to relay to local statsd
[operations/puppet@production] hieradata: add statsd_exporter mappings for swift-proxy
[operations/puppet@production] swift: enable statsd_exporter
[operations/puppet@production] prometheus: set defaults for statsd_exporter
[operations/puppet@production] prometheus: add jobs for statsd_exporter
[operations/puppet@production] thumbor: add missing statsd_exporter mappings
[operations/puppet@production] thumbor: use statsd_exporter
[operations/puppet@production] thumbor: add prometheus-statsd-exporter
[operations/puppet@production] thumbor: set name for all statsd_exporter metrics
[operations/puppet@production] thumbor: fix missing statsd_exporter mappings
[operations/puppet@production] statsd_exporter: fix commandline flags
[operations/puppet@production] New class: prometheus::statsd_exporter
[operations/debs/prometheus-statsd-exporter@master] debian: add patch for inline udp usage

Event Timeline

There are a very large number of changes, so older changes are hidden.
CDanis added a subscriber: CDanis. Dec 19 2018, 4:57 PM

Change 480943 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] logstash: output webrequest 5xx metrics

https://gerrit.wikimedia.org/r/480943

Change 481110 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] mediawiki: enable statsd_exporter and add matching rules to appserver

https://gerrit.wikimedia.org/r/481110

Change 482350 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] varnish: enable statsd_exporter and add matching rules

https://gerrit.wikimedia.org/r/482350

Change 482718 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/deployment-charts@master] add statsd_exporter config to mathoid

https://gerrit.wikimedia.org/r/482718

colewhite updated the task description. Jan 11 2019, 5:45 PM
fgiunchedi updated the task description. Jan 14 2019, 10:21 AM

Change 484586 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] scb: enable statsd_exporter and add matching rules

https://gerrit.wikimedia.org/r/484586

Change 482718 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] add statsd_exporter config to mathoid

https://gerrit.wikimedia.org/r/482718

Status update: 4 out of 40 services (swift, ores, thumbor, logstash) now have their metrics collected by Prometheus by way of statsd_exporter, so 10% at the moment. mathoid on k8s is on its way to having Prometheus metrics too.

Based on the experience in this task, we've developed guidelines for service owners moving their services to k8s on how to write statsd_exporter mappings, at https://wikitech.wikimedia.org/wiki/Prometheus/statsd_k8s. The guidelines will need more feedback/scrutiny.

MediaWiki remains the biggest producer of statsd traffic at the moment and emits a multitude of metrics, sometimes inconsistent, as highlighted by Krinkle on https://gerrit.wikimedia.org/r/c/operations/puppet/+/481110. We'll need to think of a more streamlined approach to tackle mw metrics for Prometheus.

fgiunchedi moved this task from In progress to Up next on the observability board. Mar 18 2019, 1:58 PM

Mentioned in SAL (#wikimedia-operations) [2019-03-20T10:26:57Z] <godog> reimage prometheus1003 with stretch - T205870

Change 526782 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] logstash: update statsd exporter mappings and use exporter

https://gerrit.wikimedia.org/r/526782

Change 526782 merged by Cwhite:
[operations/puppet@production] logstash: update statsd exporter mappings and use exporter

https://gerrit.wikimedia.org/r/526782

Change 527221 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: fix statsd rules

https://gerrit.wikimedia.org/r/527221

Change 527221 merged by Cwhite:
[operations/puppet@production] hiera: fix statsd rules

https://gerrit.wikimedia.org/r/527221

fgiunchedi renamed this task from "Provision >= 50% of statsd/Graphite-only metrics in Prometheus" to "Fully migrate >= 30% of producers off statsd". Aug 13 2019, 1:12 PM
fgiunchedi updated the task description.
fgiunchedi updated the task description. Sep 5 2019, 9:29 AM
fgiunchedi updated the task description. Sep 6 2019, 9:30 AM
fgiunchedi updated the task description. Sep 6 2019, 10:14 AM
fgiunchedi updated the task description. Sep 6 2019, 10:19 AM
fgiunchedi updated the task description. Sep 6 2019, 10:40 AM
fgiunchedi updated the task description. Sep 6 2019, 12:46 PM
fgiunchedi updated the task description. Sep 6 2019, 1:07 PM

Change 535148 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] logstash: stop relaying to central statsd

https://gerrit.wikimedia.org/r/535148

Change 535149 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: collect swift account/container stats globally

https://gerrit.wikimedia.org/r/535149

Change 535149 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: collect swift account/container stats globally

https://gerrit.wikimedia.org/r/535149

Change 535180 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] grafana: use Prometheus swift metrics for dashboard

https://gerrit.wikimedia.org/r/535180

Change 535182 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: port object diff alerts to Prometheus

https://gerrit.wikimedia.org/r/535182

Change 535180 merged by Filippo Giunchedi:
[operations/puppet@production] grafana: use Prometheus swift metrics for dashboard

https://gerrit.wikimedia.org/r/535180

Change 535515 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: stop relaying to statsd/statsite

https://gerrit.wikimedia.org/r/535515

Change 535182 merged by Filippo Giunchedi:
[operations/puppet@production] swift: port alerts to Prometheus

https://gerrit.wikimedia.org/r/535182

Change 535591 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thumbor: stop relaying to statsd/statsite

https://gerrit.wikimedia.org/r/535591

Change 535515 merged by Filippo Giunchedi:
[operations/puppet@production] swift: stop relaying to statsd/statsite

https://gerrit.wikimedia.org/r/535515

Change 536146 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: remove statsite

https://gerrit.wikimedia.org/r/536146

fgiunchedi updated the task description. Sep 12 2019, 12:32 PM
fgiunchedi updated the task description. Sep 12 2019, 12:38 PM

Change 536358 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] profile: use prometheus for logstash alerting

https://gerrit.wikimedia.org/r/536358

Change 536365 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] prometheus: make statsd.relay-address toggle-able

https://gerrit.wikimedia.org/r/536365

colewhite updated the task description. Sep 12 2019, 10:42 PM
fgiunchedi updated the task description. Sep 13 2019, 2:41 PM

Change 536146 merged by Filippo Giunchedi:
[operations/puppet@production] swift: remove statsite

https://gerrit.wikimedia.org/r/536146

Change 535591 merged by Filippo Giunchedi:
[operations/puppet@production] thumbor: stop relaying to statsd/statsite

https://gerrit.wikimedia.org/r/535591

Mentioned in SAL (#wikimedia-operations) [2019-09-16T12:44:03Z] <godog> stop thumbor traffic to statsd/graphite, use Prometheus only and replace Thumbor dashboard - T205870

fgiunchedi updated the task description. Sep 16 2019, 12:45 PM

Change 536365 merged by Cwhite:
[operations/puppet@production] prometheus: make statsd.relay-address toggle-able

https://gerrit.wikimedia.org/r/536365

Change 537561 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: disable statsd relay_address on logstash nodes

https://gerrit.wikimedia.org/r/537561

Change 536358 merged by Cwhite:
[operations/puppet@production] profile: use prometheus for logstash alerting

https://gerrit.wikimedia.org/r/536358

colewhite updated the task description. Sep 18 2019, 9:33 PM

Change 479139 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] ci: define statsd prometheus exporter mappings

https://gerrit.wikimedia.org/r/479139

Change 537561 merged by Cwhite:
[operations/puppet@production] hiera: disable statsd_exporter::relay_address on logstash nodes

https://gerrit.wikimedia.org/r/537561

colewhite updated the task description. Sep 19 2019, 10:41 PM
colewhite updated the task description.

Change 535148 abandoned by Filippo Giunchedi:
logstash: stop relaying to central statsd

Reason:
Obsoleted by I82d4f7be5

https://gerrit.wikimedia.org/r/535148

fgiunchedi updated the task description. Sep 24 2019, 12:46 PM
fgiunchedi updated the task description. Sep 24 2019, 12:50 PM
fgiunchedi updated the task description. Sep 24 2019, 2:32 PM

Change 538976 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: update ores to pass statsd through statsd_exporter

https://gerrit.wikimedia.org/r/538976

Krinkle removed a subscriber: Krinkle. Sep 25 2019, 8:21 PM

Change 541619 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] profile, prometheus: install swagger exporter on icinga

https://gerrit.wikimedia.org/r/541619

Change 542472 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] profile: added swagger exporter jobs at svc endpoints

https://gerrit.wikimedia.org/r/542472

Restricted Application added a subscriber: Masumrezarock100. Oct 11 2019, 5:27 PM

Change 541619 merged by Cwhite:
[operations/puppet@production] profile, prometheus, role: install swagger exporter on prometheus nodes

https://gerrit.wikimedia.org/r/541619

Change 538976 merged by Alexandros Kosiaris:
[operations/puppet@production] hiera: update ores to pass statsd through statsd_exporter

https://gerrit.wikimedia.org/r/538976

colewhite renamed this task from "Fully migrate >= 30% of producers off statsd" to "Fully migrate producers off statsd". Nov 6 2019, 4:37 PM
colewhite updated the task description. Thu, Nov 21, 12:37 AM

Change 535188 merged by Filippo Giunchedi:
[operations/puppet@production] swift: stop relaying to statsd/statsite

https://gerrit.wikimedia.org/r/535188

Change 556052 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: update ores to pass statsd through the statsd exporter

https://gerrit.wikimedia.org/r/556052

Change 556052 merged by Cwhite:
[operations/puppet@production] hiera: update ores to pass statsd through the statsd exporter

https://gerrit.wikimedia.org/r/556052

colewhite updated the task description. Tue, Dec 10, 9:42 PM
colewhite updated the task description. Wed, Dec 11, 4:49 PM
colewhite updated the task description. Wed, Dec 11, 5:03 PM
colewhite updated the task description. Thu, Dec 12, 11:13 PM

Change 556834 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[mediawiki/services/mobileapps@master] update service-runner to 2.8.0 and implement new metrics api

https://gerrit.wikimedia.org/r/556834

Change 556420 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[mediawiki/services/citoid@master] update service-runner to 2.8.0 and implement new metrics api

https://gerrit.wikimedia.org/r/556420

colewhite updated the task description. Thu, Dec 12, 11:53 PM