
Provision >= 50% of statsd/Graphite-only metrics in Prometheus
Open, Normal, Public

Description

This task tracks porting statsd metrics traffic to Prometheus. There are two ways to do this: port the applications to use native Prometheus metrics, or deploy statsd_exporter to expose Prometheus metrics derived from statsd traffic. The latter approach has been tested successfully for Thumbor in T145867: Test making thumbor statsd metrics available from Prometheus.

This is an audit of statsd traffic received on graphite1001 over ten minutes, grouped by top-level metric name component and sorted by count. The list also contains some garbage/invalid names, which can be ignored.

# cat top_users_10m | grep -v PagePrev
202906054 MediaWiki
62807151 restbase
24205989 varnish
13457335 changeprop
5280308 ores
4256639 aqs
3800019 cpjobqueue
1961745 eventbus
1203080 kafka
 905410 mobileapps
 800150 varnishkafka
 771824 parsoid
 560644 tilerator
 461879 frontend
 248861 logstash
 145168 kartotherian
  48707 parsoid-tests
  32820 cxserver
  27991 graphoid
  25956 eventstreams
  19591 citoid
  18083 service_checker
  15730 recommendation-api
  13784 restbase-dev
  11832 mathoid
   9362 zuul
   8419 proton
   5713 ve
   4314 mw
   1526 media
    447 
    357 tileratorui
    348 swift
    305 eventlogging
    236 wikibase
    220 browsertime
    199 webpagetest
    145 gerrit
     40 varnis
     37 servers
     34 varn
     19 varni
     17 var
      7 }:1|c
      7     this
      7     return this;
      5 performance
      3 v

Generated with

timeout 10m ngrep -q -W byline . udp dst port 8125  | grep -v -e '^U ' -e '^$' | cut -f1,2,3 -d.  | pigz -9c > statsd_users_10m.gz

zcat statsd_users_10m.gz | cut -d. -f1 | sort | uniq -dc | sort -rn > top_users_10m
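
For readers who prefer to post-process the capture outside the shell, the counting step above can be sketched in Python. This is a minimal sketch, not the tooling used for the audit: the function name `top_level_counts` and the sample metric lines are hypothetical. It mirrors `cut -d. -f1 | sort | uniq -dc | sort -rn`, including the fact that `uniq -d` only prints names seen more than once, which is why no entry with a count of 1 appears in the list.

```python
from collections import Counter


def top_level_counts(lines, min_count=2):
    """Count statsd lines by their first dot-separated name component."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Like `cut -d. -f1`: a line without any dot is kept whole.
        counts[line.split(".", 1)[0]] += 1
    # Like `uniq -dc | sort -rn`: drop names seen only once, largest first.
    return [(name, n) for name, n in counts.most_common() if n >= min_count]


# Hypothetical sample lines, for illustration only.
sample = [
    "MediaWiki.foo.bar:1|c",
    "MediaWiki.baz.qux:12|ms",
    "restbase.external.v1:1|c",
    "restbase.internal.v1:3|ms",
    "restbase.internal.v2:3|ms",
    "zuul.pipeline.test:1|c",
]
result = top_level_counts(sample)
```

Keeping everything after the first dot out of the key is what groups malformed fragments such as `varnis` or `}:1|c` into their own rows, as seen in the audit output.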

Annotated list of producers above, with plan of action

Moving to k8s?

  • mathoid

Can be ignored, legacy and/or to be deprecated

  • kafka (only analytics-eqiad stats are left here; MediaWiki is its only client now, see T152015)
  • varnish (to be deprecated T184942)
  • varnishkafka (to be deprecated T196066)

Will move to Prometheus anyway (?)

  • frontend (generated by navtiming, see also T175087)
  • mw.performance (generated by navtiming, see also T175087)

TODO

Use global aggregation / percentiles

See also https://wikitech.wikimedia.org/wiki/Prometheus/statsd_k8s for an introduction for service owners on how to write their statsd_exporter mappings (in k8s, but guidelines are generic)

  • MediaWiki (some metrics come from statsv, e.g. MediaWiki.wikibase)
  • aqs
  • browsertime (via statsv; generated by WebPageReplay)
  • changeprop
  • citoid
  • cpjobqueue
  • cxserver
  • eventbus
  • eventlogging
  • eventstreams
  • graphoid
  • kartotherian
  • media (via statsv)
  • mobileapps
  • parsoid
  • parsoid-tests
  • proton
  • recommendation-api
  • restbase
  • restbase-dev
  • servers (used for servers.labnet1001.nova, otherwise to be deprecated when Diamond is decommissioned)
  • tilerator
  • tileratorui
  • ve (via statsv)
  • webpagetest (via statsv; generated by wpt-reporter from Jenkins)
  • wikibase (via statsv)
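
To make the idea of a statsd_exporter mapping more concrete for the producers listed above, here is a minimal Python sketch of what a glob-style mapping does: it matches a dotted statsd name, captures the components behind each `*`, and turns them into Prometheus labels. This is an illustrative reimplementation, not statsd_exporter's own code, and the function `apply_mapping`, the metric name `restbase.external.v1_page_html.GET.200`, and the resulting metric and label names are all hypothetical.

```python
import re


def apply_mapping(metric, match, name, labels):
    """Translate a dotted statsd name into (prometheus_name, labels), or None."""
    # Each '*' in the glob captures exactly one dot-separated component.
    pattern = "^" + r"\.".join(
        "([^.]+)" if part == "*" else re.escape(part)
        for part in match.split(".")
    ) + "$"
    m = re.match(pattern, metric)
    if m is None:
        return None

    def expand(value):
        # Replace $1, $2, ... with the corresponding captured component.
        return re.sub(r"\$(\d+)", lambda ref: m.group(int(ref.group(1))), value)

    return name, {key: expand(value) for key, value in labels.items()}


# Hypothetical statsd name and mapping, for illustration only.
mapped = apply_mapping(
    "restbase.external.v1_page_html.GET.200",
    match="restbase.external.*.*.*",
    name="restbase_external_requests",
    labels={"endpoint": "$1", "method": "$2", "status": "$3"},
)
unmatched = apply_mapping("varnish.foo", "restbase.external.*.*.*", "x", {})
```

The point of the exercise is that variable parts of a statsd name (endpoint, method, status) end up as labels on a single Prometheus metric rather than as separate metric names.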

Event Timeline


Change 465414 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/debs/prometheus-statsd-exporter@master] debian: add patch for inline udp usage

https://gerrit.wikimedia.org/r/465414

fgiunchedi updated the task description. Oct 9 2018, 1:27 PM

Change 465428 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] WIP: statsd_exporter

https://gerrit.wikimedia.org/r/465428

Change 465608 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] WIP: add statsd-exporter to thumbor

https://gerrit.wikimedia.org/r/465608

Change 465414 merged by Filippo Giunchedi:
[operations/debs/prometheus-statsd-exporter@master] debian: add patch for inline udp usage

https://gerrit.wikimedia.org/r/465414

Mentioned in SAL (#wikimedia-operations) [2018-10-15T13:02:36Z] <godog> upload prometheus-statsd-exporter 0.7.0 - T205870

Change 465428 merged by Filippo Giunchedi:
[operations/puppet@production] New class: prometheus::statsd_exporter

https://gerrit.wikimedia.org/r/465428

Change 465608 merged by Filippo Giunchedi:
[operations/puppet@production] thumbor: add prometheus-statsd-exporter

https://gerrit.wikimedia.org/r/465608

Change 467659 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] statsd_exporter: fix commandline flags

https://gerrit.wikimedia.org/r/467659

Change 467659 merged by Filippo Giunchedi:
[operations/puppet@production] statsd_exporter: fix commandline flags

https://gerrit.wikimedia.org/r/467659

Mentioned in SAL (#wikimedia-operations) [2018-10-17T14:33:39Z] <godog> upload prometheus-statsd-exporter 0.7.0+ds1-2 - T205870

Change 467980 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thumbor: fix missing statsd_exporter mappings

https://gerrit.wikimedia.org/r/467980

Change 467980 merged by Filippo Giunchedi:
[operations/puppet@production] thumbor: fix missing statsd_exporter mappings

https://gerrit.wikimedia.org/r/467980

Change 467986 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thumbor: set name for all statsd_exporter metrics

https://gerrit.wikimedia.org/r/467986

Change 467986 merged by Filippo Giunchedi:
[operations/puppet@production] thumbor: set name for all statsd_exporter metrics

https://gerrit.wikimedia.org/r/467986

Change 467988 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thumbor: use statsd_exporter

https://gerrit.wikimedia.org/r/467988

Change 467988 merged by Filippo Giunchedi:
[operations/puppet@production] thumbor: use statsd_exporter

https://gerrit.wikimedia.org/r/467988

Mentioned in SAL (#wikimedia-operations) [2018-10-23T09:13:41Z] <godog> roll-restart thumbor to send statsd traffic through statsd_exporter - T205870

Change 469179 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thumbor: add missing statsd_exporter mappings

https://gerrit.wikimedia.org/r/469179

Change 469179 merged by Filippo Giunchedi:
[operations/puppet@production] thumbor: add missing statsd_exporter mappings

https://gerrit.wikimedia.org/r/469179

Change 469182 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add jobs for statsd_exporter

https://gerrit.wikimedia.org/r/469182

Change 469182 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add jobs for statsd_exporter

https://gerrit.wikimedia.org/r/469182

Change 469200 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: set defaults for statsd_exporter

https://gerrit.wikimedia.org/r/469200

Change 469200 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: set defaults for statsd_exporter

https://gerrit.wikimedia.org/r/469200

Change 470830 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: enable statsd_exporter

https://gerrit.wikimedia.org/r/470830

Change 470830 merged by Filippo Giunchedi:
[operations/puppet@production] swift: enable statsd_exporter

https://gerrit.wikimedia.org/r/470830

Change 470873 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: add statsd_exporter mappings for swift-proxy

https://gerrit.wikimedia.org/r/470873

Change 470874 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: set statsd_exporter to relay to local statsd

https://gerrit.wikimedia.org/r/470874

Change 470873 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: add statsd_exporter mappings for swift-proxy

https://gerrit.wikimedia.org/r/470873

Change 470874 merged by Filippo Giunchedi:
[operations/puppet@production] swift: set statsd_exporter to relay to local statsd

https://gerrit.wikimedia.org/r/470874

Change 471292 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: rename swift proxy statsd_exporter mapping

https://gerrit.wikimedia.org/r/471292

Change 471293 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: add statsd_exporter mappings for swift::storage

https://gerrit.wikimedia.org/r/471293

Change 471292 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: rename swift proxy statsd_exporter mapping

https://gerrit.wikimedia.org/r/471292

Change 471293 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: add statsd_exporter mappings for swift::storage

https://gerrit.wikimedia.org/r/471293

fgiunchedi updated the task description. Nov 12 2018, 11:00 AM

Change 472986 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: add statsd_port parameter

https://gerrit.wikimedia.org/r/472986

Change 472986 merged by Filippo Giunchedi:
[operations/puppet@production] swift: add statsd_port parameter

https://gerrit.wikimedia.org/r/472986

Change 472996 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thumbor: relay statsd_exporter metrics to localhost

https://gerrit.wikimedia.org/r/472996

Change 472996 merged by Filippo Giunchedi:
[operations/puppet@production] thumbor: relay statsd_exporter metrics to localhost

https://gerrit.wikimedia.org/r/472996

Change 473006 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: turn on statsd_exporter in codfw

https://gerrit.wikimedia.org/r/473006

Change 473006 merged by Filippo Giunchedi:
[operations/puppet@production] swift: turn on statsd_exporter in codfw

https://gerrit.wikimedia.org/r/473006

Change 473519 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: turn on statsd_exporter in eqiad

https://gerrit.wikimedia.org/r/473519

Change 473519 merged by Filippo Giunchedi:
[operations/puppet@production] swift: turn on statsd_exporter in eqiad

https://gerrit.wikimedia.org/r/473519

Change 473704 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: add statsd mappings for periodic metrics

https://gerrit.wikimedia.org/r/473704

Change 473704 merged by Filippo Giunchedi:
[operations/puppet@production] swift: add statsd mappings for periodic metrics

https://gerrit.wikimedia.org/r/473704

fgiunchedi updated the task description. Nov 15 2018, 9:53 AM

Change 473709 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] WIP: hieradata switch all swift statsd traffic to statsd_exporter

https://gerrit.wikimedia.org/r/473709

Change 473709 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: switch all swift statsd traffic to statsd_exporter

https://gerrit.wikimedia.org/r/473709

Change 473729 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: send periodic swift stats to localhost

https://gerrit.wikimedia.org/r/473729

Change 473729 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: send periodic swift stats to localhost

https://gerrit.wikimedia.org/r/473729

Change 474125 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] statsite: move from role to profile

https://gerrit.wikimedia.org/r/474125

Change 474128 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] ci: use statsite for localhost statsd aggregation

https://gerrit.wikimedia.org/r/474128

Change 474125 merged by Filippo Giunchedi:
[operations/puppet@production] statsite: move from role to profile

https://gerrit.wikimedia.org/r/474125

Change 474128 merged by Filippo Giunchedi:
[operations/puppet@production] ci: use statsite for localhost statsd aggregation

https://gerrit.wikimedia.org/r/474128

hashar added a subscriber: hashar. Edited Nov 16 2018, 3:31 PM

The gerrit.* metrics are actually reported by the Zuul service on contint1001. They correspond to events received by Zuul from Gerrit (patchsets, changes, comments...). The related Grafana board is https://grafana.wikimedia.org/dashboard/db/releng-gerrit

hashar updated the task description. Nov 16 2018, 3:32 PM
Krinkle updated the task description. Dec 9 2018, 2:26 AM
Krinkle updated the task description. Dec 9 2018, 2:29 AM
Krinkle added a project: Performance-Team. Edited Dec 9 2018, 2:35 AM
Krinkle added subscribers: Peter, Krinkle.
  • performance (generated from where?)

@Peter When I looked at the data under performance.* on the Graphite server, it looked like it might be unused, left over from an early trial, with the data now living under browsertime.* instead. Is that correct? It contains two logical metrics, not updated since the Graphite upgrade on Aug 23 (which reset the modified dates):

- performance
  - webpagereplay
    - BarackObama.SpeedIndex
      - max/min/median/mdev
  - browsertime
    - BarackObama.SpeedIndex
      - max/min/median/mdev

Change 479139 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] ci: define statsd prometheus exporter mappings

https://gerrit.wikimedia.org/r/479139

Change 479353 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] profile: enable statsd_exporter and add matching rules to logstash::collector

https://gerrit.wikimedia.org/r/479353

Change 479563 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] profile: enable statsd_exporter and add matching rules to ores::worker

https://gerrit.wikimedia.org/r/479563

hashar removed a subscriber: hashar. Dec 14 2018, 9:16 AM
Peter added a comment. Dec 17 2018, 2:16 PM

Yes, browsertime is the correct one; we should remove the other one.

fgiunchedi updated the task description. Dec 17 2018, 2:17 PM

Change 479563 merged by Cwhite:
[operations/puppet@production] profile: enable statsd_exporter and add matching rules to ores::worker

https://gerrit.wikimedia.org/r/479563

Change 480259 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] proton: enable statsd_exporter and add matching rules to profile::proton

https://gerrit.wikimedia.org/r/480259

Change 479353 merged by Cwhite:
[operations/puppet@production] profile: enable statsd_exporter and add matching rules to logstash::collector

https://gerrit.wikimedia.org/r/479353

colewhite updated the task description. Dec 18 2018, 10:55 PM
colewhite updated the task description.
CDanis added a subscriber: CDanis. Dec 19 2018, 4:57 PM

Change 480943 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] logstash: output webrequest 5xx metrics

https://gerrit.wikimedia.org/r/480943

Change 481110 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] mediawiki: enable statsd_exporter and add matching rules to appserver

https://gerrit.wikimedia.org/r/481110

Change 482350 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] varnish: enable statsd_exporter and add matching rules

https://gerrit.wikimedia.org/r/482350

Change 482718 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/deployment-charts@master] add statsd_exporter config to mathoid

https://gerrit.wikimedia.org/r/482718

colewhite updated the task description. Jan 11 2019, 5:45 PM
fgiunchedi updated the task description. Jan 14 2019, 10:21 AM

Change 484586 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] scb: enable statsd_exporter and add matching rules

https://gerrit.wikimedia.org/r/484586

Change 482718 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] add statsd_exporter config to mathoid

https://gerrit.wikimedia.org/r/482718

Status update: 4 of 40 services (swift, ores, thumbor, logstash) now have their metrics collected by Prometheus via statsd_exporter, so 10% at the moment. mathoid on k8s is on its way to having Prometheus metrics too.

For service owners moving their services to k8s, and based on the experience in this task, we've developed guidelines on how to write statsd_exporter mappings at https://wikitech.wikimedia.org/wiki/Prometheus/statsd_k8s. These will need more feedback/scrutiny.

MediaWiki remains the biggest producer of statsd traffic at the moment and contains a multitude of metrics, some of them inconsistent, as highlighted by Krinkle on https://gerrit.wikimedia.org/r/c/operations/puppet/+/481110. We'll need to think of a more streamlined approach to tackle MediaWiki metrics for Prometheus.
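
The progress figure in the status update above can be checked with a few lines of Python. This is a sketch under one stated assumption: progress is counted per service (4 of 40), as the status update does, rather than per metric or by traffic volume. The helper name `coverage` is hypothetical.

```python
import math


def coverage(ported, total, target=0.5):
    """Return (fraction ported, additional services needed to reach target)."""
    # Assumption: every service counts equally, regardless of traffic volume.
    needed = max(0, math.ceil(target * total) - ported)
    return ported / total, needed


fraction, remaining = coverage(ported=4, total=40)
print(f"{fraction:.0%} ported, {remaining} more services to reach 50%")
```

Counted this way, the 50% goal in the task title corresponds to 20 services, so 16 more beyond the current 4.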

fgiunchedi moved this task from In progress to Up next on the observability board. Mar 18 2019, 1:58 PM

Mentioned in SAL (#wikimedia-operations) [2019-03-20T10:26:57Z] <godog> reimage prometheus1003 with stretch - T205870