This task tracks porting statsd metrics traffic to Prometheus, either by porting the applications to native Prometheus metrics or by deploying `statsd_exporter` to expose Prometheus metrics derived from statsd traffic. The latter approach has been tested successfully for Thumbor in {T145867}.
This is an audit of statsd traffic received on the graphite host, sorted by top-level metric name. The list also contains some garbage/invalid names, which can be ignored.
{P8742}
Generated with:
```
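# Capture 10 minutes of statsd UDP payloads, keeping the first three dot-separated name components
timeout 10m ngrep -q -W byline . udp dst port 8125 | grep -v -e '^U ' -e '^$' | cut -f1,2,3 -d. | pigz -9c > statsd_users_10m.gz
# Count lines per top-level component; uniq -d drops components seen only once
zcat statsd_users_10m.gz | cut -d. -f1 | sort | uniq -dc | sort -rn > top_users_10m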
```
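To illustrate what the second command produces, here is a minimal sketch that runs the same `cut | sort | uniq -dc | sort -rn` aggregation on a few made-up metric lines instead of a live capture. The helper name `top_level_counts` and the sample metrics are hypothetical and only stand in for the real capture file.

```shell
# Hypothetical helper: count statsd lines per top-level name component.
# Reads one statsd line (dotted metric name first) per line on stdin.
top_level_counts() {
    # uniq -c prepends the count, -d drops components seen only once
    cut -d. -f1 | sort | uniq -dc | sort -rn
}

# Made-up sample input, standing in for the decompressed ngrep capture
printf '%s\n' \
    'MediaWiki.wikibase.foo:1|c' \
    'MediaWiki.api.bar:12|ms' \
    'restbase.req.baz:1|c' \
    'MediaWiki.edit.qux:1|c' \
    'restbase.req.quux:3|ms' \
    'garbage' |
    top_level_counts
```

With this sample, `MediaWiki` is listed first with a count of 3, followed by `restbase` with a count of 2. The single `garbage` line is dropped by `uniq -d`, which is also why one-off invalid names are less likely to show up in the audit.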
The capture above is not the whole story, however: the mtimes of the **graphite** whisper files revealed a few infrequent statsd producers:
```
deploy
scap
gunicorn
```
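The mtime check could be sketched as below. This is not the command that was actually used. The helper name `recent_top_levels`, the whisper root path and the 30-day window are all assumptions for illustration, as is the directory layout `<root>/<top-level>/.../<metric>.wsp`.

```shell
# Hypothetical helper: list top-level directories under a whisper root
# that contain at least one .wsp file modified within the last N days.
# $1: whisper root directory (assumed layout: <root>/<top-level>/.../<metric>.wsp)
# $2: window in days
recent_top_levels() {
    # Run find relative to the root so every path starts with ./<top-level>/
    (cd "$1" && find . -type f -name '*.wsp' -mtime "-$2") |
        cut -d/ -f2 | sort -u
}

# Example usage (the path is an assumption, adjust to the actual whisper root):
#   recent_top_levels /path/to/whisper 30
```

Comparing this output with `top_users_10m` would then show producers that write too rarely to appear in a 10-minute capture.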
Besides the statsd traffic hitting `statsd.eqiad.wmnet` listed above, a few other producers send their metrics via statsd -> statsite (on localhost) -> graphite protocol to graphite1004:
```
thumbor
swift
zuul
```
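These producers do not show up in the UDP capture because their statsd lines only travel as far as the local statsite, and graphite1004 then receives graphite plaintext lines instead. A minimal sketch of the two wire formats follows. The helper names, metric name and values are made up, and the aggregation and naming that statsite applies in between are not shown.

```shell
# Hypothetical helpers that only format the two wire formats.
# statsd line: <name>:<value>|<type>   (c = counter, ms = timer, g = gauge)
statsd_line() {
    printf '%s:%s|%s\n' "$1" "$2" "$3"
}

# graphite plaintext line: <path> <value> <unix timestamp>
graphite_line() {
    printf '%s %s %s\n' "$1" "$2" "$3"
}

# Made-up metric name and values, for illustration only
statsd_line thumbor.example.count 1 c
graphite_line thumbor.example.count 42 "$(date +%s)"
```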
Annotated list of producers above, with plan of action:
== statsv-produced metrics, see also T180105
[] mw.js.deprecate (generated client-side from mediawiki/extensions/WikimediaEvents)
[] mw.performance
[] browsertime (from WebPageReplay)
[] ve
[] Some metrics in MediaWiki hierarchy, e.g. minerva.WebClientError
[] pagepreviews (top level: PagePreviewsApiFailure/, PagePreviewsApiResponse/, PagePreviewsPreviewShow/)
[] media.thumbnail.client
[] webpagetest (generated by wpt-reporter from Jenkins)
[] wikibase.queryService.ui
== navtiming-produced metrics, see also T175087
Prometheus counters for navtiming: https://gerrit.wikimedia.org/r/c/performance/navtiming/+/534771
[] frontend
[] mw.performance.save*
[] eventlogging.client_errors.navigation/painttiming
[] performance.survey
== TODO
[x] logstash - turn off statsd-exporter relaying to statsd.eqiad.wmnet + --update alerts--
[] ores - consider native Prometheus support?
[] service_checker - the idea is to move to a blackbox_exporter-like model, see also https://gerrit.wikimedia.org/r/c/operations/software/service-checker/+/532807 and https://gerrit.wikimedia.org/g/operations/debs/prometheus-swagger-exporter
[x] swift
[x] thumbor - turn off statsd-exporter relaying to statsd.eqiad.wmnet + update dashboards/alerts
[] gerrit - Emitted by Zuul service. Example usage: https://grafana.wikimedia.org/dashboard/db/releng-gerrit (https://gerrit.wikimedia.org/r/c/operations/puppet/+/479139)
[] zuul - Example: https://grafana.wikimedia.org/dashboard/db/zuul , bottom of https://integration.wikimedia.org/zuul/ (https://gerrit.wikimedia.org/r/c/operations/puppet/+/479139)
[] cloudvps - from `nova_fullstack_test.py` (see also T210850 for more context)
[] scap - from scap
[] deploy - from scap
[] gunicorn - from superset
== Use global aggregation / percentiles
See also https://wikitech.wikimedia.org/wiki/Prometheus/statsd_k8s for an introduction for service owners on how to write their statsd_exporter mappings (written for k8s, but the guidelines are generic). Some of the services below use `service-runner`, for which some statsd metrics will need reconsideration (cf. {T222795}).
[] MediaWiki (some metrics come from statsv, e.g. `MediaWiki.wikibase`)
[] ~~eventbus~~ decommissioned in {T232122}
[] ~~eventlogging.overall.inserted~~ - from the eventlogging.handlers sql_writer handler, will be deprecated, see T159170
Dependent on Service-Runner:
[] aqs
[] changeprop - scheduled to be moved to k8s
[] cpjobqueue - scheduled to be moved to k8s
[] eventstreams - scheduled to be moved to k8s
[] graphoid - under code stewardship review
[] kartotherian, tilerator, tileratorui -- [[https://github.com/shdubsh/mediawiki-services-kartotherian/commit/7d81da9007a0dbf76883f4ae9dc7240f404efdc8|commit]]
[] mobileapps - some parts are moving to k8s
[] parsoid - going to be migrated to pure PHP
[] parsoid-tests - going to be migrated to pure PHP
[] proton
[] recommendation-api
[] restbase - (parts?) moving to k8s
[] restbase-dev - (parts?) moving to k8s
[] citoid
[] service-template-node -- [[ https://github.com/wikimedia/service-template-node/pull/127 | PR ]]