This task tracks porting statsd metrics traffic to Prometheus, either by porting applications to use native Prometheus metrics or by deploying `statsd_exporter` to expose Prometheus metrics derived from statsd traffic. The latter approach has been tested successfully for Thumbor in {T145867}.
This is an audit of statsd traffic received on the graphite host, sorted by top-level metric name. The list also contains some garbage/invalid names, which can be ignored.
{P8742}
Generated with:
```
timeout 10m ngrep -q -W byline . udp dst port 8125 | grep -v -e '^U ' -e '^$' | cut -f1,2,3 -d. | pigz -9c > statsd_users_10m.gz
zcat statsd_users_10m.gz | cut -d. -f1 | sort | uniq -dc | sort -rn > top_users_10m
```
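To illustrate what the second command computes, here is a minimal sketch run on a few made-up statsd lines. The metric names are hypothetical sample data, not taken from the capture. Note that `uniq -d` prints only names that occur more than once, so a producer seen a single time in the capture does not appear in the resulting list.

```shell
# Hypothetical sample of captured statsd lines (already cut to three components).
sample='MediaWiki.foo.bar:1|c
MediaWiki.foo.baz:3|ms
thumbor.response.time:12|ms
MediaWiki.api.hits:1|c
ores.scores.count:1|c
thumbor.response.status:1|c'

# Same aggregation as the command above: keep the top-level name, count
# occurrences, sort by descending count. -d drops names seen only once.
top_users() {
  cut -d. -f1 | sort | uniq -dc | sort -rn
}

printf '%s\n' "$sample" | top_users
```

With this sample, `MediaWiki` is listed with a count of 3 and `thumbor` with a count of 2, while `ores` (seen once) is omitted.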
The capture is not the whole story, as it turns out: looking at the mtime of **graphite** whisper files, a few infrequent statsd producers came up:
```
deploy
scap
gunicorn
```
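One way to surface such infrequent producers is to list the top-level whisper hierarchies that contain recently modified files. This is a minimal sketch assuming GNU `find`; the function name, the whisper root path and the 30-day window are assumptions for illustration, not values taken from this task.

```shell
# List top-level metric hierarchies under a whisper root that contain at
# least one .wsp file modified within the last N days.
# Usage: recent_top_levels ROOT DAYS
recent_top_levels() {
  root=$1
  days=$2
  # %P prints each path relative to ROOT; the first path component is the
  # top-level metric name.
  find "$root" -type f -name '*.wsp' -mtime "-$days" -printf '%P\n' \
    | cut -d/ -f1 | sort -u
}

# Hypothetical invocation; adjust the path to the local carbon storage directory.
# recent_top_levels /var/lib/carbon/whisper 30
```

Comparing that output against the top-level names from the traffic capture shows which producers write too rarely to be seen in a 10-minute window.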
In addition to the statsd traffic hitting `statsd.eqiad.wmnet` listed above, a few other producers go by way of statsd -> statsite (on localhost) -> graphite protocol to graphite1004:
```
thumbor
swift
zuul
```
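For reference, the two hops in that path use different line formats: producers emit statsd lines to the local aggregator, which then flushes aggregated values using the Graphite plaintext protocol. A minimal sketch of both formats follows; the helper names, metric name, values and timestamp are hypothetical.

```shell
# Format a statsd counter increment: <name>:<value>|c
statsd_counter() {
  printf '%s:%s|c\n' "$1" "$2"
}

# Format a Graphite plaintext-protocol line: <path> <value> <unix timestamp>
graphite_line() {
  printf '%s %s %s\n' "$1" "$2" "$3"
}

statsd_counter thumbor.response.count 1
graphite_line thumbor.response.count 42 1546300800
```

Because these producers never send statsd packets over the network to `statsd.eqiad.wmnet`, they do not show up in a capture of UDP port 8125 on the graphite host.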
Annotated list of producers above, with plan of action:
== statsv-produced metrics, see also T180105
Metrics are in Prometheus. Grafana dashboards need to be migrated.
[] mw.js.deprecate (generated client-side from mediawiki/extensions/WikimediaEvents)
[] mw.performance
[] browsertime (from WebPageReplay)
[] ve
[] Some metrics in MediaWiki hierarchy, e.g. minerva.WebClientError
[] pagepreviews (top level, PagePreviewsApiFailure/ PagePreviewsApiResponse/ PagePreviewsPreviewShow/)
[] media.thumbnail.client
[] webpagetest (generated by wpt-reporter from Jenkins)
[] wikibase.queryService.ui
== navtiming-produced metrics, see also T175087
Prometheus counters for navtiming: https://gerrit.wikimedia.org/r/c/performance/navtiming/+/534771
[] frontend
[] mw.performance.save*
[] eventlogging.client_errors.navigation/painttiming
[] performance.survey
== TODO
[x] logstash - turn off statsd-exporter relaying to statsd.eqiad.wmnet + --update alerts--
[x] ores - ~~consider native Prometheus support?~~ statsd exporter for now
[x] service_checker - the idea is to move to a blackbox_exporter-like model, see also https://gerrit.wikimedia.org/r/c/operations/software/service-checker/+/532807 and https://gerrit.wikimedia.org/g/operations/debs/prometheus-swagger-exporter
[x] swift
[x] thumbor - turn off statsd-exporter relaying to statsd.eqiad.wmnet + update dashboards/alerts
[] gerrit - Emitted by Zuul service. Example usage: https://grafana.wikimedia.org/dashboard/db/releng-gerrit (https://gerrit.wikimedia.org/r/c/operations/puppet/+/479139)
[] zuul - Example: https://grafana.wikimedia.org/dashboard/db/zuul , bottom of https://integration.wikimedia.org/zuul/ (https://gerrit.wikimedia.org/r/c/operations/puppet/+/479139)
[] cloudvps - from `nova_fullstack_test.py` (see also T210850 for more context)
[] scap - from scap
[] deploy - from scap
[] gunicorn - from superset
== Use global aggregation / percentiles
See also https://wikitech.wikimedia.org/wiki/Prometheus/statsd_k8s for an introduction for service owners on how to write their statsd_exporter mappings (written for k8s, but the guidelines are generic). Some of the services below use `service-runner`, for which some statsd metrics will need reconsideration (cf. {T222795}).
[] MediaWiki. Note that this namespace has multiple producers, not only MediaWiki. Some metrics come from statsv (e.g. `MediaWiki.wikibase`), some from `reportupdater-queries` (e.g. `Mediawiki.CodeMirror`)
[] ~~eventbus~~ decommissioned in {T232122}
[] ~~eventlogging.overall.inserted~~ - from eventlogging.handlers sql_writer handler, will be deprecated, see T159170
Dependent on Service-Runner:
[] aqs [[ https://gerrit.wikimedia.org/r/c/analytics/aqs/+/558696 | PR ]]
[] changeprop/cpjobqueue - scheduled to be moved to k8s [[ https://github.com/wikimedia/change-propagation/pull/334 | PR ]]
[x] eventstreams - scheduled to be moved to k8s [[ https://gerrit.wikimedia.org/r/c/mediawiki/services/eventstreams/+/559568 | PR ]]
[x] eventgate
[] kartotherian, tilerator, tileratorui -- [[ https://gerrit.wikimedia.org/r/c/mediawiki/services/kartotherian/+/556250 | PR ]]
[] mobileapps - some parts are moving to k8s [[ https://gerrit.wikimedia.org/r/c/mediawiki/services/mobileapps/+/556834 | PR ]]
[x] service-template-node [[ https://github.com/wikimedia/service-template-node/pull/127 | PR ]]
[] proton [[ https://gerrit.wikimedia.org/r/c/mediawiki/services/chromium-render/+/558213 | PR ]]
[] recommendation-api [[ https://gerrit.wikimedia.org/r/c/mediawiki/services/recommendation-api/+/558184 | PR ]]
[] restbase - (parts?) moving to k8s [[ https://github.com/wikimedia/restbase/pull/1232 | PR ]]
[] hyperswitch [[ https://github.com/wikimedia/hyperswitch/pull/114 | PR ]]
[] citoid -- [[ https://gerrit.wikimedia.org/r/c/mediawiki/services/citoid/+/556420 | PR ]]