This task tracks porting statsd metrics traffic to Prometheus, either by porting the applications to use native Prometheus metrics or by deploying statsd_exporter to expose Prometheus metrics derived from statsd traffic. The latter approach has been tested successfully for Thumbor in T145867: Test making thumbor statsd metrics available from Prometheus.
An audit can be generated with:
timeout 10m ngrep -q -W byline . udp dst port 8125 | grep -v -e '^U ' -e '^$' | cut -f1,2,3 -d. | pigz -9c > statsd_users_10m.gz
zcat statsd_users_10m.gz | cut -d. -f1 | sort | uniq -dc | sort -rn > top_users_10m
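For readers less familiar with the shell pipeline, the counting step of the audit can be sketched in Python. This is only an illustration of what the `grep -v` / `cut -d. -f1` / `sort | uniq -dc | sort -rn` chain computes, assuming the input is ngrep's `-W byline` output; the function name `top_prefixes` is hypothetical and not part of any existing tool.

```python
from collections import Counter


def top_prefixes(lines):
    """Count statsd payload lines per top-level (first dot-separated) prefix."""
    counts = Counter()
    for raw in lines:
        line = raw.rstrip("\n")
        # Skip empty lines and ngrep's "U src -> dst" packet headers,
        # like `grep -v -e '^U ' -e '^$'`.
        if line == "" or line.startswith("U "):
            continue
        # Keep only the first dot-separated field, like `cut -d. -f1`.
        counts[line.split(".", 1)[0]] += 1
    # Like `uniq -dc`, report only prefixes seen more than once,
    # ordered by descending count as `sort -rn` would.
    return [(prefix, n) for prefix, n in counts.most_common() if n > 1]
```

Note that, as with `uniq -dc`, a prefix seen only once during the capture window is dropped, which is one reason infrequent producers have to be found by other means.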
Looking at the mtime of graphite whisper files, a few infrequent statsd producers came up:
- deploy
- scap
- gunicorn
There are a few other producers that go by way of statsd -> statsite (on localhost) -> graphite protocol to graphite1004:
- thumbor
- swift
- zuul
Annotated list of producers above, with plan of action:
statsv-produced metrics, see also T180105
Metrics are in Prometheus; Grafana dashboards still need to be migrated.
- mw.js.deprecate (generated client-side from mediawiki/extensions/WikimediaEvents)
- mw.performance
- browsertime (from WebPageReplay)
- ve
- Some metrics in MediaWiki hierarchy, e.g. minerva.WebClientError
- pagepreviews (top level, PagePreviewsApiFailure/ PagePreviewsApiResponse/ PagePreviewsPreviewShow/)
- media.thumbnail.client
- webpagetest (generated by wpt-reporter from Jenkins)
- wikibase.queryService.ui
navtiming-produced metrics, see also T175087
Prometheus counters for navtiming: https://gerrit.wikimedia.org/r/c/performance/navtiming/+/534771
- frontend
- mw.performance.save*
- eventlogging.client_errors.navigation/paitingtiming
- performance.survey
TODO
- logstash - turn off statsd-exporter relaying to statsd.eqiad.wmnet + --update alerts--
- ores - ~consider native Prometheus support?~ statsd exporter for now
- service_checker - the idea is to move to a blackbox_exporter-like model, see also https://gerrit.wikimedia.org/r/c/operations/software/service-checker/+/532807 and https://gerrit.wikimedia.org/g/operations/debs/prometheus-swagger-exporter
- swift
- thumbor - turn off statsd-exporter relaying to statsd.eqiad.wmnet + update dashboards/alerts
- gerrit - Emitted by Zuul service. Example usage: https://grafana.wikimedia.org/dashboard/db/releng-gerrit (https://gerrit.wikimedia.org/r/c/operations/puppet/+/479139)
- zuul - Example: https://grafana.wikimedia.org/dashboard/db/zuul , bottom of https://integration.wikimedia.org/zuul/ (https://gerrit.wikimedia.org/r/c/operations/puppet/+/479139)
- cloudvps - from nova_fullstack_test.py (see also T210850 for more context)
- scap - from scap
- deploy - from scap
- gunicorn - from superset
Use global aggregation / percentiles
See also https://wikitech.wikimedia.org/wiki/Prometheus/statsd_k8s for an introduction for service owners on how to write their statsd_exporter mappings (written for k8s, but the guidelines are generic). Some of the services below use service-runner, for which some statsd metrics will need reconsideration (cf. T222795: Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats).
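To give a rough idea of what a statsd_exporter mapping does, here is a minimal Python sketch of glob-style mapping: a dotted statsd name is matched against a pattern in which each `*` stands for one dot-free component, and the captured components become Prometheus labels via `$1`, `$2`, and so on. This is not statsd_exporter's implementation (the real mappings are written in its YAML config), and the metric names in the usage note are made up for illustration.

```python
import re


def compile_glob(pattern):
    """Turn 'a.*.b' into a regex where each '*' captures one component."""
    parts = [
        r"([^.]+)" if part == "*" else re.escape(part)
        for part in pattern.split(".")
    ]
    return re.compile(r"\.".join(parts))


def apply_mapping(metric, pattern, name, labels):
    """Return (prometheus_name, labels) or None if the pattern does not match."""
    match = compile_glob(pattern).fullmatch(metric)
    if match is None:
        return None

    def expand(template):
        # Replace $N with the N-th captured component.
        return re.sub(r"\$(\d+)", lambda g: match.group(int(g.group(1))), template)

    return expand(name), {key: expand(value) for key, value in labels.items()}
```

With the hypothetical statsd name `example_service.request.GET.200`, the pattern `example_service.request.*.*`, the name `example_service_requests_total` and the labels `{"method": "$1", "status": "$2"}`, the sketch yields the name unchanged plus the labels `method="GET"` and `status="200"`. The sketch only handles `*` as a whole component.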
- MediaWiki. Note that this namespace has multiple producers, not only MediaWiki. Some metrics come from statsv (e.g. MediaWiki.wikibase), some from reportupdater-queries (e.g. Mediawiki.CodeMirror)
- eventbus - decommissioned in T232122: Decomission eventlogging-service-eventbus and clean up related configs and code
- eventlogging.overall.inserted - from eventlogging.handlers sql_writer handler, will be deprecated T159170
Dependent on Service-Runner:
- aqs PR
- changeprop/cpjobqueue - scheduled to be moved to k8s PR
- eventstreams - scheduled to be moved to k8s PR
- eventgate
- kartotherian, tilerator, tileratorui -- PR
- mobileapps - some parts are moving to k8s PR
- service-template-node PR
- proton PR
- recommendation-api PR
- restbase - (parts?) moving to k8s PR
- hyperswitch PR
- citoid -- PR
- mathoid -- PR
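On the global aggregation / percentiles point: assuming it refers to percentiles computed across all hosts (as a central statsd does), the Python sketch below shows why per-host percentiles cannot simply be averaged, while per-host histogram buckets can be summed and then queried. The sample latencies and bucket bounds are invented, and `bucket_quantile` returns a bucket's upper bound rather than interpolating within the bucket as Prometheus's `histogram_quantile` does.

```python
import math


def percentile(values, q):
    """Nearest-rank quantile (0 < q <= 1) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def to_buckets(values, bounds):
    """Cumulative histogram: samples less than or equal to each bound."""
    return {b: sum(1 for v in values if v <= b) for b in bounds}


def merge_buckets(per_host):
    """Sum cumulative bucket counts across hosts (bounds must match)."""
    merged = {}
    for buckets in per_host:
        for bound, count in buckets.items():
            merged[bound] = merged.get(bound, 0) + count
    return merged


def bucket_quantile(buckets, q):
    """Smallest bound whose cumulative count covers quantile q.

    Assumes the largest bound covers every sample.
    """
    total = buckets[max(buckets)]
    for bound in sorted(buckets):
        if buckets[bound] >= q * total:
            return bound


# Invented latencies in milliseconds.
host_a = [10] * 98 + [900, 1000]  # busy host, mostly fast requests
host_b = [200, 300]               # quiet host, slow requests

naive = (percentile(host_a, 0.5) + percentile(host_b, 0.5)) / 2
exact = percentile(host_a + host_b, 0.5)

bounds = [50, 250, 1000]
merged = merge_buckets([to_buckets(host_a, bounds), to_buckets(host_b, bounds)])
estimate = bucket_quantile(merged, 0.5)
```

Here `naive` is 105.0 while the true global median `exact` is 10; the merged buckets place the median in the lowest bucket (`estimate` is 50), which is consistent with the true value. This is the reason services in this group would likely need histograms, or some other global aggregation, rather than per-instance summaries.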