As seen in T89846, txstatsd can't cope with our metric load anymore, and the graphite machine simply drops packets when txstatsd is maxed out (see also https://wikitech.wikimedia.org/wiki/Graphite/Scaling#performance:_change_statsd_daemon and related)
Aside from the practicalities of switching statsd daemons, I think it makes sense to offload statsd work to a local daemon running on localhost. The resulting metrics are then exposed via e.g. HTTP and can be scraped (i.e. collected/aggregated by scrapers) and pushed to metric storage (e.g. graphite) for consultation, graphs, etc.
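To make the idea concrete, here is a minimal sketch of such a local daemon: it aggregates statsd-style counters received over UDP on localhost and exposes the current totals as plain text over HTTP for a scraper to pull. All names, ports, and the output format are illustrative assumptions, not a proposal for the actual daemon or wire format we'd deploy.

```python
import socket
import threading
from collections import defaultdict
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process metric state; a real daemon would also
# handle timers/gauges, expiry, and persistence across restarts.
counters = defaultdict(float)
lock = threading.Lock()

def handle_packet(data: bytes) -> None:
    """Parse statsd lines like 'name:1|c' and aggregate counters locally."""
    for line in data.decode("utf-8", "replace").splitlines():
        try:
            name, rest = line.split(":", 1)
            value, mtype = rest.split("|", 1)
        except ValueError:
            continue  # drop malformed lines instead of crashing
        if mtype.split("|")[0] == "c":  # counter, possibly with |@rate suffix
            with lock:
                counters[name] += float(value)

def render_metrics() -> str:
    """Render aggregated state as plain text, one 'name value' per line."""
    with lock:
        return "".join(f"{k} {v}\n" for k, v in sorted(counters.items()))

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

def udp_loop(host="127.0.0.1", port=8125):
    """Receive statsd packets on localhost only; nothing leaves the box."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    while True:
        data, _ = sock.recvfrom(65535)
        handle_packet(data)

# To run as a daemon (not executed here):
#   threading.Thread(target=udp_loop, daemon=True).start()
#   HTTPServer(("127.0.0.1", 9125), MetricsHandler).serve_forever()
```

The point of the sketch is the split of responsibilities: the hot UDP path only does cheap local aggregation, while the expensive network work happens at scrape time, at whatever pace the scraper chooses.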
Turning from pushing metrics to pulling them provides some scaling and management advantages:
- no flood of incoming (unaggregated or otherwise) metrics at a central point; instead, metrics are collected at a pace set by the scrapers/pollers
- easier to aggregate into different groups (the scrapers can map a set of hosts to a certain cluster and aggregate/rewrite metrics accordingly)
  - e.g. host1, host2, host3 run service y: collect metrics from those hosts and aggregate/prefix them with "servicey"
- easy to have multiple consumers: just add another poller (per team, per service, etc.)
- free batching of metrics: a single "scrape"/poll returns all of a host's metrics in one TCP session
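The scraper side of the advantages above could look like the following sketch: given the plain-text bodies fetched from each host in a cluster, sum each metric across hosts and prefix the result with the cluster name. The "name value" line format, the function names, and the summing policy are assumptions for illustration only.

```python
from collections import defaultdict

def parse_metrics(text: str) -> dict:
    """Parse 'name value' lines as returned by a host's local metrics endpoint."""
    out = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        try:
            out[parts[0]] = float(parts[1])
        except ValueError:
            continue
    return out

def aggregate_cluster(cluster: str, per_host_texts: list) -> dict:
    """Sum each metric across hosts and rewrite it under the cluster prefix."""
    totals = defaultdict(float)
    for text in per_host_texts:
        for name, value in parse_metrics(text).items():
            totals[f"{cluster}.{name}"] += value
    return dict(totals)
```

For example, scraping host1/host2/host3 of service y and feeding the three response bodies to `aggregate_cluster("servicey", ...)` yields metrics like `servicey.requests` ready to be pushed to graphite. A per-team or per-service consumer is just another instance of this loop with its own host mapping.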
This is just an outline; a more detailed plan still needs to be drafted.