
scale statsd reporting/aggregation (plan)
Closed, Invalid · Public

Description

as seen on T89846, txstatsd can't cope with our metric load anymore and the graphite machine just drops packets when txstatsd is maxed out (see also https://wikitech.wikimedia.org/wiki/Graphite/Scaling#performance:_change_statsd_daemon and related)

aside from the practicalities of switching statsd daemon, I think it makes sense to offload statsd work to a local daemon running on localhost. The resulting metrics are then exposed via e.g. http and can be scraped (i.e. collected/aggregated by scrapers) and pushed to metric storage (e.g. graphite) for consultation, graphs, etc.

turning from pushing metrics to pulling metrics provides some scaling and management advantages:

  • no flood of incoming metrics (unaggregated or otherwise) at a central point; instead they are collected at a specified pace by scrapers/pollers
  • easier to aggregate into different groups (the scrapers can map a set of hosts to a certain cluster and aggregate/rewrite metrics accordingly)
    • e.g. host1 host2 host3 are running service y, collect metrics from those and aggregate/prefix them with "servicey"
  • easy to have multiple consumers: have another poller (per team, per service, etc)
  • free batching of metrics, a single "scrape" or poll for metrics returns all metrics in a single tcp session

this is just an outline; a more detailed plan needs to be drafted.
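
To make the pull model concrete, here is a minimal sketch of a per-host exporter (hypothetical code; names and port are illustrative, not a concrete proposal): the application bumps in-memory counters instead of emitting statsd packets, and a single http request returns every metric, so a scraper gets them all in one tcp session.

  # minimal sketch of a per-host metrics exporter (illustrative only):
  # counters live in memory and one plain-text HTTP response exposes all of them
  from http.server import BaseHTTPRequestHandler, HTTPServer
  import threading

  counters = {}                 # metric name -> monotonically increasing value
  lock = threading.Lock()

  def incr(name, value=1):
      # called by the application instead of sending a statsd UDP packet
      with lock:
          counters[name] = counters.get(name, 0) + value

  class MetricsHandler(BaseHTTPRequestHandler):
      def do_GET(self):
          with lock:
              body = "".join("%s %d\n" % kv for kv in sorted(counters.items()))
          self.send_response(200)
          self.send_header("Content-Type", "text/plain")
          self.end_headers()
          self.wfile.write(body.encode())

  if __name__ == "__main__":
      incr("requests_total")    # example instrumentation
      HTTPServer(("0.0.0.0", 9100), MetricsHandler).serve_forever()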

Event Timeline

fgiunchedi claimed this task.
fgiunchedi raised the priority of this task from to Medium.
fgiunchedi updated the task description. (Show Details)
fgiunchedi added projects: Grafana, acl*sre-team.
fgiunchedi added subscribers: fgiunchedi, Aklapper, ori.
fgiunchedi renamed this task from scale statsd reporting/aggregation to scale statsd reporting/aggregation (plan). Feb 20 2015, 10:59 AM
fgiunchedi set Security to None.

@fgiunchedi and I had a long involved discussion about options today :) I think we both came away with a bit to think on.

@chasemp: Do you mind summarizing the main questions / options that you identified?

btw, my mental model (poorly described above) can be summarized as:

  • stop having every machine and client push metrics independently to a single central point
  • instead, expose those metrics on the client itself via e.g. http
    • meaning that a single http call would return all metrics and their associated values as they currently are
    • no state/backfill is kept by the client
    • by "client" I broadly mean anything that would serve metrics over http, i.e. an address + port

  • poll the clients at whichever interval is desired
    • a crucial difference is that in this case no rates would be pushed; ever-increasing counters take their place, and rates can be calculated afterwards (see the sketch below)
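
For example (numbers made up), turning two polls of an increasing counter into a rate is just the delta divided by the polling interval:

  # rate from two samples of a monotonically increasing counter
  # (values and the 60s interval are made up for illustration)
  t0, c0 = 0.0, 1200       # first poll: time (s), counter value
  t1, c1 = 60.0, 1500      # second poll, one minute later
  rate = (c1 - c0) / (t1 - t0)   # 5.0 events/second
  print(rate)

(the poller also has to detect counter resets, e.g. after a process restart, but that becomes the consumer's problem rather than the client's)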

a better and longer description of what I'm talking about is http://prometheus.io/docs/concepts/data_model/ and http://prometheus.io/docs/concepts/metric_types/

also, @chasemp was wondering whether our current usage of (tx)statsd makes sense. I collected one minute of statsd traffic on graphite1001; these are a few of the top senders:

  1036 ms-be2003.codfw.wmnet
  1036 ms-be2004.codfw.wmnet
  1036 ms-be2005.codfw.wmnet
  1036 ms-be2006.codfw.wmnet
  1036 ms-be2010.codfw.wmnet
  1036 ms-be2011.codfw.wmnet
  1059 ms-be1013.eqiad.wmnet
  1060 ms-be1002.eqiad.wmnet
  1060 ms-be1004.eqiad.wmnet
  1060 ms-be1008.eqiad.wmnet
  1060 ms-be1009.eqiad.wmnet
  1060 ms-be1010.eqiad.wmnet
  1060 ms-be1014.eqiad.wmnet
  1061 ms-be1003.eqiad.wmnet
  1061 ms-be1005.eqiad.wmnet
  1061 ms-be1006.eqiad.wmnet
  1061 ms-be1007.eqiad.wmnet
  1061 ms-be1011.eqiad.wmnet
  1062 ms-be1001.eqiad.wmnet
  1062 ms-be1015.eqiad.wmnet
  1063 ms-be2008.codfw.wmnet
  1063 ms-be2012.codfw.wmnet
  1463 tmh1001.eqiad.wmnet
  1466 tmh1002.eqiad.wmnet
  1882 labstore1001.eqiad.wmnet
  2118 gallium.wikimedia.org
  2483 ocg1002.eqiad.wmnet
  2536 ocg1001.eqiad.wmnet
  2822 mw1009.eqiad.wmnet
  2823 analytics1018.eqiad.wmnet
  2963 mw1004.eqiad.wmnet
  2991 mw1008.eqiad.wmnet
  3058 analytics1012.eqiad.wmnet
  3058 analytics1021.eqiad.wmnet
  3058 analytics1022.eqiad.wmnet
  3091 mw1010.eqiad.wmnet
  3324 mw1005.eqiad.wmnet
  3326 mw1007.eqiad.wmnet
  3390 mw1014.eqiad.wmnet
  3445 mw1011.eqiad.wmnet
  3454 mw1012.eqiad.wmnet
  3455 mw1006.eqiad.wmnet
  3468 mw1001.eqiad.wmnet
  3517 mw1015.eqiad.wmnet
  3554 mw1016.eqiad.wmnet
  3557 mw1002.eqiad.wmnet
  3560 mw1013.eqiad.wmnet
  3561 mw1003.eqiad.wmnet
 21218 hafnium.wikimedia.org
 24238 restbase1003.eqiad.wmnet
372202 restbase1006.eqiad.wmnet
600128 restbase1004.eqiad.wmnet
692100 restbase1001.eqiad.wmnet
776171 restbase1005.eqiad.wmnet
788920 restbase1002.eqiad.wmnet
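
For the record, a per-sender count like the one above can be reproduced roughly like this, assuming a one-minute capture of udp port 8125 traffic was saved to a pcap file (the file name and the use of scapy are illustrative, not how the numbers above were actually produced):

  # count statsd packets per sender in a capture file
  from collections import Counter
  from scapy.all import rdpcap, IP

  senders = Counter(pkt[IP].src for pkt in rdpcap("statsd.pcap") if IP in pkt)
  for count, src in sorted((c, s) for s, c in senders.items()):
      print(count, src)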

Another option:

  • Use the statsd proxy to hash packets to specific backend instances, so that each backend statsd exclusively aggregates some set of metrics. The proxy uses node's cluster module to leverage all cores of a box. It can also be scaled horizontally with LVS in front if that becomes necessary. (A rough sketch of the hashing idea follows after the list of advantages.)
  • Use plain etsy statsd or statsite for the backend statsds, and have those feed their respective metrics into graphite.

Advantages I see with this solution:

  • simple; existing code & statsd puppetization
  • avoids the aggregation issues around statsd stacking or multiple statsds feeding into the same metric
  • horizontally scalable
  • preserves ease of use
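
To illustrate the hashing idea (this is not the node statsd proxy's code, just a rough sketch; backend addresses and ports are made up), the proxy derives the backend from the metric name, so every sample for a given metric always lands on the same statsd instance and aggregation stays correct:

  # sketch of routing statsd samples to backends by hashing the metric name
  # (illustrative only; backends and ports are made up)
  import socket
  import zlib

  BACKENDS = [("127.0.0.1", port) for port in range(8126, 8136)]
  sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

  def forward(packet: bytes):
      # this sketch splits multi-metric packets on newline and routes each
      # sample by its metric name; the real daemons differ on \n handling
      for line in packet.splitlines():
          name = line.split(b":", 1)[0]
          backend = BACKENDS[zlib.crc32(name) % len(BACKENDS)]
          sock.sendto(line, backend)

  forward(b"frontend.requests:1|c\nfrontend.latency:32|ms")

A production proxy would use consistent hashing rather than a plain modulo, so that adding or removing a backend only remaps a fraction of the metric names.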

https://github.com/deseretdigital/statsite-proxy exists too, so the same arguments apply, modulo the existing Puppetization -- but Puppetizing Statsite is easy.

Fwiw, I tested the statsd proxy on ruthenium, with txstatsd as a backend, and it seems to be working fine. I disabled health checks for now, although there seems to be some support for echo messages in txstatsd, so we could re-add them.

CPU usage is about 40% in each of the 2 workers with all the restbase load-test output hitting it (more if no IP is specified, as there's no DNS caching by default).

This is less efficient than Ori's statsdlb, which uses ~40% cpu on one core to load-balance the same traffic (actually restbase -> statsdlb -> statsd proxy -> txstatsd).

So, several options. The statsd proxy looks more scalable at this point as it can spawn more workers, but for current traffic levels statsdlb should be sufficient too.

both seem fine to me as an approach to get us out of the woods in the short term

correction: the statsd proxy won't split on \n any differently than statsdlb does. I think we're good to go with statsdlb, @ori (context: I'll be back from vacation on March 16th, so if you want to take a stab at it before then, that works for me)

OK, we're out of the woods.

statsdlb is provisioned on graphite1001. It is listening on port 8125 and forwarding to ten txstatsd instances on ports 8126 to 8135.

Making the txstatsd puppet module capable of managing multiple instances requires more effort than I was willing to invest in a stack that is quite possibly moribund, so I removed the txstatsd role from graphite1001, and wrote some custom Puppet code for managing multiple txstatsd instances into the statsdlb module.

I added a script for controlling the txstatsd backends (txstatsdctl -- now say that three times fast!) and service checks for both the txstatsd backends and the statsdlb proxy.

The individual txstatsd backends are all at 80-90% CPU so even with 10 instances we're still operating without a ton of headroom. When Filippo gets back, we'll iterate on the entire stack to make it scalable. In the meantime, the current setup should hold up well.

fgiunchedi changed the task status from Open to Stalled. Apr 16 2015, 8:54 AM

stalling this for now; making graphite more reliable is the priority, tracked in T85451

https://gerrit.wikimedia.org/r/#/c/226639/6 adds a (somewhat awkward) workaround via StatsD request sampling (in MediaWiki). Per IRC discussion with Ori, use of this workaround should be generally avoided, at least until we understand better the exact limitations of the current setup.
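
For context, statsd request sampling means the client sends only a fraction of events and annotates each packet with the sample rate, so the aggregator can scale the counts back up. A rough sketch of the idea (this is not the MediaWiki change itself; names and addresses are illustrative):

  # sketch of client-side statsd sampling: only ~10% of increments are actually
  # sent, and the |@0.1 annotation lets the statsd server scale counts back up
  import random
  import socket

  sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

  def sampled_incr(name, sample_rate=0.1, addr=("127.0.0.1", 8125)):
      if random.random() < sample_rate:
          sock.sendto(("%s:1|c|@%s" % (name, sample_rate)).encode(), addr)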

In T89857#1519939, @Tgr wrote:

https://gerrit.wikimedia.org/r/#/c/226639/6 adds a (somewhat awkward) workaround via StatsD request sampling (in MediaWiki). Per IRC discussion with Ori, use of this workaround should be generally avoided, at least until we understand better the exact limitations of the current setup.

the biggest limitation afaict for inbound statsd traffic is the fact that we are receiving it on a single machine for aggregation purposes. we're scaling statsd on graphite1001 with the statsdlb that @ori wrote, which consistent-hashes metrics to multiple statsite processes that then do the aggregation.

on the application side we're often sending a single sample per udp packet, which is obviously inefficient: it involves e.g. a dns resolution and a separate network write for every single sample (see the batching sketch below)

edit: add application side
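
A sketch of what client-side batching could look like (a hypothetical helper, not existing code; host, port and buffer size are placeholders): resolve the statsd address once and pack several samples into one udp datagram.

  # sketch of client-side statsd batching: resolve the server address once and
  # send several samples per UDP datagram instead of one packet per sample
  import socket

  class BatchedStatsd:
      def __init__(self, host="statsd.example.net", port=8125, max_bytes=1400):
          # host is a placeholder; resolve it once instead of once per sample
          self.addr = (socket.gethostbyname(host), port)
          self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
          self.buf = []
          self.max_bytes = max_bytes

      def incr(self, name, value=1):
          self.buf.append("%s:%d|c" % (name, value))
          if len("\n".join(self.buf)) >= self.max_bytes:
              self.flush()

      def flush(self):
          # one datagram carries many newline-separated statsd samples
          if self.buf:
              self.sock.sendto("\n".join(self.buf).encode(), self.addr)
              self.buf = []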

We're sunsetting Graphite (see T228380: Tech debt: sunsetting of Graphite), so resolving this as invalid.