
Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats
Closed, Resolved · Public

Description

Background

service-runner[1] can report Node.js garbage collection statistics over the statsd protocol, using the gc-stats[2] node module. The module is a native (binary) addon and thus tends to be enabled only in production, where we may be called to debug things and need the extra observability.

Node.js GC stats are granular enough to report GC cycle durations down to nanosecond resolution. service-runner wraps this up nicely in its heapwatch.js component.
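
To make the rest of the discussion concrete, here is a minimal illustrative sketch of that reporting path; it is not the actual heapwatch.js code, and the hot-shots client plus the 'cxserver.' prefix are stand-ins for service-runner's real statsd plumbing:

  // Minimal illustrative sketch, not the actual heapwatch.js code.
  // hot-shots stands in for service-runner's real statsd client here.
  const StatsD = require('hot-shots');
  const gcStats = require('gc-stats')();
  const statsd = new StatsD({ prefix: 'cxserver.' }); // prefix chosen for the example

  // gc-stats gctype values (approximately): 1 = scavenge (minor),
  // 2 = mark/sweep/compact (major), 4 = incremental marking
  const gcTypeName = (gctype) =>
    ({ 1: 'minor', 2: 'major', 4: 'incremental' }[gctype] || 'other');

  gcStats.on('stats', (stats) => {
    // stats.pause is the pause duration in *nanoseconds*, yet it is sent
    // verbatim as a statsd timing, which the protocol defines as milliseconds
    statsd.timing('gc.' + gcTypeName(stats.gctype), stats.pause);
  });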

Issue

However, per my perhaps flawed understanding, back when this was originally implemented we needed percentiles as well as max/min values. The way chosen to obtain that behavior was to abuse the statsd timing metric type. However, per https://github.com/wikimedia/service-runner/commit/b64c697cfe60f618ead890853c3ec012d0c85445 the reported value is in nanoseconds, which contradicts the statsd timing metric type, which is defined in milliseconds. This leads to a confusing state where the reported value is in nanoseconds albeit suffixed as milliseconds.

This is very evident when tcpdumping the protocol, where you can see output like the following:

cxserver.gc.minor:774416|ms
cxserver.gc.minor:4254832|ms
cxserver.gc.incremental:2101997|ms
cxserver.gc.major:6893031|ms
cxserver.gc.incremental:2326499|ms
cxserver.gc.major:92628052|ms
cxserver.gc.incremental:2311954|ms
cxserver.gc.major:5829335|ms
cxserver.gc.minor:1951233|ms

Lacking the internal knowledge that these are in reality nanoseconds, one immediately assumes that we have GC cycles taking well over a kilosecond, which is clearly false.
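
For example, reinterpreting two of the values above with the correct unit:

  cxserver.gc.major:6893031|ms   -> actually 6893031 ns = 6893031 / 10^6 ms ≈ 6.9 ms
  cxserver.gc.minor:774416|ms    -> actually 774416 ns ≈ 0.77 ms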

Adding to the confusion is the fact that we've adopted prometheus-statsd-exporter[3], which, in order to stay consistent with Prometheus conventions, divides the statsd timing metric type by 1000 in order to get seconds[4], [5].

Current state

Given that:

  1. we are moving forward with deprecating the posting of statsd metrics to Graphite and fully adopting prometheus-statsd-exporter,
  2. Prometheus allows calculating percentiles and max/min values easily, especially if the metric type is a histogram,
  3. prometheus-statsd-exporter allows arbitrarily mapping statsd metric types to Prometheus metric types,

we could conceivably move the percentile calculation to Prometheus and stop abusing the statsd timing metric type.

Options

I see a number of options:

  1. We leave things as is
  2. Divide by 10^6 in service-runner before reporting the number.
  3. Switch the metric type to a gauge
  4. Switch the metric type to a counter, with each service-runner instance reporting the cumulative GC time over its uptime
  5. Start implementing native prometheus metrics support in service-runner

There are a number of pros and cons for each approach:

We leave things as is

Pros:

  • Nothing to be done, except perhaps documenting this a bit better

Cons

  • People will still occasionally be confused, and we will keep having to explain the weird behavior and adapt our graphs and Prometheus metrics

Divide by 10^6 in service-runner before reporting the number.

Pros

  • We decrease confusion and actually correctly use the metric type.
  • prometheus-statsd-exporter reports seconds correctly, meaning no need to update anything related to it

Cons

  • We lose granularity, as quite a few GC cycles seem to be in the 0.5-5ms range.
  • Some work needed to adopt the newer approach; dashboards to be updated, etc. (a sketch of the change follows below)
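
As a sketch of option 2, reusing the assumed statsd client and gcTypeName helper from the illustrative sketch in the Background section, the only change would be the unit conversion:

  // Option 2 sketch: convert nanoseconds to milliseconds before reporting,
  // so the statsd timing type is used with its intended unit
  gcStats.on('stats', (stats) => {
    statsd.timing('gc.' + gcTypeName(stats.gctype), stats.pause / 1e6);
  });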

Switch the metric type to a gauge

Pros

  • We decrease confusion
  • We have a metric type that is unsuffixed and hence can be used in any way we want
  • We can calculate percentiles/mins/maxes/you name it using prometheus histograms/summaries

Cons

  • If we need to gracefully roll this out, some work needs to be done, perhaps introducing a new metric type before killing the old one
  • statsd-exporter mappings will have to be written to move GC stats to histograms, to retain the ability to calculate percentiles
  • Dashboards will have to be updated

Switch the metric type to a counter, with each service-runner instance reporting the cumulative GC time over its uptime

Pros

  • We decrease confusion
  • We have a metric type that is unsuffixed and hence can be used in any way we want
  • We can calculate percentiles/mins/maxes/you name it using prometheus histograms/summaries

Cons

  • If we need to gracefully roll this out, some work needs to be done, perhaps introducing a new metric type before killing the old one
  • statsd-exporter mappings will have to be written to move GC stats to histograms, to retain the ability to calculate percentiles
  • Dashboards will have to be updated
  • We need to keep state about GC timings in the application. This will probably require some rearchitecting of service-runner, and given the ephemeral nature of the workers it will mean that the statsd stats will need to be reported by the master (see the sketch below).
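
A rough sketch of option 4, again reusing the assumed statsd client and gcTypeName helper from the Background sketch; the worker-to-master plumbing is only hinted at:

  // Option 4 sketch: keep cumulative GC time per type in the (long-lived) master
  // and report it as a counter. Workers would forward their gc-stats events to
  // the master, e.g. via process.send(); that plumbing is omitted here.
  const cumulativeNs = { minor: 0, major: 0, incremental: 0, other: 0 };

  function onWorkerGcEvent(gctype, pauseNs) {
    cumulativeNs[gcTypeName(gctype)] += pauseNs;
  }

  // Periodically flush the time accumulated since the last flush as counter
  // increments (in ms); statsd/Prometheus then maintain the running total
  setInterval(() => {
    for (const type of Object.keys(cumulativeNs)) {
      statsd.increment('gc.' + type + '.total_ms', Math.round(cumulativeNs[type] / 1e6));
      cumulativeNs[type] = 0;
    }
  }, 60000);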

Start implementing native prometheus metrics support in service-runner

Pros

  • We decrease confusion
  • We have native prometheus support
  • We have a metric type that is unsuffixed and hence can be used in any way we want
  • We can calculate percentiles/mins/maxes/you name it using prometheus histograms/summaries
  • Can be rolled out gracefully alongside the old statsd system
  • No prometheus-statsd-exporter needed.

Cons

  • Dashboards will have to be updated
  • We need to keep state about GC timings in the application. This will probably require some rearchitecting of service-runner, and given the ephemeral nature of the workers it will probably mean that the stats will need to be stored and served by the master.
  • It's out of scope for this task and it's a sizable amount of work.
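
For completeness, a minimal sketch of what option 5 could look like using the prom-client library; the metric name and buckets are made up for the example and this is not necessarily how service-runner would implement it:

  // Option 5 sketch: observe GC pauses into a native Prometheus histogram
  const client = require('prom-client');
  const gcStats = require('gc-stats')();

  const gcPause = new client.Histogram({
    name: 'service_runner_gc_pause_seconds', // hypothetical metric name
    help: 'Node.js GC pause duration in seconds',
    labelNames: ['gctype'],
    buckets: [0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5]
  });

  gcStats.on('stats', (stats) => {
    // Convert nanoseconds to seconds, following Prometheus base-unit conventions
    gcPause.labels(String(stats.gctype)).observe(stats.pause / 1e9);
  });

  // The master would expose client.register.metrics() on an HTTP /metrics endpoint;
  // with cluster workers, prom-client's AggregatorRegistry can aggregate per-worker metrics.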

[1] https://github.com/wikimedia/service-runner
[2] https://www.npmjs.com/package/gc-stats
[3] https://github.com/prometheus/statsd_exporter
[4] https://github.com/prometheus/statsd_exporter/blob/e3d6050616f7e34c0c74fa03fedb4e119e6f37b5/exporter.go#L472
[5] https://github.com/prometheus/statsd_exporter/blob/e3d6050616f7e34c0c74fa03fedb4e119e6f37b5/exporter.go#L488

Event Timeline

akosiaris renamed this task from Re-evaluate service-runner's (ab) of statsd timing metric for nodejs GC stats to Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats. May 8 2019, 3:01 PM
Pchelolo edited projects, added Services (later); removed Services.
Pchelolo subscribed.

Thank you for an impressive level of detail :) There's a bunch of other places where we abuse the timing metric within services, exactly because we've needed percentiles, so the decision we make here should probably be adopted elsewhere.

I personally think that "3. Switch the metric type to a gauge" is the best way forward and it's the most reasonable compromise. Implementing native Prometheus support is the perfect solution, but it's a fairly hard project and would also require rewriting a lot of service code since it will likely be API-incompatible with current metrics object in JS. I believe we should do this eventually, just maybe not in this moment.

Using gauges here makes us dependent on prometheus-statsd-exporter. For services already switched to k8s that's fine, but for services not yet on k8s that would break the metrics - we wouldn't be able to get percentiles anymore. So maybe wait till all the services are on k8s?

Thank you for an impressive level of detail :) There's a bunch of other places where we abuse the timing metric within services, exactly because we've needed percentiles, so the decision we make here should probably be adopted elsewhere.

TIL. Thanks for that info. And yes, I agree.

I personally think that "3. Switch the metric type to a gauge" is the best way forward and it's the most reasonable compromise.

+1

Implementing native Prometheus support is the perfect solution, but it's a fairly hard project and would also require rewriting a lot of service code since it will likely be API-incompatible with current metrics object in JS. I believe we should do this eventually, just maybe not in this moment.

That's my opinion as well.

Using gauges here makes us dependent on prometheus-statsd-exporter. For services already switched to k8s that's fine, but for services not yet on k8s that would break the metrics - we wouldn't be able to get percentiles anymore. So maybe wait till all the services are on k8s?

Why would that be? We have profile::prometheus::statsd_exporter in puppet, which runs the exact same software (up to the version) on our puppet-managed hosts/services as well. We would need some mapping rules [1], but we are going to need those for the move to k8s anyway. The caveat would be we would have to update all dashboards for all services residing on the same host cluster (scb being the problem here) pretty much on the same timeframe, mostly due to the fact statsd configuration is the same for all services on the same host.

In any case, I don't think we are in any kind of rush. We've survived with this for years, we can survive a bit more.

[1] https://github.com/prometheus/statsd_exporter#metric-mapping-and-configuration

The caveat would be we would have to update all dashboards for all services residing on the same host cluster (scb being the problem here) pretty much on the same timeframe, mostly due to the fact statsd configuration is the same for all services on the same host.

Given that GC metrics and the other metrics that abuse timings are mostly informational and there are no alerts on them, I think it's not a problem to break them for a little bit while the dashboards are being updated.

Change 509811 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] cxserver: Switch GC stats back to microseconds

https://gerrit.wikimedia.org/r/509811

Change 509811 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] cxserver: Switch GC stats back to microseconds

https://gerrit.wikimedia.org/r/509811

Change 509836 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] eventgate: Switch GC metric to microseconds, update buckets

https://gerrit.wikimedia.org/r/509836

Change 509836 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] eventgate: Switch GC metric to microseconds, update buckets

https://gerrit.wikimedia.org/r/509836

Change 517803 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] citoid, mathoid, termbox: Switch GC metric to microseconds

https://gerrit.wikimedia.org/r/517803

Change 517803 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] citoid, mathoid, termbox: Switch GC metric to microseconds

https://gerrit.wikimedia.org/r/517803

Mentioned in SAL (#wikimedia-operations) [2019-06-19T08:21:19Z] <akosiaris> upgrade citoid, mathoid, termbox to latest chart releases to address the GC metric naming issue T220709 T222795

I believe this is now (partially?) done, and service-runner supports Prometheus natively these days. What do you think @akosiaris ?

I believe this is now (partially?) done, and service-runner supports Prometheus natively these days. What do you think @akosiaris ?

I think we require @Ottomata's input on https://github.com/wikimedia/service-runner/pull/245 first. But yes, it does look very close to being resolved.

I had to go back and check, but both eventgate and eventstreams are using service-runner's Prometheus support directly, and are no longer using prometheus-statsd-exporter.

T205870: Fully migrate producers off statsd

Oh, is that not related? Anyway, I'm not aware of any alerts on GC stats, and at the very worst we'll have to adjust the grafana panel next time we look at it.

Proceed!

Oh, is that not related? Anyway, I'm not aware of any alerts on GC stats, and at the very worst we'll have to adjust the grafana panel next time we look at it.

Proceed!

Perfect, thanks!

akosiaris claimed this task.

The patch has been merged but there isn't much point in tracking the adoption per service in this task, so resolving. We should open a different tracking task to encourage the services to be upgraded to service-runner 3.1.0.