Background
service-runner[1] is able to report via the statsd protocol nodejs garbage collection statistics via the gc-stats[2] node module. The module is binary and thus tends to be used only in production where we can be called to debug things and need more observability.
Nodejs GC stats is granular enough to be able to report GC cycle down nanosecond resolution. Service-runner nicely wraps it up in the heapwatch.js part of it.
Issue
However, and per my perhaps flawed understanding, back when that was originally implemented, it was required that percentiles as well as max/mins. The way chosen to obtain that behavior was via an abuse of the timing metric type of statsd. However, per https://github.com/wikimedia/service-runner/commit/b64c697cfe60f618ead890853c3ec012d0c85445 the reported value is in nanoseconds which is in contradiction of the timing statsd metric type which is in milliseconds. This leads into a confusing state where the metric+value is nanoseconds ableit suffixed as milliseconds.
This is very evident when tcpdumping the protocol where you can get thing like the following
cxserver.gc.minor:774416|ms cxserver.gc.minor:4254832|ms cxserver.gc.incremental:2101997|ms cxserver.gc.major:6893031|ms cxserver.gc.incremental:2326499|ms cxserver.gc.major:92628052|ms cxserver.gc.incremental:2311954|ms cxserver.gc.major:5829335|ms cxserver.gc.minor:1951233|ms
Lacking the internal knowledge that this are in reality nanoseconds, one just immediately thinks that we have GC cycles that take well over 1 kiloseconds, which is clearly false.
Adding to the confusion is the fact that we 've adopted prometheus-statsd-exporter[3], which, in order to stay consistent with the prometheus standards divides timing statsd metric type with 1000 in order to get seconds[4], [5].
Current state
Given that:
- we are moving forward with the deprecation of posting statsd metrics to graphite and moving into fully adopting prometheus-statsd-exporter
- That prometheus allows calculating percentiles and max/mins easily, especially if the metric type is a histogram
- That prometheus-statsd-exporter allows arbitrarily mapping statsd metric types to prometheus statsd types
we could conceivably move the percentile calculation part in prometheus and stop abusing the timing statsd metric type.
Options
I see a number of options:
- We leave things as is
- Divide with 10^6 in service-runner before reporting the number.
- Switch the metric type to a gauge
- Switch the metric type to a counter, with each service-runner instance reporting the cummulative GC time for each uptime
- Start implementing native prometheus metrics support in service-runner
There are a number of pros and cons for each approach:
We leave things as is
Pros:
- Noting to be done, except perhaps document this a bit better
Cons
- People will still be occasionally confused, we will be explaining the weird behavior and adapt our graphs and prometheus metrics
Divide with 10^6 in service-runner before reporting the number.
Pros
- We decrease confusion and actually correctly use the metric type.
- prometheus-statsd-exporter reports seconds correctly, meaning no need to update anything related to it
Cons
- We lose granularity as quite a few GC cycles seems to be on the 0.5-5ms mark.
Some work to adopt the newer approach, dashboards to be updated etc.
Switch the metric type to a gauge
Pros
- We decrease confusion
- We have a metric type that is unsuffixed and hence can be used in any way we want
- We can calculate percentiles/mins/maxes/you name it using prometheus histograms/summaries
Cons
- If we need to gracefully roll this out, some work needs to be done, perhaps introducing a new metric type before killing the old one
- statsd-exporter mappings will have to written to move GC stats to histograms to retain the ability to calculate percentiles
- Dashboards will have to be updated
Switch the metric type to a counter, with each service-runner instance reporting the cummulative GC time for each uptime
Pros
- We decrease confusion
- We have a metric type that is unsuffixed and hence can be used in any way we want
- We can calculate percentiles/mins/maxes/you name it using prometheus histograms/summaries
Cons
- If we need to gracefully roll this out, some work needs to be done, perhaps introducing a new metric type before killing the old one
- statsd-exporter mappings will have to written to move GC stats to histograms to retain the ability to calculate percentiles
- Dashboards will have to be updated
- We need to have state about gc timings in the application. This will require some rearchitecting of service-runner probably and given the ephemerous nature of the workers it will mean the statsd stats will need to be reporter by the master.
Start implementing native prometheus metrics support in service-runner
Pros
- We decrease confusion
- We have native prometheus support
- We have a metric type that is unsuffixed and hence can be used in any way we want
- We can calculate percentiles/mins/maxes/you name it using prometheus histograms/summaries
- Can be rolled out gracefully alongside the old statsd system
- No prometheus-statsd-exporter need.
Cons
- Dashboards will have to be updated
- We need to have state about gc timings in the application. This will require some rearchitecting of service-runner probably and given the ephemerous nature of the workers it will probably mean that statsd will need to be stored and served by the master.
- It's out of scope of this task and it's some large work.
[1] https://github.com/wikimedia/service-runner
[2] https://www.npmjs.com/package/gc-stats
[3] https://github.com/prometheus/statsd_exporter
[4] https://github.com/prometheus/statsd_exporter/blob/e3d6050616f7e34c0c74fa03fedb4e119e6f37b5/exporter.go#L472
[5] https://github.com/prometheus/statsd_exporter/blob/e3d6050616f7e34c0c74fa03fedb4e119e6f37b5/exporter.go#L488