
Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats
Closed, Resolved · Public

Description

Background

service-runner[1] can report Node.js garbage collection statistics over the statsd protocol, using the gc-stats[2] node module. The module is a native (binary) addon and thus tends to be enabled only in production, where we may be called to debug things and need the extra observability.

Node.js GC stats are granular enough to report GC cycle durations down to nanosecond resolution. service-runner wraps this up nicely in its heapwatch.js component.
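
To make the rest of the discussion concrete, here is a minimal illustrative sketch of that reporting path; it is not the actual heapwatch.js code, and the hot-shots client plus the 'cxserver.' prefix are stand-ins for service-runner's real statsd plumbing:

  // Minimal illustrative sketch, not the actual heapwatch.js code.
  // hot-shots stands in for service-runner's real statsd client here.
  const StatsD = require('hot-shots');
  const gcStats = require('gc-stats')();
  const statsd = new StatsD({ prefix: 'cxserver.' }); // prefix chosen for the example

  // gc-stats gctype values (approximately): 1 = scavenge (minor),
  // 2 = mark/sweep/compact (major), 4 = incremental marking
  const gcTypeName = (gctype) =>
    ({ 1: 'minor', 2: 'major', 4: 'incremental' }[gctype] || 'other');

  gcStats.on('stats', (stats) => {
    // stats.pause is the pause duration in *nanoseconds*, yet it is sent
    // verbatim as a statsd timing, which the protocol defines as milliseconds
    statsd.timing('gc.' + gcTypeName(stats.gctype), stats.pause);
  });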

Issue

However, per my perhaps flawed understanding, back when this was originally implemented we needed percentiles as well as max/min values. The way chosen to obtain that behavior was to abuse the statsd timing metric type. However, per https://github.com/wikimedia/service-runner/commit/b64c697cfe60f618ead890853c3ec012d0c85445 the reported value is in nanoseconds, which contradicts the statsd timing metric type, which is defined in milliseconds. This leads to a confusing state where the reported value is in nanoseconds albeit suffixed as milliseconds.

This is very evident when tcpdumping the protocol, where you can see output like the following:

cxserver.gc.minor:774416|ms
cxserver.gc.minor:4254832|ms
cxserver.gc.incremental:2101997|ms
cxserver.gc.major:6893031|ms
cxserver.gc.incremental:2326499|ms
cxserver.gc.major:92628052|ms
cxserver.gc.incremental:2311954|ms
cxserver.gc.major:5829335|ms
cxserver.gc.minor:1951233|ms

Lacking the internal knowledge that these are in reality nanoseconds, one immediately assumes that we have GC cycles taking well over a kilosecond, which is clearly false.
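
For example, reinterpreting two of the values above with the correct unit:

  cxserver.gc.major:6893031|ms   -> actually 6893031 ns = 6893031 / 10^6 ms ≈ 6.9 ms
  cxserver.gc.minor:774416|ms    -> actually 774416 ns ≈ 0.77 ms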

Adding to the confusion is the fact that we've adopted prometheus-statsd-exporter[3], which, in order to stay consistent with Prometheus conventions, divides the statsd timing metric type by 1000 in order to get seconds[4], [5].

Current state

Given that:

  1. we are moving forward with deprecating the posting of statsd metrics to Graphite and fully adopting prometheus-statsd-exporter,
  2. Prometheus allows calculating percentiles and max/min values easily, especially if the metric type is a histogram,
  3. prometheus-statsd-exporter allows arbitrarily mapping statsd metric types to Prometheus metric types,

we could conceivably move the percentile calculation to Prometheus and stop abusing the statsd timing metric type.

Options

I see a number of options:

  1. We leave things as is
  2. Divide by 10^6 in service-runner before reporting the number.
  3. Switch the metric type to a gauge
  4. Switch the metric type to a counter, with each service-runner instance reporting the cumulative GC time over its uptime
  5. Start implementing native prometheus metrics support in service-runner

There are a number of pros and cons for each approach:

We leave things as is

Pros:

  • Nothing to be done, except perhaps documenting this a bit better

Cons

  • People will still occasionally be confused, and we will keep having to explain the weird behavior and adapt our graphs and Prometheus metrics

Divide by 10^6 in service-runner before reporting the number.

Pros

  • We decrease confusion and actually correctly use the metric type.
  • prometheus-statsd-exporter reports seconds correctly, meaning no need to update anything related to it

Cons

  • We lose granularity, as quite a few GC cycles seem to be in the 0.5-5ms range.
  • Some work needed to adopt the newer approach; dashboards to be updated, etc. (a sketch of the change follows below)
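
As a sketch of option 2, reusing the assumed statsd client and gcTypeName helper from the illustrative sketch in the Background section, the only change would be the unit conversion:

  // Option 2 sketch: convert nanoseconds to milliseconds before reporting,
  // so the statsd timing type is used with its intended unit
  gcStats.on('stats', (stats) => {
    statsd.timing('gc.' + gcTypeName(stats.gctype), stats.pause / 1e6);
  });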

Switch the metric type to a gauge

Pros

  • We decrease confusion
  • We have a metric type that is unsuffixed and hence can be used in any way we want
  • We can calculate percentiles/mins/maxes/you name it using prometheus histograms/summaries

Cons

  • If we need to gracefully roll this out, some work needs to be done, perhaps introducing a new metric type before killing the old one
  • statsd-exporter mappings will have to be written to move GC stats to histograms, to retain the ability to calculate percentiles
  • Dashboards will have to be updated

Switch the metric type to a counter, with each service-runner instance reporting the cumulative GC time over its uptime

Pros

  • We decrease confusion
  • We have a metric type that is unsuffixed and hence can be used in any way we want
  • We can calculate percentiles/mins/maxes/you name it using prometheus histograms/summaries

Cons

  • If we need to gracefully roll this out, some work needs to be done, perhaps introducing a new metric type before killing the old one
  • statsd-exporter mappings will have to be written to move GC stats to histograms, to retain the ability to calculate percentiles
  • Dashboards will have to be updated
  • We need to keep state about GC timings in the application. This will probably require some rearchitecting of service-runner, and given the ephemeral nature of the workers it will mean that the statsd stats will need to be reported by the master (see the sketch below).
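
A rough sketch of option 4, again reusing the assumed statsd client and gcTypeName helper from the Background sketch; the worker-to-master plumbing is only hinted at:

  // Option 4 sketch: keep cumulative GC time per type in the (long-lived) master
  // and report it as a counter. Workers would forward their gc-stats events to
  // the master, e.g. via process.send(); that plumbing is omitted here.
  const cumulativeNs = { minor: 0, major: 0, incremental: 0, other: 0 };

  function onWorkerGcEvent(gctype, pauseNs) {
    cumulativeNs[gcTypeName(gctype)] += pauseNs;
  }

  // Periodically flush the time accumulated since the last flush as counter
  // increments (in ms); statsd/Prometheus then maintain the running total
  setInterval(() => {
    for (const type of Object.keys(cumulativeNs)) {
      statsd.increment('gc.' + type + '.total_ms', Math.round(cumulativeNs[type] / 1e6));
      cumulativeNs[type] = 0;
    }
  }, 60000);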

Start implementing native prometheus metrics support in service-runner

Pros

  • We decrease confusion
  • We have native prometheus support
  • We have a metric type that is unsuffixed and hence can be used in any way we want
  • We can calculate percentiles/mins/maxes/you name it using prometheus histograms/summaries
  • Can be rolled out gracefully alongside the old statsd system
  • No prometheus-statsd-exporter needed.

Cons

  • Dashboards will have to be updated
  • We need to keep state about GC timings in the application. This will probably require some rearchitecting of service-runner, and given the ephemeral nature of the workers it will probably mean that the stats will need to be stored and served by the master.
  • It's out of scope for this task and it's a sizable amount of work.
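
For completeness, a minimal sketch of what option 5 could look like using the prom-client library; the metric name and buckets are made up for the example and this is not necessarily how service-runner would implement it:

  // Option 5 sketch: observe GC pauses into a native Prometheus histogram
  const client = require('prom-client');
  const gcStats = require('gc-stats')();

  const gcPause = new client.Histogram({
    name: 'service_runner_gc_pause_seconds', // hypothetical metric name
    help: 'Node.js GC pause duration in seconds',
    labelNames: ['gctype'],
    buckets: [0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5]
  });

  gcStats.on('stats', (stats) => {
    // Convert nanoseconds to seconds, following Prometheus base-unit conventions
    gcPause.labels(String(stats.gctype)).observe(stats.pause / 1e9);
  });

  // The master would expose client.register.metrics() on an HTTP /metrics endpoint;
  // with cluster workers, prom-client's AggregatorRegistry can aggregate per-worker metrics.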

[1] https://github.com/wikimedia/service-runner
[2] https://www.npmjs.com/package/gc-stats
[3] https://github.com/prometheus/statsd_exporter
[4] https://github.com/prometheus/statsd_exporter/blob/e3d6050616f7e34c0c74fa03fedb4e119e6f37b5/exporter.go#L472
[5] https://github.com/prometheus/statsd_exporter/blob/e3d6050616f7e34c0c74fa03fedb4e119e6f37b5/exporter.go#L488

Event Timeline

akosiaris renamed this task from Re-evaluate service-runner's (ab) of statsd timing metric for nodejs GC stats to Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats. May 8 2019, 3:01 PM
Pchelolo edited projects, added Services (later); removed Services.
Pchelolo subscribed.

Thank you for an impressive level of detail :) There's a bunch of other places where we abuse the timing metric within services, exactly because we've needed percentiles, so the decision we make here should probably be adopted elsewhere.

I personally think that "3. Switch the metric type to a gauge" is the best way forward and it's the most reasonable compromise. Implementing native Prometheus support is the perfect solution, but it's a fairly hard project and would also require rewriting a lot of service code since it will likely be API-incompatible with current metrics object in JS. I believe we should do this eventually, just maybe not in this moment.

Using gauges here makes us dependent on prometheus-statsd-exporter. For services already switched to k8s that's fine, but for services not yet on k8s that would break the metrics - we wouldn't be able to get percentiles anymore. So maybe wait till all the services are on k8s?

Thank you for an impressive level of detail :) There's a bunch of other places where we abuse the timing metric within services, exactly because we've needed percentiles, so the decision we make here should probably be adopted elsewhere.

TIL. Thanks for that info. And yes, I agree.

I personally think that "3. Switch the metric type to a gauge" is the best way forward and it's the most reasonable compromise.

+1

Implementing native Prometheus support is the perfect solution, but it's a fairly hard project and would also require rewriting a lot of service code since it will likely be API-incompatible with current metrics object in JS. I believe we should do this eventually, just maybe not in this moment.

That's my opinion as well.

Using gauges here makes us dependent on prometheus-statsd-exporter. For services already switched to k8s that's fine, but for services not yet on k8s that would break the metrics - we wouldn't be able to get percentiles anymore. So maybe wait till all the services are on k8s?

Why would that be? We have profile::prometheus::statsd_exporter in puppet, which runs the exact same software (up to the version) on our puppet-managed hosts/services as well. We would need some mapping rules [1], but we are going to need those for the move to k8s anyway. The caveat would be we would have to update all dashboards for all services residing on the same host cluster (scb being the problem here) pretty much on the same timeframe, mostly due to the fact statsd configuration is the same for all services on the same host.

In any case, I don't think we are in any kind of rush. We've survived with this for years, we can survive a bit more.

[1] https://github.com/prometheus/statsd_exporter#metric-mapping-and-configuration

The caveat would be we would have to update all dashboards for all services residing on the same host cluster (scb being the problem here) pretty much on the same timeframe, mostly due to the fact statsd configuration is the same for all services on the same host.

Given that GC metrics and the other metrics that abuse timings are mostly informational and there are no alerts on them, I think it's not a problem to break them for a little bit while the dashboards are being updated.

Change 509811 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] cxserver: Switch GC stats back to microseconds

https://gerrit.wikimedia.org/r/509811

Change 509811 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] cxserver: Switch GC stats back to microseconds

https://gerrit.wikimedia.org/r/509811

Change 509836 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] eventgate: Switch GC metric to microseconds, update buckets

https://gerrit.wikimedia.org/r/509836

Change 509836 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] eventgate: Switch GC metric to microseconds, update buckets

https://gerrit.wikimedia.org/r/509836

Change 517803 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] citoid, mathoid, termbox: Switch GC metric to microseconds

https://gerrit.wikimedia.org/r/517803

Change 517803 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] citoid, mathoid, termbox: Switch GC metric to microseconds

https://gerrit.wikimedia.org/r/517803

Mentioned in SAL (#wikimedia-operations) [2019-06-19T08:21:19Z] <akosiaris> upgrade citoid, mathoid, termbox to latest chart releases to address the GC metric naming issue T220709 T222795

I believe this is now (partially?) done, and service-runner supports Prometheus natively these days. What do you think @akosiaris ?

I believe this is now (partially?) done, and service-runner supports Prometheus natively these days. What do you think @akosiaris ?

I think we require @Ottomata's input on https://github.com/wikimedia/service-runner/pull/245 first. But yes, it does look very close to being resolved.

I had to go back and check, but both eventgate and eventstreams are using service-runner's Prometheus support directly, and are no longer using prometheus-statsd-exporter.

T205870: Fully migrate producers off statsd

Oh, is that not related? Anyway, I'm not aware of any alerts on GC stats, and at the very worst we'll have to adjust the grafana panel next time we look at it.

Proceed!

Oh, is that not related? Anyway, I'm not aware of any alerts on GC stats, and at the very worst we'll have to adjust the grafana panel next time we look at it.

Proceed!

Perfect, thanks!

akosiaris claimed this task.

The patch has been merged but there isn't much point in tracking the adoption per service in this task, so resolving. We should open a different tracking task to encourage the services to be upgraded to service-runner 3.1.0.