
Upgrade prom-client in NodeJS service-runner and enable collectDefaultMetrics
Open, Needs Triage · Public · 5 Estimated Story Points

Description

In T347477 and T348950 we are working on upgrading some services from NodeJS 10 to NodeJS 18. It looks like there is a CPU usage (and possible latency) regression, perhaps due to GC changes in NodeJS 18.

service-runner GC metrics were removed a couple of years ago, so it is hard to diagnose this.

In version 12.0.0, prom-client added support for some default GC metrics.

Since then, it's also had a few breaking changes to its API.
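For example, registry.metrics() became asynchronous along the way, so a /metrics handler now has to await it. A minimal sketch (not service-runner code; the handler name is illustrative):

const client = require('prom-client');

// In recent prom-client releases registry.metrics() returns a Promise,
// so the response has to be written after awaiting it.
async function metricsHandler(req, res) {
    res.setHeader('Content-Type', client.register.contentType);
    res.end(await client.register.metrics());
}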

We should:

  • upgrade prom-client in service-runner, adapt to the new async API
  • (configurably?) call collectDefaultMetrics somewhere in service-runner when Prometheus is being used (see the sketch after this list)
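For reference, enabling the defaults with plain prom-client looks roughly like this (a minimal sketch; how and where service-runner should expose this is exactly what this task needs to decide):

const client = require('prom-client');

// Registers the default CPU, memory, event loop and (since v12) GC metrics
// on the given registry; should be called once per process.
client.collectDefaultMetrics({
    register: client.register, // the default global registry
    prefix: ''                 // optional metric name prefix
});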

Event Timeline

We can probably upgrade to prom-client 12.0.0 and get GC metrics without modifications to service-runner. Will try that first.

Or maybe not. I tried today and couldn't get the GC stats I wanted, but I also couldn't get them from the latest prom-client either. I'm probably doing something wrong.
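In case it helps to reproduce, here is a quick standalone check (assuming a local install of prom-client >= 12) that allocates some garbage and then greps the default metrics output for GC entries:

const client = require('prom-client');
client.collectDefaultMetrics();

// Allocate and drop a pile of objects so at least one GC cycle is likely to run.
let junk = [];
for (let i = 0; i < 1e6; i++) {
    junk.push({ i });
}
junk = null;

// Give the GC a moment, then print only the gc-related metric lines.
setTimeout(async () => {
    const text = await client.register.metrics();
    console.log(text.split('\n').filter((line) => line.includes('gc')).join('\n'));
}, 1000);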

lbowmaker set the point value for this task to 5. (Feb 16 2024, 7:45 PM)

I am taking a stab at this task, because we need GC and memory info to help track T357005: eventstreams regularly uses more than 95% of its memory limit.

One thing I noticed is that tests fail on node18: https://github.com/gmodena/service-runner/actions/runs/8063653898/job/22025928032 (borked bluebird dep?).

Support for Node 18 should follow from upgrading prom-client (and adopting its newer APIs).

@Jdforrester-WMF I do wonder if we should consider deprecating Node 14 and 16 though (they are both EOL and deprecated by GitHub, https://github.com/gmodena/service-runner/actions/runs/8063653898), instead of making backward-compatible changes.

@Jdforrester-WMF FWIW I saw you started deprecation work in https://github.com/wikimedia/service-runner/pull/249/files.

Is that PR worth following up to / merging before tackling prom-client upgrades?

It'd be great if someone could finally merge that PR, yes. I started adding Node 18/20 support in my spare time but given that no-one had merged the PR I didn't want to diverge too much.

Note that we currently can't release new versions of service-runner (the library was never moved to the Wikimedia org, and all npm releasers have left). We're probably also going to have to publish under a new name (e.g. @wikimedia/service-runner), so we can be even more flexible in making breaking changes; go for it! Happy to pair.

If it's at the point where we even need to use a new name, we might as well break everything. I'd love to join in on the fun.

Update: We've got access back, and v4.0.0 is finally released. Still happy to break whatever we need to, and help people migrate.

Nice!

First stab at upgrading to v12.0.0: https://github.com/wikimedia/service-runner/pull/252

How would you like to coordinate on this bit of work? I picked up this Phab task because of its downstream dependencies (we need to monitor memory consumption and GC on eventgate/eventstreams), but let me know if you'd rather implement the changes (enabling GC metrics collection) in your PR.
Happy to pair / code review either way.

If you could implement on top of my PR that'd be great.

FYI: this is a list of metrics reported with a local run:

curl http://localhost:9000/
# HELP process_cpu_user_seconds_total Total user CPU time spent in seconds.
# TYPE process_cpu_user_seconds_total counter
process_cpu_user_seconds_total 0.314283

# HELP process_cpu_system_seconds_total Total system CPU time spent in seconds.
# TYPE process_cpu_system_seconds_total counter
process_cpu_system_seconds_total 0.121872

# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.43615499999999996

# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 43597824

# HELP nodejs_eventloop_lag_seconds Lag of event loop in seconds.
# TYPE nodejs_eventloop_lag_seconds gauge
nodejs_eventloop_lag_seconds 0.002927442

# HELP nodejs_eventloop_lag_min_seconds The minimum recorded event loop delay.
# TYPE nodejs_eventloop_lag_min_seconds gauge
nodejs_eventloop_lag_min_seconds 0.00905216

# HELP nodejs_eventloop_lag_max_seconds The maximum recorded event loop delay.
# TYPE nodejs_eventloop_lag_max_seconds gauge
nodejs_eventloop_lag_max_seconds 0.015073279

# HELP nodejs_eventloop_lag_mean_seconds The mean of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_mean_seconds gauge
nodejs_eventloop_lag_mean_seconds 0.010480232160458453

# HELP nodejs_eventloop_lag_stddev_seconds The standard deviation of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_stddev_seconds gauge
nodejs_eventloop_lag_stddev_seconds 0.0004009805530038664

# HELP nodejs_eventloop_lag_p50_seconds The 50th percentile of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_p50_seconds gauge
nodejs_eventloop_lag_p50_seconds 0.010403839

# HELP nodejs_eventloop_lag_p90_seconds The 90th percentile of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_p90_seconds gauge
nodejs_eventloop_lag_p90_seconds 0.011075583

# HELP nodejs_eventloop_lag_p99_seconds The 99th percentile of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_p99_seconds gauge
nodejs_eventloop_lag_p99_seconds 0.011247615

# HELP nodejs_active_handles Number of active libuv handles grouped by handle type. Every handle type is C++ class name.
# TYPE nodejs_active_handles gauge
nodejs_active_handles{type="Pipe"} 1
nodejs_active_handles{type="WriteStream"} 2
nodejs_active_handles{type="ReadStream"} 1
nodejs_active_handles{type="Socket"} 1

# HELP nodejs_active_handles_total Total number of active handles.
# TYPE nodejs_active_handles_total gauge
nodejs_active_handles_total 5

# HELP nodejs_active_requests Number of active libuv requests grouped by request type. Every request type is C++ class name.
# TYPE nodejs_active_requests gauge

# HELP nodejs_active_requests_total Total number of active requests.
# TYPE nodejs_active_requests_total gauge
nodejs_active_requests_total 0

# HELP nodejs_heap_size_total_bytes Process heap size from Node.js in bytes.
# TYPE nodejs_heap_size_total_bytes gauge
nodejs_heap_size_total_bytes 13193216

# HELP nodejs_heap_size_used_bytes Process heap size used from Node.js in bytes.
# TYPE nodejs_heap_size_used_bytes gauge
nodejs_heap_size_used_bytes 11829024

# HELP nodejs_external_memory_bytes Node.js external memory size in bytes.
# TYPE nodejs_external_memory_bytes gauge
nodejs_external_memory_bytes 981710

# HELP nodejs_heap_space_size_total_bytes Process heap space size total from Node.js in bytes.
# TYPE nodejs_heap_space_size_total_bytes gauge
nodejs_heap_space_size_total_bytes{space="read_only"} 176128
nodejs_heap_space_size_total_bytes{space="old"} 9150464
nodejs_heap_space_size_total_bytes{space="code"} 630784
nodejs_heap_space_size_total_bytes{space="map"} 794624
nodejs_heap_space_size_total_bytes{space="large_object"} 1392640
nodejs_heap_space_size_total_bytes{space="code_large_object"} 0
nodejs_heap_space_size_total_bytes{space="new_large_object"} 0
nodejs_heap_space_size_total_bytes{space="new"} 1048576

# HELP nodejs_heap_space_size_used_bytes Process heap space size used from Node.js in bytes.
# TYPE nodejs_heap_space_size_used_bytes gauge
nodejs_heap_space_size_used_bytes{space="read_only"} 170944
nodejs_heap_space_size_used_bytes{space="old"} 8806560
nodejs_heap_space_size_used_bytes{space="code"} 527872
nodejs_heap_space_size_used_bytes{space="map"} 462384
nodejs_heap_space_size_used_bytes{space="large_object"} 1373416
nodejs_heap_space_size_used_bytes{space="code_large_object"} 0
nodejs_heap_space_size_used_bytes{space="new_large_object"} 0
nodejs_heap_space_size_used_bytes{space="new"} 491272

# HELP nodejs_heap_space_size_available_bytes Process heap space size available from Node.js in bytes.
# TYPE nodejs_heap_space_size_available_bytes gauge
nodejs_heap_space_size_available_bytes{space="read_only"} 0
nodejs_heap_space_size_available_bytes{space="old"} 59496
nodejs_heap_space_size_available_bytes{space="code"} 53760
nodejs_heap_space_size_available_bytes{space="map"} 314720
nodejs_heap_space_size_available_bytes{space="large_object"} 0
nodejs_heap_space_size_available_bytes{space="code_large_object"} 0
nodejs_heap_space_size_available_bytes{space="new_large_object"} 1031072
nodejs_heap_space_size_available_bytes{space="new"} 539800

# HELP nodejs_version_info Node.js version info.
# TYPE nodejs_version_info gauge
nodejs_version_info{version="v16.20.2",major="16",minor="20",patch="2"} 1

# HELP nodejs_gc_duration_seconds Garbage collection duration by kind, one of major, minor, incremental or weakcb.
# TYPE nodejs_gc_duration_seconds histogram
nodejs_gc_duration_seconds_bucket{le="0.001",kind="major"} 0
nodejs_gc_duration_seconds_bucket{le="0.01",kind="major"} 3
nodejs_gc_duration_seconds_bucket{le="0.1",kind="major"} 3
nodejs_gc_duration_seconds_bucket{le="1",kind="major"} 3
nodejs_gc_duration_seconds_bucket{le="2",kind="major"} 3
nodejs_gc_duration_seconds_bucket{le="5",kind="major"} 3
nodejs_gc_duration_seconds_bucket{le="+Inf",kind="major"} 3
nodejs_gc_duration_seconds_sum{kind="major"} 0.012455389976501464
nodejs_gc_duration_seconds_count{kind="major"} 3
nodejs_gc_duration_seconds_bucket{le="0.001",kind="incremental"} 4
nodejs_gc_duration_seconds_bucket{le="0.01",kind="incremental"} 4
nodejs_gc_duration_seconds_bucket{le="0.1",kind="incremental"} 4
nodejs_gc_duration_seconds_bucket{le="1",kind="incremental"} 4
nodejs_gc_duration_seconds_bucket{le="2",kind="incremental"} 4
nodejs_gc_duration_seconds_bucket{le="5",kind="incremental"} 4
nodejs_gc_duration_seconds_bucket{le="+Inf",kind="incremental"} 4
nodejs_gc_duration_seconds_sum{kind="incremental"} 0.002128873109817505
nodejs_gc_duration_seconds_count{kind="incremental"} 4

# HELP nodejs_process_heap_rss_bytes process heap usage
# TYPE nodejs_process_heap_rss_bytes gauge
nodejs_process_heap_rss_bytes{service="service-runner"} 52027392

I spent some time learning this codebase and am touching base to validate the direction. If this makes sense, I'll open a PR. My proposal is to add a new collect_default option to the prometheus metrics option block; when true, default metrics (including GC) are reported.

An example of this approach is here https://github.com/wikimedia/service-runner/compare/master...gmodena:service-runner:enable-gc-metric-collection
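For illustration, the options payload handed to the prometheus reporter would carry something like the following (values are illustrative; collect_default is the proposed name, not an existing option):

// Proposed shape of the prometheus metrics options block:
const metricsOptions = {
    type: 'prometheus',
    port: 9000,             // matches the local run shown earlier in this task
    collect_default: true   // new option: enable prom-client default metrics, incl. GC
};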

We need to collect default metrics only when running with no workers; in cluster mode metrics are aggregated, and we don't want that for GC traces. With the current implementation, calling collect_default() in cluster mode fails because prom-client tries to register the same metrics to a registry twice. I would not feel confident refactoring the registry management / prometheus code paths without adding more tests first.

Since PrometheusMetric is not aware of the cluster runtime configuration (num_workers), the config validation has been delegated to PrometheusServer.
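Very roughly, the guard amounts to this (names are illustrative; the actual change is in the branch linked above):

const client = require('prom-client');

// Only enable default metric collection when running without workers;
// in cluster mode the same default metrics would end up being registered
// to a registry twice, which prom-client rejects.
function maybeCollectDefaultMetrics(options) {
    if (options.collect_default && options.num_workers === 0) {
        client.collectDefaultMetrics({ register: client.register });
    }
}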

This implies that to collect GC metrics we'll need to reconfigure deployed services.

I pushed a bit too eagerly... Got the options payload wrong in PrometheusMetrics. I need to adjust some test conditions. Fix incoming.

@Ottomata @Jdforrester-WMF there's a caveat w.r.t. using collectDefaultMetrics: the method call does not allow setting custom labels. If I understand the docs correctly, we can still define them at the registry level, but this would clash with the current implementation. I'm not keen on refactoring the current behaviour given the state of the codebase, so I'd lean towards avoiding custom labels if possible. Are they used at all?
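For reference, the registry-level mechanism referred to above is Registry#setDefaultLabels; a minimal sketch, with the label value borrowed from the existing nodejs_process_heap_rss_bytes metric:

const client = require('prom-client');

// Default labels are attached to every metric exported from this registry,
// including the ones created by collectDefaultMetrics().
client.register.setDefaultLabels({ service: 'service-runner' });
client.collectDefaultMetrics();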

A couple of alternatives:

  • we could collect GC-only metrics (with https://github.com/SimenB/node-prometheus-gc-stats, see the sketch after this list) and wrap them with service-runner's Metric interfaces. Probably the cleanest option, but the Prometheus default metrics seem nice to have.
  • we could disable service-runner's custom Metric logic when collect_default: true and delegate both metric and label generation to prom-client. I'm unsure what would break in this case, and we don't have enough test coverage to raise my comfort level.
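For the first alternative, node-prometheus-gc-stats is used roughly as follows (going by its README, so treat the exact call shape as an assumption):

const client = require('prom-client');
const gcStats = require('prometheus-gc-stats'); // npm package behind the repo linked above

// gcStats() returns a start function bound to the given registry; calling it
// begins emitting GC count/duration metrics into that registry.
const startGcStats = gcStats(client.register);
startGcStats();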

FWIW, for now I introduced a no-op metric that, when registered in heapwatch (not the best of places), delegates the collection of default metrics to prom-client. Custom labels will be ignored.

See https://github.com/wikimedia/service-runner/commit/b9c98eab5398413c16df2317562745f6ffe74439
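The idea, very roughly (an illustrative sketch only; the real change is in the commit above):

const client = require('prom-client');

// A metric-shaped no-op: constructing/registering it is what turns on
// prom-client's default metric collection; the update methods do nothing.
class DefaultMetricsCollector {
    constructor(register = client.register) {
        client.collectDefaultMetrics({ register });
    }
    increment() {}
    decrement() {}
    observe() {}
    set() {}
}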

I added a simple test for this behavior in https://github.com/wikimedia/service-runner/commit/b9c98eab5398413c16df2317562745f6ffe74439#diff-2989127eb58727c1e883e59def67cd7fdfc319ef1bc8a893749029f36b4096a0

All in all it's not super clean, but if this approach makes sense I can touch things up a bit and open a PR atop https://github.com/wikimedia/service-runner/pull/252.