In T347477 and T348950 we are working on upgrading some services from NodeJS 10 to NodeJS 18. It looks like there is a CPU usage (and possibly latency) regression, perhaps due to GC changes in NodeJS 18.
service-runner GC metrics were removed a couple of years ago, so it is hard to diagnose this.
In version 12.0.0, prom-client added support for some default GC metrics.
Since then it has also had a few breaking changes to its API.
We should:
- upgrade prom-client in service-runner and adapt to the new async API
- (configurably?) call collectDefaultMetrics somewhere in service-runner when Prometheus is being used
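
A rough sketch of what the two items above could look like. This is not service-runner code: the helper names and the `collect_default_metrics` option are hypothetical, and the prom-client module is passed in as an argument only to keep the sketch self-contained. It assumes a prom-client release where `registry.metrics()` returns a Promise (13.0.0 or later, as far as I can tell).

```javascript
'use strict';

// Hypothetical helper: enable prom-client's default metrics (including
// the GC ones) unless the service config opts out.
// `promClient` is the prom-client module; `options` is an assumed
// config shape, not an existing service-runner setting.
function setupDefaultMetrics(promClient, options = {}) {
    if (options.collect_default_metrics === false) {
        return null;
    }
    const register = new promClient.Registry();
    // collectDefaultMetrics accepts a `register` and a `prefix` option.
    promClient.collectDefaultMetrics({
        register,
        prefix: options.prefix || ''
    });
    return register;
}

// Hypothetical helper: newer prom-client versions return a Promise from
// registry.metrics(), so callers that used the result synchronously
// need to await it.
async function renderMetrics(register) {
    return register.metrics();
}
```

A caller would pass `require('prom-client')` as the first argument and await `renderMetrics(register)` wherever the metrics endpoint builds its response.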