
increase CPU and Node heap limit?
Closed, ResolvedPublic

Description

What/Why:
We are currently focused on finding ways to improve the performance of our function-orchestrator service. Since we started tracking heap usage through logs, we have found that we too frequently use over 90% of the NodeJS heap limit. (Ticket and notes here). This week, therefore, we increased the Orchestrator container's k8s memory limit to 1GiB. However, based on the logs, our usage has not changed or decreased. We would like to consider ideas including:

  • Increasing the Orchestrator's CPU limit to 600ms rather than the current 400ms? If it's not too much trouble, would be great to try this out.
  • Increase k8s memory limit:
Current Heap Size Limit:, {\"total_heap_size\":75214848,\"total_heap_size_executable\":1572864,\"total_physical_size\":71139328,\"total_available_size\":2170029904,\"used_heap_size\":26766264,\"heap_size_limit\":2197815296,\"malloced_memory\":540768,\"peak_malloced_memory\":7202840,\"does_zap_garbage\":0,\"number_of_native_contexts\":1,\"number_of_detached_contexts\":0,\"total_global_handles_size\":40960,\"used_global_handles_size\":38816,\"external_memory\":2357124} !!!","service.name":"function-orchestrator"}

Which means Node has around 2GB heap limit when our k8s container tops at around 1GB.

How/Next-steps: tbd

Event Timeline

ecarg changed the task status from Open to In Progress.
ecarg triaged this task as Medium priority.
ecarg moved this task from To Triage to 25Q3 (Jan–Mar) on the Abstract Wikipedia team board.
ecarg updated the task description.

After looking at v8 heap usage, it doesn't seem like v8 is consuming much of the memory; rather, it's the process heap and perhaps rss.

  • v8 isn't using much, but rss is high despite low v8 usage, which means the memory pressure is elsewhere
  • possible causes? streams/open sockets, WASM?, memory leak(s), large buffers (reading files into memory?)
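
To separate V8 heap usage from the rest of the process footprint, something like the following could be logged. This is a minimal sketch, not the orchestrator's actual code: `memoryBreakdown` is a hypothetical helper name, and `rss - heapTotal` is only a rough estimate of memory held outside the V8 heap (buffers, WASM memory, native allocations, and so on).

```javascript
const v8 = require('node:v8');

// Hypothetical helper: summarises where the process's memory sits, in MB.
// Defaults read live values; fixed inputs can be passed for testing.
function memoryBreakdown(usage = process.memoryUsage(), heap = v8.getHeapStatistics()) {
  const MB = 1024 * 1024;
  return {
    rssMB: usage.rss / MB,
    v8HeapUsedMB: usage.heapUsed / MB,
    v8HeapTotalMB: usage.heapTotal / MB,
    externalMB: usage.external / MB,
    arrayBuffersMB: usage.arrayBuffers / MB,
    // Rough estimate only: resident memory that is not part of the V8 heap.
    nonV8MB: (usage.rss - usage.heapTotal) / MB,
    heapLimitMB: heap.heap_size_limit / MB,
  };
}

console.log(JSON.stringify(memoryBreakdown()));
```

If `nonV8MB` stays large while `v8HeapUsedMB` stays small, that would support the idea that the pressure comes from outside the V8 heap.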

However:

  • maybe still worth adjusting the Nodejs heap limit to be within the k8s bounds?
  • and can we try increasing CPU limit?

We'd love to get SRE input/thoughts! Would it be worthwhile, and could we try, increasing the CPU limit?

Increasing the Orchestrator's CPU limit to 600ms rather than the current 400ms?

The URL, namely https://grafana.wikimedia.org/goto/GEDK2xFHg?orgId=1, isn't a URL that grafana seems to know. What I see is the following:

image.png (609×900 px, 40 KB)

Current Heap Size Limit:, {\"total_heap_size\":75214848,\"total_heap_size_executable\":1572864,\"total_physical_size\":71139328,\"total_available_size\":2170029904,\"used_heap_size\":26766264,\"heap_size_limit\":2197815296,\"malloced_memory\":540768,\"peak_malloced_memory\":7202840,\"does_zap_garbage\":0,\"number_of_native_contexts\":1,\"number_of_detached_contexts\":0,\"total_global_handles_size\":40960,\"used_global_handles_size\":38816,\"external_memory\":2357124} !!!","service.name":"function-orchestrator"}

What generates this ? The reason I am asking is cause if this stanza is an entire copy paste (vs being truncated for the sake of being pasted into phabricator), then it's malformed JSON. I am assuming this is the output of https://nodejs.org/api/v8.html#v8_v8_getheapstatistics wrapped by some other code?

Looking a bit more at the values, looks like for this instance in time (I am using the MB = 1024*1024 definition below):

{                                                                               
  "total_heap_size": 75214848,
  "total_heap_size_executable": 1572864,
  "total_physical_size": 71139328,
  "total_available_size": 2170029904,
  "used_heap_size": 26766264,
  "heap_size_limit": 2197815296,
  "malloced_memory": 540768,
  "peak_malloced_memory": 7202840,
  "does_zap_garbage": 0,
  "number_of_native_contexts": 1,
  "number_of_detached_contexts": 0,
  "total_global_handles_size": 40960,
  "used_global_handles_size": 38816,
  "external_memory": 2357124
}
  • the total heap size (total_heap_size) was 71.7MB
  • total physical memory usage (total_physical_size) was 67.8MB
  • used heap memory (used_heap_size) was 25.5MB
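
For reference, the conversion behind these figures can be reproduced with a few lines. This is just the arithmetic, using the byte values from the log line above and the MB = 1024*1024 definition; `toMB` is an illustrative helper name.

```javascript
const MB = 1024 * 1024;

// Byte values copied from the heap statistics quoted above.
const stats = {
  total_heap_size: 75214848,
  total_physical_size: 71139328,
  used_heap_size: 26766264,
  heap_size_limit: 2197815296,
};

// Converts bytes to MB, rounded to one decimal place.
function toMB(bytes) {
  return (bytes / MB).toFixed(1);
}

for (const [name, bytes] of Object.entries(stats)) {
  console.log(`${name}: ${toMB(bytes)}MB`);
}
// total_heap_size: 71.7MB
// total_physical_size: 67.8MB
// used_heap_size: 25.5MB
// heap_size_limit: 2096.0MB
```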

Which means Node has around 2GB heap limit when our k8s container tops at around 1GB.

Yes, but unless you have some indication that the 1GB limit is indeed reached, this is, operationally speaking, inconsequential. This is corroborated by Grafana Saturation for function orchestrator for last 30d

image.png (382×1 px, 56 KB)

which shows that at peaks, memory usage has barely reached 250MB and RSS is moving around the 70MB mark, which aligns with the total_physical_size metric very well.

So, there doesn't seem to be a sign that any heap limit is being reached, quite the contrary, they max out at less than 1/4th of the assigned memory limit.

That being said, it probably does make sense to try and align the 2 values, at least to avoid surprises. I am not sure if service-utils has a configuration value for this. If it does, that would be the way to go.
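
If there is no such configuration value, one generic option is Node's own `--max-old-space-size` flag (in MB), which can also be passed via `NODE_OPTIONS`. The sketch below only illustrates how a value could be derived from the container limit. The helper names, the 0.75 fraction and the cgroup v2 path `/sys/fs/cgroup/memory.max` are assumptions for illustration, not something taken from the orchestrator's chart.

```javascript
const fs = require('node:fs');

// Hypothetical helper: picks a V8 old-space cap (in MB) as a fraction of the
// container memory limit, leaving headroom for non-heap memory.
// The 0.75 default is an assumption, not a recommendation from this task.
function suggestMaxOldSpaceMB(containerLimitBytes, fraction = 0.75) {
  return Math.floor((containerLimitBytes * fraction) / (1024 * 1024));
}

// Assumption: cgroup v2 layout. Returns null if the file is missing
// or the limit is unset ("max").
function readContainerLimitBytes(path = '/sys/fs/cgroup/memory.max') {
  try {
    const raw = fs.readFileSync(path, 'utf8').trim();
    return raw === 'max' ? null : Number(raw);
  } catch {
    return null;
  }
}

// Fall back to the 1GiB limit mentioned in this task when no cgroup limit is readable.
const limit = readContainerLimitBytes() ?? 1024 * 1024 * 1024;
console.log(`NODE_OPTIONS=--max-old-space-size=${suggestMaxOldSpaceMB(limit)}`);
```

With a 1GiB container limit and the assumed fraction this yields 768, which keeps the V8 heap limit below the container limit instead of the roughly 2GB seen above.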

As far as CPU limits go, there does indeed seem to be some throttling; you can go ahead and increase this one to, say, 1000m and see if the throttling goes away. That being said, both user and system CPU usage are pretty low, barely totalling 100ms in the last 30d, so this is a bit surprising. It could however be consistent with extremely small pieces of code that yield immediately being scheduled on the CPUs very often, which the bump should fix.
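
To make the throttling point concrete: Kubernetes enforces CPU limits through a CFS quota per scheduling period (100ms by default), so a process can be throttled within a period even when its average usage is low. The sketch below assumes that default period and the cgroup v2 `cpu.stat` format. The helper names and the sample values are illustrative, not real data from the orchestrator.

```javascript
// Converts a CPU limit in millicores to the CPU time (ms) allowed per CFS period.
// Assumes the default 100ms period unless told otherwise.
function quotaMsPerPeriod(millicores, periodMs = 100) {
  return (millicores * periodMs) / 1000;
}

// Parses cgroup v2 cpu.stat text and returns the share of periods that were throttled.
function throttleRatio(cpuStatText) {
  const fields = Object.fromEntries(
    cpuStatText.trim().split('\n').map((line) => {
      const [key, value] = line.trim().split(/\s+/);
      return [key, Number(value)];
    })
  );
  if (!fields.nr_periods) return 0;
  return fields.nr_throttled / fields.nr_periods;
}

// Illustrative sample only; not taken from the orchestrator's pods.
const sample = `usage_usec 5000000
user_usec 4000000
system_usec 1000000
nr_periods 1000
nr_throttled 50
throttled_usec 120000`;

console.log(quotaMsPerPeriod(400));  // CPU ms per period at a 400m limit
console.log(quotaMsPerPeriod(1000)); // CPU ms per period at a 1000m limit
console.log(throttleRatio(sample));  // fraction of periods throttled in the sample
```

Under these assumptions a 400m limit allows 40ms of CPU time per 100ms period, so a short burst that needs more than that within one period is throttled regardless of the long-term average.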

TY @akosiaris!

What generates this ? The reason I am asking is cause if this stanza is an entire copy paste (vs being truncated for the sake of being pasted into phabricator), then it's malformed JSON. I am assuming this is the output of https://nodejs.org/api/v8.html#v8_v8_getheapstatistics wrapped by some other code?

Yes, and this is just something I manually put into the code as I was debugging and inspecting logs.
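
For anyone repeating this kind of debugging, a sketch of a log statement that emits the statistics as one well-formed JSON line might look like the following. `heapLogLine` and the field names other than `service.name` are hypothetical and not the code that produced the output above.

```javascript
const v8 = require('node:v8');

// Hypothetical helper: builds a single well-formed JSON log line from
// v8.getHeapStatistics(). Serialising the whole object once avoids the
// nested-escaping problem seen in the pasted line.
function heapLogLine(serviceName, stats = v8.getHeapStatistics()) {
  return JSON.stringify({
    message: 'heap statistics',
    'service.name': serviceName,
    heap: stats,
    // Share of the configured V8 heap limit currently in use.
    heapUsedRatio: stats.used_heap_size / stats.heap_size_limit,
  });
}

console.log(heapLogLine('function-orchestrator'));
```

Because the output is a single JSON object, it can be parsed directly by log tooling without unescaping.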

I should note that per Grafana

image.png (1×1 px, 84 KB)

there is no discernible difference as far as CPU usage goes, at least not yet.

Thanks @akosiaris for the updates! I think we are fine to bring this back to what it was before, as there are no signs of any real effect; thank you!

Change #1127057 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] Revert "wikifunctions: Raise orchestrator top CPU limit to 1 to see if that improves heap issues"

https://gerrit.wikimedia.org/r/1127057

Change #1127057 merged by jenkins-bot:

[operations/deployment-charts@master] Revert "wikifunctions: Raise orchestrator top CPU limit to 1 to see if that improves heap issues"

https://gerrit.wikimedia.org/r/1127057