
eventstreams regularly uses more than 95% of its memory limit
Open, In Progress, Low, Public

Description

As we can see from alert history, eventstreams regularly hits the alerting threshold for sustained memory usage.

Its current limit is 1000Mi; I think raising it to 1100Mi would avoid triggering the alert as often.
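For context, headroom against the current limit can be eyeballed per pod with a query along these lines. This is only a sketch using the cAdvisor metric referenced later in this task, with 1000Mi = 1000 × 1024 × 1024 bytes:

# fraction of the 1000Mi limit in use, per pod
container_memory_usage_bytes{namespace="eventstreams", container="eventstreams"} / (1000 * 1024 * 1024)

Values sustained above 0.95 correspond to the >95% condition the alert fires on.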

Event Timeline

Change 998945 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] eventstreams: Raise memory limit to 1100Mi

https://gerrit.wikimedia.org/r/998945

I am not sure this would actually solve the problem tbh. It doesn't hurt to try ofc, which is why I +1ed, but the following 7d graph, displaying max/min/stddev/avg among the 9 pods in codfw, paints a picture where we probably have 1 or 2 pods at high memory usage, indicating some pattern of usage causing this.

image.png (472×1 px, 72 KB)

And this is corroborated by a topk(5) query

image.png (472×1 px, 146 KB)

Note how it tends to be 1 or 2 pods that are close to the limit, with the next 3 well below that, close to the average. And there are another 4 that are well below that, as shown in the bottomk(4) query below.

image.png (472×1 px, 67 KB)
(this is actually a 12h graph, it's a bit clearer, but the pattern stands)
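For reference, the bottomk variant is presumably just the mirror of the topk query pasted further down:

bottomk(4, container_memory_usage_bytes{namespace="eventstreams", container="eventstreams"})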

Since eventstreams streams data to long-running clients, I am starting to think that there is a memory leak somewhere, and that it hits pods differently depending on what the clients ask for and how long they stay connected to a pod.

Maximum concurrent clients appears to be ~100 for recentchange; everything else is in the single digits. Maximum stream connection duration appears to be ~10 days.

Links

Sorry for not pasting links to those Explore graphs; w.wiki complained that I went above the 2k URL length limit it supports.

Here is my query

topk(5, container_memory_usage_bytes{namespace="eventstreams", container="eventstreams"})

with some max, min, stddev, avg variants.
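Presumably the variants swap topk for aggregations over the same selector, along the lines of:

max(container_memory_usage_bytes{namespace="eventstreams", container="eventstreams"})
min(container_memory_usage_bytes{namespace="eventstreams", container="eventstreams"})
avg(container_memory_usage_bytes{namespace="eventstreams", container="eventstreams"})
stddev(container_memory_usage_bytes{namespace="eventstreams", container="eventstreams"})

(with the codfw scoping presumably coming from the Prometheus data source selected in Explore rather than from a label matcher)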

Change 998945 merged by jenkins-bot:

[operations/deployment-charts@master] eventstreams: Raise memory limit to 1100Mi

https://gerrit.wikimedia.org/r/998945

Deployed, we'll see how the memory consumption evolves.

I agree the data above is a strong indicator of a memory leak, although I'm wondering about the stream connection duration metric, considering the pods themselves were all at most ~2 days old when I deployed the memory limit change today.
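If it helps to cross-check, container age can be graphed directly, assuming the cAdvisor container_start_time_seconds metric is scraped for this namespace:

time() - container_start_time_seconds{namespace="eventstreams", container="eventstreams"}

which should make it clear whether any pod has been up long enough to carry a ~10 day connection.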

Clement_Goubert changed the task status from Open to In Progress. Feb 9 2024, 11:52 AM
Clement_Goubert triaged this task as Low priority.
Clement_Goubert moved this task from Incoming 🐫 to Doing 😎 on the serviceops board.
Ottomata added subscribers: tchin, gmodena, Ottomata.

wondering about the stream connection duration

IIRC, varnish(?) sets an HTTP timeout of something like 15 minutes, so something doesn't seem right about that metric to me either.

There's no routing stickiness, so a reconnecting client could be routed to any pod. Something smells leaky to me too...

Is this increasing memory usage a new occurrence?

cc @gmodena @tchin
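On whether the increase is new: one quick way to check would be to graph the daily peak across pods over the past couple of months, e.g. this sketch reusing the metric from the query above:

max(max_over_time(container_memory_usage_bytes{namespace="eventstreams", container="eventstreams"}[1d]))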

Looking at the logs, this seems to coincide with the redaction patch to eventstreams, but looking at the code I'm having a hard time finding where a memory leak could've been introduced... what's more confusing is that it's just 1 or 2 pods hitting the limit.

If you have time to dive deep, you can live inspect a nodejs process and search for memory leaks.

https://www.toptal.com/nodejs/debugging-memory-leaks-node-js-applications

Enable debug mode in helmfile and redeploy:

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/eventstreams/values.yaml#37

You can connect chrome debug tools to the node inspector port.

It might be easier to try this locally in your dev env, if that is possible.

Looking at the logs, this seems to coincide with the redaction patch to eventstreams, but looking at the code I'm having a hard time finding where a memory leak could've been introduced... what's more confusing is that it's just 1 or 2 pods hitting the limit.

Good catch.

We do add ~60 KB of overhead (a list of pages) to the router.get code path, but at these traffic volumes it should not have had a significant impact on saturation.