
linkrecommendation-internal regularly uses more than 95% of its memory limit
Open, In Progress, Low, Public

Description

As we can see from alert history, linkrecommendation-internal regularly hits the alerting threshold for sustained memory usage.

Its overall memory usage stays pretty stable, but too close to the 850MiB limit.

As it is, average usage is 762MiB, so I'm proposing to raise its requests to 750MiB and its limit to 950MiB.

Event Timeline

Clement_Goubert changed the task status from Open to In Progress. Feb 9 2024, 12:35 PM
Clement_Goubert triaged this task as Low priority.
Clement_Goubert moved this task from Incoming 🐫 to Doing 😎 on the serviceops board.

I just saw @akosiaris already bumped it by 200MiB last week. If that naive increase isn't sufficient, it may warrant more investigation into the memory usage behavior of linkrecommendation.

Already bumped by 200Mi in a9f958e50e5f5f4a8 (T266216). I think we'll instead need to dig a bit into why this is happening.

Change 999698 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] linkrecommendation-internal: Raise memory requests and limits

https://gerrit.wikimedia.org/r/999698

> Already bumped by 200Mi in a9f958e50e5f5f4a8 (T266216). I think we'll instead need to dig a bit into why this is happening.

I'm wondering whether this might be happening because specific kinds of requests are more expensive than others. I have a log of the requests that MediaWiki triggered, but I can't map them to individual containers. Is there any existing logging that can be used in situations like this one? Or do we have some other profiling setup for Python-based services?

It doesn't look like linkrecommendation logs a request_id that would correspond to the MediaWiki request. It should probably be passed and logged to facilitate tracing these kinds of issues.
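Something along these lines could work, assuming the service's Flask app (the X-Request-Id header name and the hook wiring here are illustrative, not the service's actual code):

```
import logging
import uuid

from flask import Flask, g, request

app = Flask(__name__)


@app.before_request
def capture_request_id():
    # Reuse the ID MediaWiki sent (commonly the X-Request-Id header),
    # or fall back to a fresh one so every request stays traceable.
    g.request_id = request.headers.get("X-Request-Id", str(uuid.uuid4()))


@app.after_request
def log_request(response):
    # Emitting the request_id on every access log line lets us map a
    # MediaWiki request to the container that served it.
    logging.getLogger(__name__).info(
        "request_id=%s method=%s path=%s status=%s",
        g.request_id, request.method, request.path, response.status_code,
    )
    return response
```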
Proper profiling would need some software-side instrumentation, such as tracemalloc for memory specifically, gated by a feature flag in the chart. We would then deploy a dedicated release (for example, linkrecommendation-internal-debug) with that feature flag turned on and routed_via: internal in its values file, so it receives some of the same traffic linkrecommendation-internal does.
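A rough sketch of what that instrumentation could look like (the PROFILE_MEMORY variable and the debug endpoint are illustrative names, not existing code; the chart's feature flag would just set such an environment variable in the debug release):

```
import os
import tracemalloc

from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical feature flag the chart would set for the -debug release.
if os.environ.get("PROFILE_MEMORY") == "1":
    tracemalloc.start(25)  # keep up to 25 traceback frames per allocation


@app.route("/debug/memory")
def memory_snapshot():
    # Return the top allocation sites so we can see what grows
    # under real internal traffic.
    if not tracemalloc.is_tracing():
        return jsonify({"error": "memory profiling is disabled"}), 404
    snapshot = tracemalloc.take_snapshot()
    top = snapshot.statistics("lineno")[:20]
    return jsonify([str(stat) for stat in top])
```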

> It doesn't look like linkrecommendation logs a request_id that would correspond to the MediaWiki request. It should probably be passed and logged to facilitate tracing these kinds of issues.

Does that mean linkrecommendation should be logging each request? If so, that can already be done (log verbosity is controlled by FLASK_LOGLEVEL; currently it's set to WARNING in production). I think we did that in the past; however, from T296334: Make linkrecommendation service logging more useful and linked tasks, I have the impression that Growth previously worked on reducing the log volume, so I'm unsure whether doing that would be helpful.
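For illustration, this is roughly how an environment-driven level gates that kind of logging (the exact wiring in linkrecommendation may differ):

```
import logging
import os

# With WARNING in production, INFO-level per-request logs are suppressed;
# lowering FLASK_LOGLEVEL to INFO would re-enable them without a code change.
logging.basicConfig(
    level=os.environ.get("FLASK_LOGLEVEL", "WARNING").upper()
)
```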

> Proper profiling would need some software-side instrumentation, such as tracemalloc for memory specifically, gated by a feature flag in the chart. We would then deploy a dedicated release (for example, linkrecommendation-internal-debug) with that feature flag turned on and routed_via: internal in its values file, so it receives some of the same traffic linkrecommendation-internal does.

Ack, thanks.