I've been doing some work to improve the performance of the copyvios tool, which is now running on k8s after using the old grid for a long time. k8s enforces a memory limit of 2G, despite the tool previously having its limit raised to 6G. I'm not sure I need that much, but 4G would help a lot, I think. Can this limit be raised easily?
Is there any evidence that the tool is actually running out of memory? The grid counts memory by virtual size (which is IMO not sane); I can't find authoritative documentation from a quick search, but I'm inclined to think k8s accounts memory by resident set size (which is IMO much saner), and you are much less likely to run out of that, even with a lower limit. <rant>This is why people say 'I don't need so much memory when I run it at home, so why do I have to specify such a large number in -mem?'</rant>
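To make the distinction concrete, here's a minimal sketch (plain stdlib, nothing specific to your tool) that prints both numbers for the current process from /proc; VmSize is what the grid was limiting, VmRSS is roughly what a cgroup-based limit cares about:

```python
# Minimal sketch: compare virtual size (VmSize) with resident set size (VmRSS)
# for the current process. Linux only; reads /proc/self/status.

def mem_summary():
    fields = {}
    with open("/proc/self/status") as f:
        for line in f:
            key, _, value = line.partition(":")
            if key in ("VmSize", "VmRSS"):
                fields[key] = value.strip()
    return fields

if __name__ == "__main__":
    # VmSize is typically several times larger than VmRSS, which is why a
    # limit on virtual size forces you to request more memory than the
    # process ever actually touches.
    print(mem_summary())
```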
uWSGI logs the following every few hours, which I assume is the OOM killer at work:
DAMN ! worker 1 (pid: 546) died, killed by signal 9 :( trying respawn ...
Admittedly I'm running with more workers than normal (8 instead of 4), but even with 4, I was seeing these messages every so often. The reason I've upped the number of workers is that individual requests can take a long time, so it's possible for all 4 to be occupied, causing requests to back up.
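For reference, one way to confirm that the signal 9 is the cgroup limit being enforced (rather than something else) would be to read the cgroup-v1 memory counters from inside the container; a rough sketch, assuming the memory controller is mounted at the usual /sys/fs/cgroup/memory:

```python
# Rough sketch: read cgroup-v1 memory counters from inside the container.
# A non-zero failcnt and a max_usage near the limit would point at the
# cgroup OOM killer. Assumes the controller is at /sys/fs/cgroup/memory.

CGROUP = "/sys/fs/cgroup/memory"

def read_counter(name):
    with open(f"{CGROUP}/{name}") as f:
        return int(f.read())

if __name__ == "__main__":
    limit = read_counter("memory.limit_in_bytes")      # the enforced limit (2G here)
    peak = read_counter("memory.max_usage_in_bytes")   # high-water mark
    fails = read_counter("memory.failcnt")             # times the limit was hit
    print(f"limit={limit} peak={peak} failcnt={fails}")
```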
I've done some work in the past to check whether this is a memory leak in my tool, but I never found anything. However, since you've mentioned that the old grid and k8s count memory differently, I'm starting to think I'm just being greedy. I'll do some further research to see if I can finally pinpoint the cause.
Hmm, the dmesg output is a bit confusing (I've redacted all kernel addresses because of KASLR):
Comparing line 38 with line 33, line 38's values for total_vm and rss are four times as big as line 33's, so line 33 must be counted in pages and line 38 in kB (a page is 4 kB, so the same quantity expressed in kB is four times the page count).
Summing total_vm gives 1922993 pages (= 7691972 kB), while summing rss gives 541414 pages (= 2165656 kB). Line 21 says the cgroup 'memory' usage (which type of usage?) is 2097152 kB, and then line 24 says the rss usage is 2077000 kB. So is the total RSS 2077000 kB or 2165656 kB?
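Spelling out the unit conversion used above (assuming the usual 4 kB page size):

```python
# Unit conversion for the dmesg sums above, assuming 4 kB pages.
PAGE_KB = 4

total_vm_pages = 1922993
rss_pages = 541414

print(total_vm_pages * PAGE_KB)  # 7691972 kB of virtual size
print(rss_pages * PAGE_KB)       # 2165656 kB of resident memory
```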
Looking at https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt, section 2.2.1 says RSS + page cache are accounted; then 2077000 + 6124 = 2083124 kB, which is still 14028 kB short of 2097152 kB, and that gap is exactly the amount of mapped kernel memory shown in line 23.
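If anyone wants to repeat that comparison directly on the node, something along these lines should do it (assuming cgroup v1 with the memory controller at the standard mount point; memory.stat values are in bytes):

```python
# Sketch: compare memory.usage_in_bytes against the rss + cache fields of
# memory.stat, per Documentation/cgroup-v1/memory.txt section 2.2.1.
# Assumes cgroup v1 mounted at /sys/fs/cgroup/memory; values are bytes.

CGROUP = "/sys/fs/cgroup/memory"

def read_stat():
    stats = {}
    with open(f"{CGROUP}/memory.stat") as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return stats

if __name__ == "__main__":
    with open(f"{CGROUP}/memory.usage_in_bytes") as f:
        usage = int(f.read())
    stats = read_stat()
    accounted = stats["rss"] + stats["cache"]
    print(f"usage_in_bytes: {usage}")
    print(f"rss + cache:    {accounted}")
    print(f"difference:     {usage - accounted}")
```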
My best guess from that is that the cgroup indeed counts by RSS and your RSS is indeed pretty big, although the exact way it sums up the RSS (2165656 kB vs 2077000 kB, about 4%) differs slightly from what I'd expect from numerically summing the per-process values.
One way you could check for a memory leak is to get a core dump (if you want, tell me a pid and I can grab one via gdb while the process is suspended), but trying to determine what is leaking from a core dump could be tedious.
Another way would be to use a library that collects statistics on which objects are currently allocated; https://stackoverflow.com/q/1435415 has some examples of such libraries. I've personally used https://pypi.org/project/mem_top/ once or twice, but the last time I tried it, it only worked on Python 2. Note, though, that mem_top implicitly invokes gc, so if the issue simply disappears after using such a library, it may be that gc is running too infrequently for some obscure reason. (I encountered this issue once outside Toolforge. The solution? Manually invoke gc periodically. *facepalm* Honestly, even after reading Python's gc documentation twice, I still don't understand when gc is invoked automatically.)
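If mem_top turns out not to work on Python 3, a crude snapshot along the same lines can be put together from the standard library alone; just a sketch (the top_types helper is made up here, not part of any library):

```python
import gc
from collections import Counter

def top_types(limit=20):
    """Count live objects by type name, most common first (a crude leak hint)."""
    # Note: collecting first can itself hide a "leak" that is really just
    # the gc running too infrequently, as described above.
    gc.collect()
    counts = Counter(type(obj).__name__ for obj in gc.get_objects())
    return counts.most_common(limit)

if __name__ == "__main__":
    for name, count in top_types():
        print(f"{count:>8}  {name}")
```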
Did some investigating with my tool of choice, guppy, and found a potential "leak" (it really shouldn't count as one, but apparently a stack frame was living longer than intended and keeping a bunch of objects alive with it). With that cleaned up, the pure-Python tools no longer report any leak candidates, but memory usage still seems kind of high. I'll follow up.
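For reference, a minimal sketch of this kind of guppy check (not the exact code I ran; the part that exercises the tool is just a placeholder):

```python
# Roughly: set a relative heap reference point, exercise the tool, then
# look at what was allocated since that point, grouped by type.
from guppy import hpy  # guppy3 on Python 3

hp = hpy()
hp.setrelheap()  # future heap() calls report allocations relative to now

# ... run a few copyvio checks here (placeholder) ...

print(hp.heap())  # objects still alive that were allocated since setrelheap()
```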
I've managed to fix a couple more bugs and poor design choices in the tool, and it looks like memory usage has fallen to more reasonable levels, so I'm closing this ticket. Thanks for the help earlier!