We noticed a respectable number of errors on codfw over the past few weeks.
Digging deeper, we made the following observations, keeping in mind that eqiad's memcached traffic is roughly 3x codfw's:
Restarts
* eqiad: we had almost no restarts
* codfw: quite a few pods with 4-5+ restarts recorded
Allocated fibers
* eqiad: some pods were going over the 10k limit, but nothing worrisome
* codfw: many pods were going over the 10k limit, and the pattern also confirmed that we indeed had restarts
allocated fibers eqiad
allocated fibers codfw
Looking at the Kubernetes node logs, we found that mcrouter was getting OOM-killed:
[Wed Sep 4 13:53:44 2024] Memory cgroup out of memory: Killed process 802345 (mcrouter) total-vm:1347040kB, anon-rss:1039776kB, file-rss:18384kB, shmem-rss:0kB, UID:778 pgtables:2364kB oom_score_adj:998
This should be considered normal; however, the fact remains that eqiad serves 3 times this traffic and does not have any mcrouter pods OOM-killed.
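To make the numbers in the OOM-kill line quoted above easier to read, here is a minimal sketch in Python that pulls out the PID, process name and memory counters and sums the resident memory. The helper name `parse_oom_line` and the regular expressions are illustrative assumptions, not part of any existing tooling, and they only target this particular kernel message layout.

```python
import re

# The kernel log line quoted above, copied verbatim.
OOM_LINE = (
    "[Wed Sep 4 13:53:44 2024] Memory cgroup out of memory: Killed process 802345 "
    "(mcrouter) total-vm:1347040kB, anon-rss:1039776kB, file-rss:18384kB, "
    "shmem-rss:0kB, UID:778 pgtables:2364kB oom_score_adj:998"
)

# Hypothetical patterns: one for the killed process, one for "name:<n>kB" counters.
OOM_RE = re.compile(r"Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\)")
FIELD_RE = re.compile(r"(?P<name>[\w-]+):(?P<kb>\d+)kB")


def parse_oom_line(line):
    """Return pid, process name and kB counters, or None if this is not an OOM kill."""
    match = OOM_RE.search(line)
    if match is None:
        return None
    counters = {f.group("name"): int(f.group("kb")) for f in FIELD_RE.finditer(line)}
    return {"pid": int(match.group("pid")), "comm": match.group("comm"), "kb": counters}


info = parse_oom_line(OOM_LINE)
# Resident memory = anonymous + file-backed + shared memory pages.
rss_kb = info["kb"]["anon-rss"] + info["kb"]["file-rss"] + info["kb"]["shmem-rss"]
print(f"{info['comm']} (pid {info['pid']}): resident ~{rss_kb / 1024:.0f} MiB")
```

For the line above this reports roughly 1033 MiB resident for mcrouter at the time it was killed, almost all of it anonymous memory.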
Update Sept 5 2024: we doubled the memory limits of mw-mcrouter. The restarts and OOM kills have stopped; however, we are still observing errors.
Many thanks to @TK-999 for joining the debugging session for this.


