
Mediawiki mcrouter errors on codfw
Closed, Resolved, Public

Assigned To
Authored By
jijiki
Sep 4 2024, 3:47 PM
Referenced Files
F57489292: image.png
Sep 9 2024, 1:28 PM
F57489290: image.png
Sep 9 2024, 1:28 PM
F57489243: image.png
Sep 9 2024, 1:28 PM

Description

We noticed a considerable number of mcrouter errors on codfw over the past few weeks, for example:

image.png (1,496×635 px, 153 KB)

Digging deeper, we made the following observations, keeping in mind that eqiad's memcached traffic is roughly 3x codfw's:

Restarts
* eqiad: we had almost no restarts
* codfw: quite a few pods with 4-5+ restarts recorded
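
A minimal sketch of how the restart comparison above could be reproduced from saved `kubectl get pods -o json` output. The `containerStatuses[].restartCount` field is part of the standard Kubernetes pod status; the pod names, the sample data, and the threshold of 4 are illustrative assumptions, not values taken from the actual clusters.

```python
import json
from typing import Dict


def pods_with_restarts(pod_list_json: str, min_restarts: int = 4) -> Dict[str, int]:
    """Return {pod name: restart count} for pods at or above min_restarts.

    The count is the sum of restartCount over all containers in the pod.
    """
    flagged = {}
    for pod in json.loads(pod_list_json).get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        restarts = sum(s.get("restartCount", 0) for s in statuses)
        if restarts >= min_restarts:
            flagged[pod["metadata"]["name"]] = restarts
    return flagged


# Hypothetical sample mimicking the shape of `kubectl get pods -o json`.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "mcrouter-example-a"},
         "status": {"containerStatuses": [{"name": "mcrouter", "restartCount": 5}]}},
        {"metadata": {"name": "mcrouter-example-b"},
         "status": {"containerStatuses": [{"name": "mcrouter", "restartCount": 0}]}},
    ]
})

flagged = pods_with_restarts(sample)
print(flagged)
```

Running the same function against the output from each datacenter gives a quick side-by-side count of affected pods.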

Allocated fibers

* eqiad: some pods were going over the 10k limit, but nothing worrisome
* codfw: many pods were going over the 10k limit, and the pattern also confirmed that we indeed had restarts

allocated fibers eqiad
allocated fibers codfw

image.png (2,960×1,068 px, 283 KB)

image.png (2,982×1,074 px, 636 KB)

Looking at the Kubernetes node logs, we found that mcrouter was getting OOM-killed:

[Wed Sep 4 13:53:44 2024] Memory cgroup out of memory: Killed process 802345 (mcrouter) total-vm:1347040kB, anon-rss:1039776kB, file-rss:18384kB, shmem-rss:0kB, UID:778 pgtables:2364kB oom_score_adj:998

While this could be considered normal, the fact remains that eqiad serves 3 times this traffic and does not have any mcrouter pods OOM-killed.
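
For reference, the OOM-kill line quoted above can be broken into its fields with a short script. This is only a sketch: the regular expressions are written against this one log line, and other kernel versions may format the message differently. The function and variable names are illustrative.

```python
import re
from typing import Optional

# The kernel log line exactly as quoted in this task.
OOM_LINE = (
    "[Wed Sep 4 13:53:44 2024] Memory cgroup out of memory: "
    "Killed process 802345 (mcrouter) total-vm:1347040kB, anon-rss:1039776kB, "
    "file-rss:18384kB, shmem-rss:0kB, UID:778 pgtables:2364kB oom_score_adj:998"
)

KILLED = re.compile(r"Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\)")
SIZE_FIELD = re.compile(r"(?P<key>[a-z-]+):(?P<kb>\d+)kB")


def parse_oom_kill(line: str) -> Optional[dict]:
    """Extract pid, process name and the kB-sized fields from an OOM-kill line."""
    match = KILLED.search(line)
    if match is None:
        return None
    sizes = {f.group("key"): int(f.group("kb")) for f in SIZE_FIELD.finditer(line)}
    # Resident memory is the sum of the three *-rss fields reported by the kernel.
    resident_kb = sum(v for k, v in sizes.items() if k.endswith("-rss"))
    return {
        "pid": int(match.group("pid")),
        "comm": match.group("comm"),
        "sizes_kb": sizes,
        "resident_kb": resident_kb,
    }


info = parse_oom_kill(OOM_LINE)
print(info["comm"], info["resident_kb"], "kB resident")
```

For this line the resident total is 1058160 kB, i.e. the process was holding about 1 GiB when it was killed.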

Update Sept 5 2024: we doubled the memory limits of mw-mcrouter. The restarts and OOM kills have stopped; however, we are still observing errors.

Many thanks to @TK-999 for joining the debugging session for this

Event Timeline

Change #1070633 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] mcrouter: double mem limits

https://gerrit.wikimedia.org/r/1070633

Change #1070633 merged by jenkins-bot:

[operations/deployment-charts@master] mcrouter: double mem limits

https://gerrit.wikimedia.org/r/1070633

jijiki renamed this task from "mcrouter getting oomkilled" to "Mediawiki mcrouter errors on codfw". Sep 9 2024, 1:28 PM
jijiki claimed this task.
jijiki added a project: serviceops-deprecated.
jijiki updated the task description.
jijiki updated the task description.
jijiki added a subscriber: TK-999.

So far, we have concluded that the errors are related to T374366, so we are closing this task for now.