
Production warning: Node OOMKilling something
Closed, Resolved · Public

Description

Not sure what's happening, but the node is reporting that it's OOM-killing something, and that's usually not good.

None of the pods report being restarted, so this doesn't make much sense to me.

logs: https://cloudlogging.app.goo.gl/rZu695w2NRjEEDPc7

kubectl describe node gke-wbaas-3-medium-pool-f591592a-k67c


Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests      Limits
  --------                   --------      ------
  cpu                        689m (73%)    2060m (219%)
  memory                     1367Mi (48%)  3341Mi (118%)
  ephemeral-storage          0 (0%)        0 (0%)
  hugepages-1Gi              0 (0%)        0 (0%)
  hugepages-2Mi              0 (0%)        0 (0%)
  attachable-volumes-gce-pd  0             0
Events:
  Type     Reason      Age   From            Message
  ----     ------      ----  ----            -------
  Warning  OOMKilling  47m   kernel-monitor  Memory cgroup out of memory: Killed process 541912 (apache2) total-vm:458228kB, anon-rss:141820kB, file-rss:15828kB, shmem-rss:45480kB, UID:33 pgtables:552kB oom_score_adj:937
  Warning  OOMKilling  44m   kernel-monitor  Memory cgroup out of memory: Killed process 542334 (apache2) total-vm:471132kB, anon-rss:154468kB, file-rss:16636kB, shmem-rss:49848kB, UID:33 pgtables:584kB oom_score_adj:937
  Warning  OOMKilling  15m   kernel-monitor  Memory cgroup out of memory: Killed process 542296 (apache2) total-vm:462364kB, anon-rss:146204kB, file-rss:13684kB, shmem-rss:42372kB, UID:33 pgtables:528kB oom_score_adj:937
  Warning  OOMKilling  10m   kernel-monitor  Memory cgroup out of memory: Killed process 557837 (apache2) total-vm:460160kB, anon-rss:142572kB, file-rss:15704kB, shmem-rss:43632kB, UID:33 pgtables:560kB oom_score_adj:937
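These look like memory-cgroup OOM kills, i.e. a container hitting its own memory limit rather than the node running out of memory overall. To see which pods are actually scheduled on this node and what memory they request and are limited to, something like the following should work (node name taken from the describe output above; comparing it against kubectl top pods from metrics-server would show actual usage next to the requests):

# list pods on the affected node with their memory requests/limits
kubectl get pods --all-namespaces \
  --field-selector spec.nodeName=gke-wbaas-3-medium-pool-f591592a-k67c \
  -o custom-columns='NAMESPACE:.metadata.namespace,POD:.metadata.name,REQ_MEM:.spec.containers[*].resources.requests.memory,LIM_MEM:.spec.containers[*].resources.limits.memory'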

Event Timeline

This still seems to be a problem. This is likely because the mediawiki pods are consistently using more memory than they request. We should therefore investigate whether to increase the requested memory of the mediawiki pods.
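If we do bump the requests, a minimal sketch of the change could look like the following (namespace, deployment name and values here are placeholders, not the actual ones; in practice this would presumably go through the deployment config rather than being applied imperatively):

# example only: raise the memory request (and limit) for the mediawiki deployment
kubectl -n mediawiki set resources deployment/mediawiki \
  --requests=memory=400Mi --limits=memory=600Mi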

You can see we are consistently using way more memory than we request:

[attachment: image.png (295×539 px, 13 KB): memory usage vs. requested memory]

None of the nodes seem to be reporting these events any longer (possibly because of node restarts).
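For the record, a quick way to double-check this across all nodes, assuming the events haven't already aged out of the cluster's event history, is something like:

# list any OOMKilling warnings still present in the event history
kubectl get events --all-namespaces --field-selector reason=OOMKilling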

We haven't seen any of these errors on the nodes in a while; I'd suggest we close this ticket as resolved or declined.