
Kubernetes workers frequent oom-killer in action
Closed, Invalid (Public)

Description

While looking at an unrelated issue (cron spam), I noticed that the oom-killer is invoked quite frequently on the Kubernetes worker hosts, although not on all of them. Just reporting it in case it's not already known.

$ sudo cumin -x 'A:kubernetes-workers' 'dmesg -T | grep -c "oom-killer"'
IGNORE EXIT CODES mode enabled, all commands executed will be considered successful
12 hosts will be targeted:
kubernetes[2001-2006].codfw.wmnet,kubernetes[1001-1006].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(1) kubernetes2004.codfw.wmnet
----- OUTPUT of 'dmesg -T | grep -c "oom-killer"' -----
40
===== NODE GROUP =====
(1) kubernetes2002.codfw.wmnet
----- OUTPUT of 'dmesg -T | grep -c "oom-killer"' -----
47
===== NODE GROUP =====
(2) kubernetes[2001,2003].codfw.wmnet
----- OUTPUT of 'dmesg -T | grep -c "oom-killer"' -----
43
===== NODE GROUP =====
(4) kubernetes[2005-2006].codfw.wmnet,kubernetes[1005-1006].eqiad.wmnet
----- OUTPUT of 'dmesg -T | grep -c "oom-killer"' -----
0
===== NODE GROUP =====
(2) kubernetes[1001,1003].eqiad.wmnet
----- OUTPUT of 'dmesg -T | grep -c "oom-killer"' -----
71
===== NODE GROUP =====
(1) kubernetes1004.eqiad.wmnet
----- OUTPUT of 'dmesg -T | grep -c "oom-killer"' -----
70
===== NODE GROUP =====
(1) kubernetes1002.eqiad.wmnet
----- OUTPUT of 'dmesg -T | grep -c "oom-killer"' -----
67
================
PASS:  |████████████████████| 100% (12/12) [00:00<00:00, 13.43hosts/s]
FAIL:  |                    |   0% (0/12) [00:00<?, ?hosts/s]
100.0% (12/12) success ratio (>= 100.0% threshold) for command: 'dmesg -T | grep -c "oom-killer"'.
100.0% (12/12) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Nov 3 2019, 3:35 PM
Joe claimed this task.
Joe subscribed.

So:

  • kubernetes{1,2}00{5,6} are specialized nodes that only run kask for sessions, which is why you don't see OOMs there.
  • The OOM killer is invoked not only when the whole system runs out of memory, but also when the processes in a cgroup try to allocate more memory than that cgroup is allowed.

So what is happening there is individual processes exceeding the memory limit of their cgroup.

Resolving as invalid.
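
For anyone who wants to verify the cgroup-versus-system distinction described above, here is a minimal sketch that splits OOM kills in the kernel log into the two kinds. It assumes the kernel prefixes cgroup-limit kills with "Memory cgroup out of memory" and system-wide kills with "Out of memory"; the exact wording can vary between kernel versions, and the function name `classify_oom_kills` is made up for illustration.

```shell
# Classify OOM kills read from kernel log lines on stdin.
# Assumption: cgroup-limit kills are logged with the prefix
# "Memory cgroup out of memory" and system-wide kills with "Out of memory".
# Check the wording on your kernel before relying on the counts.
classify_oom_kills() {
    awk '
        /Memory cgroup out of memory/ { cgroup++; next }   # killed for exceeding a cgroup limit
        /Out of memory/               { global++ }         # killed because the host ran out of memory
        END { printf "cgroup=%d global=%d\n", cgroup, global }
    '
}

# Example usage on a worker (needs permission to read the kernel log):
#   sudo dmesg -T | classify_oom_kills
```

If the explanation above holds, the `global` count on the affected workers should be zero or close to it, with nearly all kills attributed to cgroups.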

As @Joe said, that's expected. It's how misbehaving services are killed in order to recover. Here's a breakdown as well, in case anyone is interested:

$ kubectl get pods --all-namespaces
NAMESPACE             NAME                                        READY   STATUS    RESTARTS   AGE
blubberoid            blubberoid-production-6cb888677-4pldd       1/1     Running   0          32d
blubberoid            blubberoid-production-6cb888677-77w8k       1/1     Running   0          32d
blubberoid            blubberoid-production-6cb888677-7fm6x       1/1     Running   0          32d
blubberoid            blubberoid-production-6cb888677-xls4k       1/1     Running   0          32d
blubberoid            tiller-deploy-5bcd56b799-pxk98              1/1     Running   0          32d
citoid                citoid-production-699db9c4-4bbsx            2/2     Running   66         32d
citoid                citoid-production-699db9c4-jgkw4            2/2     Running   79         32d
citoid                citoid-production-699db9c4-lnfgh            2/2     Running   69         32d
citoid                citoid-production-699db9c4-rtmt8            2/2     Running   79         32d
citoid                citoid-production-699db9c4-sqgxk            2/2     Running   66         32d
citoid                citoid-production-699db9c4-tdmrg            2/2     Running   69         32d
citoid                citoid-production-699db9c4-tnzfq            2/2     Running   67         32d
citoid                citoid-production-699db9c4-tpt7r            2/2     Running   82         32d
citoid                tiller-deploy-7597ffbb8f-8wktl              1/1     Running   0          32d
cxserver              cxserver-production-6cff8b9c8d-4bpnk        2/2     Running   0          19d
cxserver              cxserver-production-6cff8b9c8d-blwx5        2/2     Running   0          19d
cxserver              cxserver-production-6cff8b9c8d-ltz77        2/2     Running   0          19d
cxserver              cxserver-production-6cff8b9c8d-m8p7k        2/2     Running   0          19d
cxserver              cxserver-production-6cff8b9c8d-mtrm2        2/2     Running   0          19d
cxserver              cxserver-production-6cff8b9c8d-nrg9c        2/2     Running   0          19d
cxserver              cxserver-production-6cff8b9c8d-rgm5n        2/2     Running   0          19d
cxserver              cxserver-production-6cff8b9c8d-s7gnx        2/2     Running   0          19d
cxserver              tiller-deploy-5d56bbbd45-g24gk              1/1     Running   0          32d
echostore             kask-production-5c9bddd576-22hjq            1/1     Running   0          5d22h
echostore             kask-production-5c9bddd576-bm75w            1/1     Running   0          5d22h
echostore             kask-production-5c9bddd576-bnj54            1/1     Running   0          5d22h
echostore             kask-production-5c9bddd576-d5xvg            1/1     Running   0          5d22h
echostore             tiller-deploy-97d7ddd67-2kgh4               1/1     Running   0          18d
eventgate-analytics   eventgate-analytics-cd86c8c7d-2br4n         2/2     Running   0          32d
eventgate-analytics   eventgate-analytics-cd86c8c7d-2m449         2/2     Running   0          32d
eventgate-analytics   eventgate-analytics-cd86c8c7d-7795n         2/2     Running   0          32d
eventgate-analytics   eventgate-analytics-cd86c8c7d-9449p         2/2     Running   0          32d
eventgate-analytics   eventgate-analytics-cd86c8c7d-cvtlz         2/2     Running   0          32d
eventgate-analytics   eventgate-analytics-cd86c8c7d-d2d8q         2/2     Running   0          32d
eventgate-analytics   eventgate-analytics-cd86c8c7d-dmvzz         2/2     Running   0          32d
eventgate-analytics   eventgate-analytics-cd86c8c7d-fgbs2         2/2     Running   0          32d
eventgate-analytics   eventgate-analytics-cd86c8c7d-fz7rf         2/2     Running   0          32d
eventgate-analytics   eventgate-analytics-cd86c8c7d-g99vc         2/2     Running   0          32d
eventgate-analytics   eventgate-analytics-cd86c8c7d-j5npz         2/2     Running   0          32d
eventgate-analytics   eventgate-analytics-cd86c8c7d-k9rb8         2/2     Running   0          32d
eventgate-analytics   eventgate-analytics-cd86c8c7d-nfkzq         2/2     Running   0          32d
eventgate-analytics   eventgate-analytics-cd86c8c7d-p4nq5         2/2     Running   0          32d
eventgate-analytics   eventgate-analytics-cd86c8c7d-px75d         2/2     Running   0          32d
eventgate-analytics   eventgate-analytics-cd86c8c7d-tzlkz         2/2     Running   0          32d
eventgate-analytics   eventgate-analytics-cd86c8c7d-v84rz         2/2     Running   0          32d
eventgate-analytics   eventgate-analytics-cd86c8c7d-wgmp9         2/2     Running   0          32d
eventgate-analytics   eventgate-analytics-cd86c8c7d-wlbww         2/2     Running   0          32d
eventgate-analytics   eventgate-analytics-cd86c8c7d-x6kxs         2/2     Running   0          32d
eventgate-analytics   tiller-deploy-6d4bfc855f-tqn97              1/1     Running   0          32d
eventgate-main        eventgate-main-d56c6b764-5s9gk              2/2     Running   8          32d
eventgate-main        eventgate-main-d56c6b764-dvbfj              2/2     Running   9          32d
eventgate-main        eventgate-main-d56c6b764-v74mp              2/2     Running   6          32d
eventgate-main        tiller-deploy-845764c94b-qsswb              1/1     Running   0          32d
graphoid              tiller-deploy-56cc65cbfc-r8rdp              1/1     Running   0          32d
kube-system           calico-policy-controller-59ff766d9f-ksp5s   1/1     Running   0          26d
kube-system           coredns-85dcd67fc4-8xkkl                    1/1     Running   0          32d
kube-system           coredns-85dcd67fc4-mfzll                    1/1     Running   0          32d
kube-system           coredns-85dcd67fc4-mhjg7                    1/1     Running   0          32d
kube-system           coredns-85dcd67fc4-w9khj                    1/1     Running   0          32d
kube-system           tiller-deploy-6d77c9db8d-tgj7h              1/1     Running   0          32d
mathoid               mathoid-production-6b7664c96-2lh2p          2/2     Running   11         32d
mathoid               mathoid-production-6b7664c96-4pqr9          2/2     Running   5          32d
mathoid               mathoid-production-6b7664c96-648vv          2/2     Running   9          32d
mathoid               mathoid-production-6b7664c96-6gpcv          2/2     Running   9          32d
mathoid               mathoid-production-6b7664c96-8rgmn          2/2     Running   8          32d
mathoid               mathoid-production-6b7664c96-9pvm9          2/2     Running   8          32d
mathoid               mathoid-production-6b7664c96-bfltv          2/2     Running   8          32d
mathoid               mathoid-production-6b7664c96-bg57v          2/2     Running   6          32d
mathoid               mathoid-production-6b7664c96-cdgpr          2/2     Running   9          32d
mathoid               mathoid-production-6b7664c96-cfvrp          2/2     Running   5          32d
mathoid               mathoid-production-6b7664c96-fdd6r          2/2     Running   6          32d
mathoid               mathoid-production-6b7664c96-fvh4z          2/2     Running   51         32d
mathoid               mathoid-production-6b7664c96-glb6q          2/2     Running   11         32d
mathoid               mathoid-production-6b7664c96-jh77s          2/2     Running   9          32d
mathoid               mathoid-production-6b7664c96-jj7f9          2/2     Running   10         32d
mathoid               mathoid-production-6b7664c96-jjgqz          2/2     Running   12         32d
mathoid               mathoid-production-6b7664c96-jv6mz          2/2     Running   7          32d
mathoid               mathoid-production-6b7664c96-k2n6n          2/2     Running   5          32d
mathoid               mathoid-production-6b7664c96-kfpmf          2/2     Running   6          32d
mathoid               mathoid-production-6b7664c96-lwrgz          2/2     Running   10         32d
mathoid               mathoid-production-6b7664c96-lzkwv          2/2     Running   7          32d
mathoid               mathoid-production-6b7664c96-pbm8m          2/2     Running   6          32d
mathoid               mathoid-production-6b7664c96-q5fzs          2/2     Running   10         32d
mathoid               mathoid-production-6b7664c96-qj968          2/2     Running   5          32d
mathoid               mathoid-production-6b7664c96-qnb7w          2/2     Running   8          32d
mathoid               mathoid-production-6b7664c96-qsshp          2/2     Running   10         32d
mathoid               mathoid-production-6b7664c96-r77ww          2/2     Running   7          32d
mathoid               mathoid-production-6b7664c96-vbxcp          2/2     Running   12         32d
mathoid               mathoid-production-6b7664c96-xnc8s          2/2     Running   9          32d
mathoid               mathoid-production-6b7664c96-zxjmz          2/2     Running   8          32d
mathoid               tiller-deploy-7468fdbbc-phq5j               1/1     Running   0          32d
restrouter            restrouter-production-86c98bdcd5-2xpvk      2/2     Running   0          21d
restrouter            restrouter-production-86c98bdcd5-7j2zl      2/2     Running   0          21d
restrouter            restrouter-production-86c98bdcd5-7whvj      2/2     Running   0          21d
restrouter            restrouter-production-86c98bdcd5-bv24b      2/2     Running   0          21d
restrouter            restrouter-production-86c98bdcd5-dcjxr      2/2     Running   0          21d
restrouter            restrouter-production-86c98bdcd5-p9j98      2/2     Running   0          21d
restrouter            restrouter-production-86c98bdcd5-vk6t8      2/2     Running   0          21d
restrouter            restrouter-production-86c98bdcd5-x5gwr      2/2     Running   0          21d
restrouter            tiller-deploy-59959f5d-gfkpm                1/1     Running   0          32d
sessionstore          kask-production-54cfd7777b-cjn6f            1/1     Running   0          31d
sessionstore          kask-production-54cfd7777b-rvprv            1/1     Running   0          31d
sessionstore          kask-production-54cfd7777b-s8nqx            1/1     Running   0          31d
sessionstore          kask-production-54cfd7777b-stswt            1/1     Running   0          31d
sessionstore          tiller-deploy-8b4484dfb-8qk89               1/1     Running   0          32d
termbox               termbox-production-7cf45b7fd5-4h7mq         2/2     Running   0          6d2h
termbox               termbox-production-7cf45b7fd5-d485k         2/2     Running   0          6d2h
termbox               termbox-production-7cf45b7fd5-p7d78         2/2     Running   0          6d2h
termbox               termbox-production-7cf45b7fd5-vx96q         2/2     Running   0          6d2h
termbox               tiller-deploy-69ccb7b9b6-kgvhs              1/1     Running   0          32d
wikifeeds             tiller-deploy-d68d5565b-cfmk9               1/1     Running   0          32d
wikifeeds             wikifeeds-production-75495d9fbb-45hdz       2/2     Running   1          25d
wikifeeds             wikifeeds-production-75495d9fbb-c99vc       2/2     Running   0          25d
wikifeeds             wikifeeds-production-75495d9fbb-jbgkp       2/2     Running   1          25d
wikifeeds             wikifeeds-production-75495d9fbb-lk6mg       2/2     Running   0          25d
zotero                tiller-deploy-558dbb554-jdfqc               1/1     Running   0          32d
zotero                zotero-production-5c87557f94-775nz          1/1     Running   4          32d
zotero                zotero-production-5c87557f94-9hnkw          1/1     Running   5          32d
zotero                zotero-production-5c87557f94-9n8k6          1/1     Running   3          22d
zotero                zotero-production-5c87557f94-9xk6f          1/1     Running   1          22d
zotero                zotero-production-5c87557f94-fz9mv          1/1     Running   1          22d
zotero                zotero-production-5c87557f94-h9trd          1/1     Running   4          32d
zotero                zotero-production-5c87557f94-lnqfp          1/1     Running   2          32d
zotero                zotero-production-5c87557f94-nnm7c          1/1     Running   3          32d
zotero                zotero-production-5c87557f94-pg6ps          1/1     Running   3          32d
zotero                zotero-production-5c87557f94-z52qz          1/1     Running   2          32d

It's mostly citoid, zotero, and mathoid. Two of them parse random URLs off the internet, so I'd say that's expected. The third parses arbitrarily large LaTeX formulas, so that's probably expected as well. I'll have a look, though, in case it makes sense to bump the limits a bit.
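
To make the per-service breakdown easier to read, the restart counts from the listing can be summed per namespace. This is a rough sketch, assuming the column layout shown above (header on the first line, RESTARTS in the fifth column); the function name `restarts_by_namespace` is hypothetical.

```shell
# Sum the RESTARTS column per namespace from `kubectl get pods --all-namespaces`
# output on stdin, and print namespaces with at least one restart, highest first.
# Assumption: the first line is the header and RESTARTS is the fifth column.
restarts_by_namespace() {
    awk '
        NR > 1 { restarts[$1] += $5 }                 # skip header, accumulate per namespace
        END {
            for (ns in restarts)
                if (restarts[ns] > 0)
                    print restarts[ns], ns
        }
    ' | sort -rn                                      # largest restart total first
}

# Example usage:
#   kubectl get pods --all-namespaces | restarts_by_namespace
```

Note that a restart count covers every container restart, not only OOM kills. To confirm the cause for a given pod, `kubectl describe pod` should report the last terminated state of the container, with a reason such as `OOMKilled`.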