
toolforge k8s nodes oom?
Closed, ResolvedPublic

Description

Just now we got an alert for tools-worker-1023 being down -- we've gotten at least one other similar alert this week. 1023 appears to be OOMing:

tools-worker-1023 login: [905477.270602] Out of memory: Kill process 4112 (java) score 1237 or sacrifice child
[905477.272482] Killed process 4112 (java) total-vm:4883480kB, anon-rss:2344756kB, file-rss:0kB, shmem-rss:0kB
[1990129.158042] Memory cgroup out of memory: Kill process 11218 (php-cgi) score 1299 or sacrifice child
[1990129.160831] Killed process 11218 (php-cgi) total-vm:1150764kB, anon-rss:693344kB, file-rss:4kB, shmem-rss:1876kB
[4995371.769324] Out of memory: Kill process 30576 (celery) score 1317 or sacrifice child
[4995371.771827] Killed process 30576 (celery) total-vm:5016380kB, anon-rss:2771392kB, file-rss:8kB, shmem-rss:0kB
[5116987.148417] Memory cgroup out of memory: Kill process 18113 (uwsgi) score 1366 or sacrifice child
[5116987.151273] Killed process 18113 (uwsgi) total-vm:986960kB, anon-rss:834172kB, file-rss:48kB, shmem-rss:116kB
[5189164.368251] Memory cgroup out of memory: Kill process 18112 (uwsgi) score 1301 or sacrifice child
[5189164.371252] Killed process 18112 (uwsgi) total-vm:850900kB, anon-rss:698576kB, file-rss:0kB, shmem-rss:116kB
[5189671.025825] Memory cgroup out of memory: Kill process 24763 (uwsgi) score 1270 or sacrifice child
[5189671.028507] Killed process 24763 (uwsgi) total-vm:785488kB, anon-rss:633148kB, file-rss:0kB, shmem-rss:116kB
[5190359.627871] Memory cgroup out of memory: Kill process 18110 (uwsgi) score 1456 or sacrifice child
[5190359.630893] Killed process 18110 (uwsgi) total-vm:1175372kB, anon-rss:1022888kB, file-rss:0kB, shmem-rss:116kB
[5190669.685051] Memory cgroup out of memory: Kill process 18111 (uwsgi) score 1658 or sacrifice child
[5190669.687083] Killed process 18111 (uwsgi) total-vm:1599160kB, anon-rss:1446628kB, file-rss:0kB, shmem-rss:116kB
[5191187.147851] Memory cgroup out of memory: Kill process 26969 (uwsgi) score 1424 or sacrifice child
[5191187.150208] Killed process 26969 (uwsgi) total-vm:1107928kB, anon-rss:955408kB, file-rss:0kB, shmem-rss:116kB
[5191290.996059] Memory cgroup out of memory: Kill process 1232 (uwsgi) score 1424 or sacrifice child
[5191290.998398] Killed process 1232 (uwsgi) total-vm:1107900kB, anon-rss:955360kB, file-rss:0kB, shmem-rss:116kB
[5191537.014654] Memory cgroup out of memory: Kill process 23332 (uwsgi) score 1402 or sacrifice child
[5191537.017966] Killed process 23332 (uwsgi) total-vm:1062684kB, anon-rss:910236kB, file-rss:0kB, shmem-rss:116kB
[5192262.710685] Memory cgroup out of memory: Kill process 2710 (uwsgi) score 1340 or sacrifice child
[5192262.712875] Killed process 2710 (uwsgi) total-vm:933264kB, anon-rss:780760kB, file-rss:0kB, shmem-rss:116kB
[5353522.538364] Memory cgroup out of memory: Kill process 8772 (uwsgi) score 1283 or sacrifice child
[5353522.541042] Killed process 8772 (uwsgi) total-vm:813440kB, anon-rss:661116kB, file-rss:0kB, shmem-rss:116kB
[5468368.203552] Memory cgroup out of memory: Kill process 25369 (uwsgi) score 1318 or sacrifice child
[5468368.208819] Killed process 25369 (uwsgi) total-vm:860256kB, anon-rss:734736kB, file-rss:0kB, shmem-rss:116kB
[5469346.238813] Memory cgroup out of memory: Kill process 25368 (uwsgi) score 1279 or sacrifice child
[5469346.240537] Killed process 25368 (uwsgi) total-vm:776672kB, anon-rss:651916kB, file-rss:0kB, shmem-rss:116kB
[5470066.719801] Memory cgroup out of memory: Kill process 25366 (uwsgi) score 1323 or sacrifice child
[5470066.721877] Killed process 25366 (uwsgi) total-vm:868672kB, anon-rss:744036kB, file-rss:0kB, shmem-rss:116kB
[5472051.101321] Memory cgroup out of memory: Kill process 25367 (uwsgi) score 1326 or sacrifice child
[5472051.104989] Killed process 25367 (uwsgi) total-vm:875904kB, anon-rss:750468kB, file-rss:0kB, shmem-rss:116kB
[5473902.704254] Memory cgroup out of memory: Kill process 14428 (uwsgi) score 1384 or sacrifice child
[5473902.706583] Killed process 14428 (uwsgi) total-vm:997816kB, anon-rss:872264kB, file-rss:0kB, shmem-rss:116kB
[5475353.942165] Memory cgroup out of memory: Kill process 25797 (uwsgi) score 1301 or sacrifice child
[5475353.945162] Killed process 25797 (uwsgi) total-vm:823264kB, anon-rss:698516kB, file-rss:0kB, shmem-rss:116kB
[5476722.932784] Memory cgroup out of memory: Kill process 27674 (uwsgi) score 1283 or sacrifice child
[5476722.934401] Killed process 27674 (uwsgi) total-vm:786780kB, anon-rss:661384kB, file-rss:0kB, shmem-rss:116kB
[5477953.654772] Memory cgroup out of memory: Kill process 3010 (uwsgi) score 1319 or sacrifice child
[5477953.656815] Killed process 3010 (uwsgi) total-vm:860604kB, anon-rss:736084kB, file-rss:8kB, shmem-rss:116kB
[5478950.652820] Memory cgroup out of memory: Kill process 18461 (uwsgi) score 1374 or sacrifice child
[5478950.655208] Killed process 18461 (uwsgi) total-vm:977956kB, anon-rss:852588kB, file-rss:0kB, shmem-rss:116kB
[5480098.827642] Memory cgroup out of memory: Kill process 3185 (uwsgi) score 1318 or sacrifice child
[5480098.829999] Killed process 3185 (uwsgi) total-vm:859896kB, anon-rss:735304kB, file-rss:0kB, shmem-rss:116kB
[5481122.132557] Memory cgroup out of memory: Kill process 19947 (uwsgi) score 1368 or sacrifice child
[5481122.135534] Killed process 19947 (uwsgi) total-vm:964508kB, anon-rss:838948kB, file-rss:0kB, shmem-rss:116kB
[5482208.325671] Memory cgroup out of memory: Kill process 2196 (uwsgi) score 1377 or sacrifice child
[5482208.328012] Killed process 2196 (uwsgi) total-vm:982892kB, anon-rss:857508kB, file-rss:0kB, shmem-rss:116kB
[5483436.984188] Memory cgroup out of memory: Kill process 15500 (uwsgi) score 1281 or sacrifice child
[5483436.985949] Killed process 15500 (uwsgi) total-vm:783300kB, anon-rss:657944kB, file-rss:0kB, shmem-rss:116kB
[5485418.477149] Memory cgroup out of memory: Kill process 8823 (uwsgi) score 1317 or sacrifice child
[5485418.480753] Killed process 8823 (uwsgi) total-vm:857064kB, anon-rss:731816kB, file-rss:0kB, shmem-rss:116kB
[5486444.782359] Memory cgroup out of memory: Kill process 28598 (uwsgi) score 1370 or sacrifice child
[5486444.786469] Killed process 28598 (uwsgi) total-vm:968920kB, anon-rss:843556kB, file-rss:0kB, shmem-rss:116kB
[5487539.561247] Memory cgroup out of memory: Kill process 21880 (uwsgi) score 1279 or sacrifice child
[5487539.565710] Killed process 21880 (uwsgi) total-vm:777676kB, anon-rss:652160kB, file-rss:0kB, shmem-rss:116kB
[5488370.139465] Memory cgroup out of memory: Kill process 4607 (uwsgi) score 1291 or sacrifice child
[5488370.142104] Killed process 4607 (uwsgi) total-vm:803892kB, anon-rss:678496kB, file-rss:0kB, shmem-rss:116kB
[5489495.265551] Memory cgroup out of memory: Kill process 8334 (uwsgi) score 1336 or sacrifice child
[5489495.268701] Killed process 8334 (uwsgi) total-vm:896240kB, anon-rss:771484kB, file-rss:0kB, shmem-rss:116kB
[5490678.400407] Memory cgroup out of memory: Kill process 28136 (uwsgi) score 1379 or sacrifice child
[5490678.404380] Killed process 28136 (uwsgi) total-vm:987484kB, anon-rss:862624kB, file-rss:0kB, shmem-rss:116kB
[5491926.588970] Memory cgroup out of memory: Kill process 21132 (uwsgi) score 1283 or sacrifice child
[5491926.591256] Killed process 21132 (uwsgi) total-vm:786528kB, anon-rss:661048kB, file-rss:0kB, shmem-rss:116kB
[5493973.296139] Memory cgroup out of memory: Kill process 14159 (uwsgi) score 1321 or sacrifice child
[5493973.298852] Killed process 14159 (uwsgi) total-vm:864552kB, anon-rss:739872kB, file-rss:0kB, shmem-rss:116kB
[5495003.469045] Memory cgroup out of memory: Kill process 32075 (uwsgi) score 1373 or sacrifice child
[5495003.471930] Killed process 32075 (uwsgi) total-vm:976056kB, anon-rss:850368kB, file-rss:0kB, shmem-rss:116kB
[5547904.818027] Out of memory: Kill process 20330 (celery) score 1147 or sacrifice child
[5547904.819886] Killed process 20330 (celery) total-vm:1947276kB, anon-rss:1291668kB, file-rss:0kB, shmem-rss:0kB

Event Timeline

I restarted the affected VM and things have recovered for now.

This happened again just now, on tools-worker-1015. I rebooted it.

[8375161.743172] Out of memory: Kill process 31642 (celery) score 1068 or sacrifice child
[8375161.744042] Killed process 31642 (celery) total-vm:1858580kB, anon-rss:606100kB, file-rss:0kB, shmem-rss:0kB
[8497931.048010] Out of memory: Kill process 9468 (java) score 1226 or sacrifice child
[8497931.048890] Killed process 9468 (java) total-vm:4816916kB, anon-rss:2243412kB, file-rss:0kB, shmem-rss:0kB
[8504547.670504] Out of memory: Kill process 32080 (java) score 1223 or sacrifice child
[8504547.671371] Killed process 32080 (java) total-vm:4816916kB, anon-rss:2224636kB, file-rss:0kB, shmem-rss:0kB
[8504809.692894] Out of memory: Kill process 19407 (java) score 1209 or sacrifice child
[8504809.693789] Killed process 19407 (java) total-vm:4463284kB, anon-rss:2095556kB, file-rss:0kB, shmem-rss:0kB
[8505323.937533] Out of memory: Kill process 20900 (java) score 1210 or sacrifice child
[8505323.938409] Killed process 20900 (java) total-vm:4463284kB, anon-rss:2110448kB, file-rss:0kB, shmem-rss:0kB

At least in the last event on worker-1015, I found that the original OOM was caused by prometheus:

Jul 22 00:58:47 tools-worker-1015 kernel: [8497931.047429] prometheus-node invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=0, order=0, oom_score_adj=0

Not a new issue for others: https://github.com/prometheus/prometheus/issues/3470
We may have some tuning to do there. Oops, nope -- I read too quickly, and that issue is actually about the Prometheus server rather than the node exporter (thanks @JHedden for pointing that out).

Just in case, I grabbed a list of the containers that were running on the node at the time. These are the pod-managed containers:

k8s_POD.926d06e1_wd-depicts-1769414028-2h444_wd-depicts_376db57a-ab26-11e9-a95e-fa163ef10014_3d0f136b
k8s_POD.926d06e1_video2commons-745743424-8upt1_video2commons_37618730-ab26-11e9-a95e-fa163ef10014_8da585a1
k8s_POD.926d06e1_refill-api-1903251011-92o9b_refill-api_36d9cbcb-ab26-11e9-a95e-fa163ef10014_04b812d9
k8s_POD.926d06e1_recoin-1502025927-3c24b_recoin_36c6166f-ab26-11e9-a95e-fa163ef10014_38020565
k8s_POD.926d06e1_robin-662443925-ot54s_robin_36f99d74-ab26-11e9-a95e-fa163ef10014_ec236764
k8s_POD.926d06e1_toolschecker-k8s-ws-723003344-0zlxo_toolschecker-k8s-ws_3707ffc8-ab26-11e9-a95e-fa163ef10014_e0cecdd3
k8s_POD.926d06e1_pagepile-3475905308-yxoay_pagepile_36a9620a-ab26-11e9-a95e-fa163ef10014_8d0573c1
k8s_POD.926d06e1_orphantalk-3927252556-ubv4x_orphantalk_3688e430-ab26-11e9-a95e-fa163ef10014_4e3144db
k8s_POD.926d06e1_ordia-1176902252-iz44u_ordia_3683cf5d-ab26-11e9-a95e-fa163ef10014_44cb734a
k8s_POD.926d06e1_openrefine-wikidata-2431134617-mgkbp_openrefine-wikidata_36688d6b-ab26-11e9-a95e-fa163ef10014_85e74df3
k8s_POD.926d06e1_mw2sparql-2122011638-dn4w8_mw2sparql_364b3350-ab26-11e9-a95e-fa163ef10014_bdb9a64f
k8s_POD.926d06e1_mapillary-commons-4080084515-udlus_mapillary-commons_361969a2-ab26-11e9-a95e-fa163ef10014_305433f2
k8s_POD.926d06e1_jury-1093605029-5bj9v_jury_3607b1c1-ab26-11e9-a95e-fa163ef10014_90b0b569
k8s_POD.926d06e1_isprangefinder-3765969594-e8vg3_isprangefinder_3601ff4b-ab26-11e9-a95e-fa163ef10014_d0c35414
k8s_POD.926d06e1_isa-3401979998-3viv5_isa_35ca22b9-ab26-11e9-a95e-fa163ef10014_499e9c28
k8s_POD.926d06e1_ircredirector-559648912-ryd6w_ircredirector_358bb13a-ab26-11e9-a95e-fa163ef10014_26514a94
k8s_POD.926d06e1_freebase-958078454-jcl02_freebase_356a2feb-ab26-11e9-a95e-fa163ef10014_4b2bf87f
k8s_POD.926d06e1_externalitemsuggester-2234067946-yhzwm_externalitemsuggester_35507869-ab26-11e9-a95e-fa163ef10014_956775b4
k8s_POD.926d06e1_citations-2067602609-1rgkc_citations_35470cdf-ab26-11e9-a95e-fa163ef10014_8bb99885
k8s_POD.926d06e1_booster-3017153953-thwow_booster_353a4569-ab26-11e9-a95e-fa163ef10014_975d08a2
k8s_POD.926d06e1_anagrimes-4272036972-6vb2s_anagrimes_351f594e-ab26-11e9-a95e-fa163ef10014_7f727474
k8s_POD.926d06e1_ccm-2892567881-a36g0_ccm_e2084f76-ab25-11e9-a95e-fa163ef10014_6de4cf70
k8s_POD.926d06e1_meta-2464945829-al7a1_meta_ebbcb5f5-ab0a-11e9-a95e-fa163ef10014_358fb7ca

We run swap partitions on the Toolforge Kubernetes cluster, which is a terrible idea. Later versions of Kubernetes will sometimes refuse to run at all with swap enabled (kubeadm will stop you in preflight, at least), and the developers are quite firm that it is not supported. The reason is that swap confuses kubelet pretty badly, so from 1.8 onward Kubernetes won't use swap even if you have it enabled. Our version will, so who knows what it will do?
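
For reference, this is roughly how newer versions handle it -- illustrative only, since our 1.4 cluster predates these checks:

# kubelet >= 1.8 refuses to start while swap is on (--fail-swap-on defaults to true)
sudo swapoff -a
# kubeadm's preflight check for swap can be overridden, but upstream discourages it:
# kubeadm init --ignore-preflight-errors=Swap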

That said, we are also running entirely on default resource limits.

Capacity:
 alpha.kubernetes.io/nvidia-gpu:        0
 cpu:                                   4
 memory:                                8179148Ki
 pods:                                  110
Allocatable:
 alpha.kubernetes.io/nvidia-gpu:        0
 cpu:                                   4
 memory:                                8179148Ki
 pods:                                  110

Given that we have a not-so-great node config (with swap), that we tell kubelet it can use all of the host's RAM (Allocatable above equals Capacity), and that k8s handles swap badly whenever it touches it, it's pretty surprising this doesn't happen every day.

We should probably configure some resource reservations on kubelet to squeeze the pods in under the full RAM of the host system. Since we really need to replace this cluster, I don't want to spend too much time on it, but that sounds like a simple enough fix. Pod eviction is much better than the OOM killer taking out essential daemons. Assuming the node exporter was unfairly accused by the oom-killer (likely), that sounds like a good next step to me.
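
For reference, a minimal sketch of the kubelet settings that would do this on a version recent enough to support them (whether 1.4 does is the open question below, and the values here are only illustrative):

--system-reserved=memory=512Mi          # hold RAM back for system daemons (sshd, node exporter, ...)
--kube-reserved=memory=256Mi            # hold RAM back for kubelet and docker themselves
--eviction-hard=memory.available<500Mi  # evict pods before the kernel OOM killer gets involved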

Bstorm triaged this task as Medium priority.Jul 22 2019, 3:30 PM

Now to find out if version 1.4 supports pod eviction limits and/or reserved memory for the system. Looks like that means digging through github.

Mentioned in SAL (#wikimedia-cloud) [2019-07-22T17:38:34Z] <bstorm_> Adding the prometheus servers to the ferm rules via wikitech hiera for kubelet stats T228573

@JHedden the kubernetes info is now far more likely to be working. All targets are showing as up, at least in prometheus.

That's great! Now that we have that, this query should be helpful for future investigation:

container_memory_usage_bytes{job="k8s-node",instance="tools-worker-1015.tools.eqiad.wmflabs"}
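
To pick out the heaviest containers on a node rather than graphing everything, something like this should also work (same labels as above; the topk count is arbitrary):

topk(10, container_memory_usage_bytes{job="k8s-node",instance="tools-worker-1015.tools.eqiad.wmflabs"})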

Based on this page, kubelet's options in anything before 1.11 are probably pretty limited. https://v1-11.docs.kubernetes.io/docs/reference/command-line-tools-reference/kubelet/

I think we are better off using Prometheus to try to catch something interesting happening, or maybe fiddling with the pod definitions in webservice if it comes to that (roughly sketched below).
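
Something like this is what the webservice-generated container spec would need to grow -- the fields are standard Kubernetes container spec, but the names and values here are only placeholders:

    containers:
    - name: webservice          # hypothetical container name
      resources:
        requests:
          memory: 256Mi         # what the scheduler counts against the node
        limits:
          memory: 512Mi         # past this, the memory cgroup kills the container rather than the whole node suffering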

This just happened again. A celery container for the refill-api tool suddenly started filling RAM like mad right before the OOM.

Mentioned in SAL (#wikimedia-cloud) [2019-07-25T22:01:35Z] <bstorm_> T228573 created tools-worker-1030

Mentioned in SAL (#wikimedia-cloud) [2019-07-26T15:54:59Z] <bstorm_> created tools-worker-1031 - T228573

Mentioned in SAL (#wikimedia-cloud) [2019-07-26T15:57:01Z] <bstorm_> created tools-worker-1032 and 1033 - T228573

Mentioned in SAL (#wikimedia-cloud) [2019-07-26T16:32:21Z] <bstorm_> created tools-worker-1034 - T228573

I've drained a couple of nodes to rebalance things a bit. Overall, I think 5 more isn't enough; these have been running vastly oversubscribed for a bit too long. That said, I know we have cloudvirts slated for rebuilds and there's at least one down for repairs. I'll watch how things operate for a while before deploying many more (or check the capacity more carefully before building out further).

Mentioned in SAL (#wikimedia-cloud) [2019-07-31T15:33:50Z] <bstorm_> T228573 spinning up 5 worker nodes for kubernetes cluster (tools-worker-1035-9)

So workers that were seeing well over 30 pods (after a lot of manual draining and letting things rebalance) are now down to under 27 pods. Each pod is a webservice in this case. That's still too high, so I'm continuing to rebalance them. <20 is saner (perhaps not great, but saner) with current limits.

Now no single node is running more than 50 containers. That is hardly a large victory, but the cluster is more stable like this. We'll see how the overall capacity looks in openstack, and maybe I'll put up another set of 5 nodes tomorrow. @Andrew, please let me know if it is looking problematic at this point.

Just saw this again at 20:21 UTC today.

The OOM events coincided with a celery worker starting up for the refill-api tool again.

That celery worker was in its 3rd day of running and the RAM growth followed a somewhat logarithmic pattern. It was still climbing, though.

Summarizing information passed on by @Bstorm in various chats about this issue:

  • Growth of Kubernetes usage has not been matched by growth of the cluster. After realizing this, she has been adding new worker nodes to provide more room for Pods.
  • The current Kubernetes version running in Toolforge has a fairly naive scheduler that will gladly oversubscribe the RAM and CPU on any given worker node without limit.
  • The current OS for our Kubernetes workers (Debian Jessie) barely supports cgroups, which also makes Kubernetes bad at limiting resources for running Pods.
  • T214513: Deploy and migrate tools to a Kubernetes v1.15 or newer cluster will change things regarding scheduling, overcommit, and runtime limit enforcement.
Bstorm claimed this task.

I think this has stopped for now. Resolving, but it will keep coming back until we have better capacity definitions.