
Pods in evicted state for various namespaces in k8s main
Closed, ResolvedPublic

Description

While investigating a wikifeeds alert with Hugh we discovered this state in codfw:

elukey@kubemaster2001:~$ kubectl get pods -n wikifeeds 
NAME                                    READY   STATUS      RESTARTS   AGE
tiller-677d974595-bmcbw                 1/1     Running     0          151d
wikifeeds-production-577cfd6bd4-6mklk   0/3     Evicted     0          18d
wikifeeds-production-577cfd6bd4-rptj4   0/3     Evicted     0          21d
wikifeeds-production-6db5957576-58m6c   0/3     Evicted     0          23h
wikifeeds-production-6db5957576-7kn2x   0/3     Evicted     0          6h2m
wikifeeds-production-6db5957576-8fr4v   0/3     Evicted     0          23h
wikifeeds-production-6db5957576-bwfvl   3/3     Running     0          23h
wikifeeds-production-6db5957576-bzrrr   3/3     Running     0          20h
wikifeeds-production-6db5957576-d5g6f   0/3     Evicted     0          6h2m
wikifeeds-production-6db5957576-fbbjv   0/3     Evicted     0          23h
wikifeeds-production-6db5957576-fd72r   3/3     Running     0          4d3h
wikifeeds-production-6db5957576-fn56d   0/3     Evicted     0          23h
wikifeeds-production-6db5957576-gbrs9   0/3     Evicted     0          6h2m
wikifeeds-production-6db5957576-hvrmt   3/3     Running     0          6h2m
wikifeeds-production-6db5957576-ndb2z   0/3     Evicted     0          6h2m
wikifeeds-production-6db5957576-pnpbh   0/3     Evicted     0          23h
wikifeeds-production-6db5957576-q5vwr   0/3     Evicted     0          23h
wikifeeds-production-6db5957576-qc22x   0/3     Evicted     0          4d3h
wikifeeds-production-6db5957576-qwfqq   0/3     Evicted     0          6h2m
wikifeeds-production-6db5957576-r87sn   0/3     Evicted     0          6h2m
wikifeeds-production-6db5957576-t7lwj   3/3     Running     0          4d3h
wikifeeds-production-6db5957576-tzc2b   0/3     Evicted     0          4d3h
wikifeeds-production-6db5957576-wlhs5   0/3     Evicted     0          4d3h
wikifeeds-production-6db5957576-z7mnk   3/3     Running     0          4d3h
wikifeeds-production-service-checker    0/1     Completed   0          139d
elukey@kubemaster2001:~$ kubectl describe pod wikifeeds-production-577cfd6bd4-6mklk -n wikifeeds | grep -A 1 -i evict
Reason:         Evicted
Message:        The node was low on resource: ephemeral-storage.

Not all namespaces are like this one, but I checked api-gateway and mobileapps and they show the same issue.

Is it normal that pods are in this state? If not, let's investigate and then add an alarm :)

Event Timeline

elukey triaged this task as High priority.

My running theory on this is that shellbox is currently generating a lot of logs (dozens of lines per second) - the file is 12GB on kubernetes2017 at the moment - but it could easily be other services as well

Some evictions actually happened today

  • 11:53:03Z killing 6 of 6 replicas (no idea why the scheduler placed all of them on the same node)
  • 15:24Z Alert was triggered
  • 18:27:12Z killing 4 of 6 replicas
# kubectl -n wikifeeds get po --field-selector=status.phase=Failed -o custom-columns="NAME:.metadata.name,STATUS:.status.reason,TIME:.status.startTime,MSG:.status.message,NODE:.spec.nodeName" | sort -k2
NAME                                    STATUS    TIME                   MSG                                                 NODE
wikifeeds-production-577cfd6bd4-rptj4   Evicted   2021-08-15T22:12:41Z   The node was low on resource: ephemeral-storage.    kubernetes2001.codfw.wmnet
wikifeeds-production-577cfd6bd4-6mklk   Evicted   2021-08-19T08:48:09Z   The node was low on resource: ephemeral-storage.    kubernetes2011.codfw.wmnet
wikifeeds-production-6db5957576-fd72r   Evicted   2021-09-02T14:36:02Z   The node was low on resource: ephemeral-storage.    kubernetes2004.codfw.wmnet
wikifeeds-production-6db5957576-qc22x   Evicted   2021-09-02T14:36:02Z   The node was low on resource: ephemeral-storage.    kubernetes2011.codfw.wmnet
wikifeeds-production-6db5957576-wlhs5   Evicted   2021-09-02T14:36:09Z   The node was low on resource: ephemeral-storage.    kubernetes2010.codfw.wmnet
wikifeeds-production-6db5957576-tzc2b   Evicted   2021-09-02T14:36:21Z   The node was low on resource: ephemeral-storage.    kubernetes2017.codfw.wmnet
wikifeeds-production-6db5957576-8fr4v   Evicted   2021-09-05T18:27:15Z   Pod The node had condition: [DiskPressure].         kubernetes2017.codfw.wmnet
wikifeeds-production-6db5957576-pnpbh   Evicted   2021-09-05T18:27:15Z   Pod The node had condition: [DiskPressure].         kubernetes2017.codfw.wmnet
wikifeeds-production-6db5957576-58m6c   Evicted   2021-09-05T18:27:16Z   Pod The node had condition: [DiskPressure].         kubernetes2017.codfw.wmnet
wikifeeds-production-6db5957576-fn56d   Evicted   2021-09-05T18:27:17Z   Pod The node had condition: [DiskPressure].         kubernetes2017.codfw.wmnet
wikifeeds-production-6db5957576-fbbjv   Evicted   2021-09-05T18:27:19Z   Pod The node had condition: [DiskPressure].         kubernetes2017.codfw.wmnet
wikifeeds-production-6db5957576-q5vwr   Evicted   2021-09-05T18:27:21Z   Pod The node had condition: [DiskPressure].         kubernetes2017.codfw.wmnet
wikifeeds-production-6db5957576-d5g6f   Evicted   2021-09-06T11:53:03Z   Pod The node had condition: [DiskPressure].         kubernetes2011.codfw.wmnet
wikifeeds-production-6db5957576-r87sn   Evicted   2021-09-06T11:53:03Z   Pod The node had condition: [DiskPressure].         kubernetes2011.codfw.wmnet
wikifeeds-production-6db5957576-ndb2z   Evicted   2021-09-06T11:53:04Z   Pod The node had condition: [DiskPressure].         kubernetes2011.codfw.wmnet
wikifeeds-production-6db5957576-gbrs9   Evicted   2021-09-06T11:53:06Z   Pod The node had condition: [DiskPressure].         kubernetes2011.codfw.wmnet
wikifeeds-production-6db5957576-qwfqq   Evicted   2021-09-06T11:53:08Z   Pod The node had condition: [DiskPressure].         kubernetes2011.codfw.wmnet
wikifeeds-production-6db5957576-7kn2x   Evicted   2021-09-06T11:53:09Z   Pod The node had condition: [DiskPressure].         kubernetes2011.codfw.wmnet
wikifeeds-production-6db5957576-87qxx   Evicted   2021-09-06T18:27:12Z   Pod The node had condition: [DiskPressure].         kubernetes2004.codfw.wmnet
wikifeeds-production-6db5957576-zj697   Evicted   2021-09-06T18:27:12Z   Pod The node had condition: [DiskPressure].         kubernetes2004.codfw.wmnet
wikifeeds-production-6db5957576-h6gcw   Evicted   2021-09-06T18:27:13Z   Pod The node had condition: [DiskPressure].         kubernetes2004.codfw.wmnet
wikifeeds-production-6db5957576-d8ntb   Evicted   2021-09-06T18:27:15Z   Pod The node had condition: [DiskPressure].         kubernetes2004.codfw.wmnet
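Lingering Evicted entries like the ones above can be cleared manually if they become noise. A sketch, assuming the wikifeeds namespace (Evicted pods sit in the Failed phase, which is what the field selector above already matches, so Running replicas are untouched):

```shell
# Delete all Failed (e.g. Evicted) pods in the namespace; running
# replicas are not matched since they are in the Running phase.
kubectl -n wikifeeds delete pods --field-selector=status.phase=Failed
```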

For what it's worth, evictions are not a bad thing per se in kubernetes. They can happen for a variety of reasons, notably:

  • DiskPressure -- Usable disk is running out on the node
  • MemoryPressure -- Memory is running out on the node
  • PIDPressure -- The node is running so many processes that it is running out of PIDs.
  • Hopefully at some point NetworkUnavailable -- Calico is dead and pods won't have proper network connectivity.

In all of these cases and before things become critical, k8s will start evicting pods from the node in an effort to alleviate the situation. The number of pods that get evicted depends on how quickly the crossed threshold drops back under its limit (that is to say, the node will probably not be emptied). After the pressure is removed, the node becomes Ready again.
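The thresholds involved are kubelet settings. A sketch of how to check a node's current condition, plus the upstream defaults for the hard eviction signals (shown in flag form for reference; the values our clusters actually run may differ):

```shell
# Check whether a node currently reports DiskPressure (or any other
# pressure condition):
kubectl describe node kubernetes2017.codfw.wmnet | grep -A 8 'Conditions:'

# Upstream kubelet defaults for hard eviction thresholds; crossing any
# of these triggers the evictions described above:
kubelet --eviction-hard=memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%,imagefs.available<15%
```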

So, the TL;DR is that it's usually safe to ignore Evictions when debugging something.

That being said, T290444 has some more information as to the current cause and we probably want to have a better look.

akosiaris claimed this task.

Per the above the answer to

Is it normal that pods are in this state? If not, let's investigate and then add an alarm :)

is "Mostly yes, it's part of the self-healing mechanisms built into kubernetes", with the nuance being that they are a possible indication of an underlying issue that might warrant looking into. But definitely not alert worthy.

I'll resolve this but feel free to reopen.

Fine for me, what I had in mind was an alert if a namespace showed evicted pods for too much time (say days), since it seemed like something that could be missed. Ok to close :)
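If such a check were ever wanted, a crude version could be built on the same field selector used above. A sketch only (the "for days" part would have to live in the monitoring system, not in the one-shot command):

```shell
# Count Evicted pods per namespace; a check could alert when a
# namespace stays non-zero for too long.
kubectl get pods --all-namespaces --field-selector=status.phase=Failed \
    -o custom-columns="NS:.metadata.namespace,REASON:.status.reason" \
  | awk '$2 == "Evicted" { count[$1]++ } END { for (ns in count) print ns, count[ns] }'
```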


FWIW they can be around for a long time before kube-controller-manager GCs them. The default threshold is 12500 terminated pods[1] (evicted pods cost almost nothing btw - just some data in etcd).

[1] https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/#kube-controller-manager
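For reference, the garbage collection mentioned above is controlled by a kube-controller-manager flag; shown here in isolation with its upstream default (a real invocation obviously carries many more flags):

```shell
# kube-controller-manager garbage-collects terminated (Failed/Succeeded)
# pods once their total exceeds this threshold; 12500 is the default.
kube-controller-manager --terminated-pod-gc-threshold=12500
```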