
2024-04-03 calico/typha down
Open, Needs TriagePublic

Description

General description: For approximately 20 minutes, from 13:17 to 13:37 UTC, uncached requests and edits could not be served from the eqiad data center, mainly affecting page edits and read requests in the eastern Americas, Europe, Africa and Asia.
The networking part of the eqiad wikikube control plane (calico) hit its resource limits and was OOM-killed, which increased load on the remaining calico-typha pods. After multiple automated restarts, all calico pods became healthy again.

Status page incident: https://www.wikimediastatus.net/incidents/7qq1gwnw71jy

Tracking task for 2024-04-03 calico/typha down

Google doc: https://docs.google.com/document/d/1_cgPVSajxMKcN66tCl8ldB38Utl4RORUU3u_SsKXcsI/edit

Event Timeline

jcrespo updated the task description.
fgiunchedi changed the visibility from "Custom Policy" to "All Users".
fgiunchedi changed the edit policy from "Custom Policy" to "All Users".

Does this need to be private?

Not really; it should be public now. If it is not, feel free to change it at will to make it public.

taavi changed the visibility from "All Users" to "Public (No Login Required)". Apr 4 2024, 8:18 AM

Change #1017316 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Bump calico-kube-controllers memory limit

https://gerrit.wikimedia.org/r/1017316

Change #1017316 merged by jenkins-bot:

[operations/deployment-charts@master] Bump calico-kube-controllers memory limit

https://gerrit.wikimedia.org/r/1017316

The investigation from last Friday showed the first failed probe for calico-typha-75d4649699-h7vgq was recorded at 13:10:31, which was probably a consequence of the process not being able to allocate additional memory (Apr 3 13:10:28 kubernetes1022 kernel: [18672041.401461] calico-typha invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=99). That typha instance had ~230 clients connected at the time (out of ~636), which must then have tried to re-establish connections with one of the remaining two typhas, pushing those over the memory limit threshold as well.
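To make the cascade effect concrete, here is a back-of-the-envelope sketch (Go, using the approximate client counts from above; purely illustrative):

package main

import "fmt"

func main() {
	// Rough numbers from the investigation above; approximate only.
	const clientsTotal = 636 // total typha clients in the cluster at the time
	const typhas = 3         // calico-typha replicas

	perTypha := clientsTotal / typhas           // ~212 clients each when balanced
	afterOneLost := clientsTotal / (typhas - 1) // ~318 each once one typha is OOM-killed

	fmt.Printf("per typha before: ~%d, after losing one: ~%d (+%d clients each)\n",
		perTypha, afterOneLost, afterOneLost-perTypha)
}

So each surviving instance picks up roughly 100 extra clients, i.e. about 50% more load, which fits the observation that they then hit their memory limits as well.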
After updating the Grafana dashboard a bit, it was clear that the client distribution among the three typha instances is naturally pretty uneven, and so is the memory usage. As the most loaded instance already reaches ~500MiB (out of its 600MiB limit), I will increase the limit to 1GiB per instance.

I will also bump the memory requests of typha and kube-controllers to match the limits. This should prevent the situation where one of those components needs to invoke the OOM killer to get hold of additional memory (memory above the requested amount), which I would assume can take too much time under load.
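For illustration, this is roughly what the intended typha container resources look like when expressed with the Kubernetes API types (a sketch only; the actual values live in the Helm charts in operations/deployment-charts): setting the memory request equal to the limit is what "make calico memory guaranteed" in the change below refers to.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Sketch of the intended calico-typha container resources: memory limit
	// raised to 1Gi and the request set equal to the limit, so the pod never
	// needs to obtain memory beyond what was accounted for at scheduling time.
	typhaResources := corev1.ResourceRequirements{
		Limits: corev1.ResourceList{
			corev1.ResourceMemory: resource.MustParse("1Gi"),
		},
		Requests: corev1.ResourceList{
			corev1.ResourceMemory: resource.MustParse("1Gi"),
		},
	}
	fmt.Printf("request=%s limit=%s\n",
		typhaResources.Requests.Memory(), typhaResources.Limits.Memory())
}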

Change #1017777 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] calico: Bump typha memory, make calico memory guaranteed

https://gerrit.wikimedia.org/r/1017777

Change #1017777 merged by jenkins-bot:

[operations/deployment-charts@master] calico: Bump typha memory, make calico memory guaranteed

https://gerrit.wikimedia.org/r/1017777

Regarding the number of connections per typha: there is logic in typha that balances the connections across instances by disconnecting the ones exceeding a dynamic threshold, which is calculated from the number of typhas and the number of nodes (see https://github.com/projectcalico/calico/blob/v3.23.3/typha/pkg/k8s/rebalance.go#L35).

Apr 8, 2024 @ 09:18:12.085 calico-typha-79c8968977-k8cm7 kubernetes1048 2024-04-08 09:17:46.583 [INFO][7] rebalance.go 77: Calculated new connection limit. newLimit=420 numNodes=174 numSyncerTypes=4 numTyphas=3 reason="fraction+20%" thread="k8s-poll"
Apr 8, 2024 @ 09:17:58.884 calico-typha-79c8968977-gcl74 mw1384 2024-04-08 09:17:58.184 [INFO][7] rebalance.go 77: Calculated new connection limit. newLimit=420 numNodes=174 numSyncerTypes=4 numTyphas=3 reason="fraction+20%" thread="k8s-poll"
Apr 8, 2024 @ 09:17:58.233 calico-typha-79c8968977-tw8fm mw1454 2024-04-08 09:17:52.492 [INFO][7] rebalance.go 77: Calculated new connection limit. newLimit=420 numNodes=174 numSyncerTypes=4 numTyphas=3 reason="fraction+20%" thread="k8s-poll"

Apr 8, 2024 @ 09:17:42.610 calico-typha-79c8968977-wk9fk mw2380 2024-04-08 09:17:09.989 [INFO][7] rebalance.go 77: Calculated new connection limit. newLimit=404 numNodes=167 numSyncerTypes=4 numTyphas=3 reason="fraction+20%" thread="k8s-poll"
Apr 8, 2024 @ 09:17:33.028 calico-typha-79c8968977-m5pbn mw2442 2024-04-08 09:17:04.129 [INFO][7] rebalance.go 77: Calculated new connection limit. newLimit=404 numNodes=167 numSyncerTypes=4 numTyphas=3 reason="fraction+20%" thread="k8s-poll"
Apr 8, 2024 @ 09:17:21.892 calico-typha-79c8968977-nql7x kubernetes2056 2024-04-08 09:16:59.310 [INFO][7] rebalance.go 77: Calculated new connection limit. newLimit=404 numNodes=167 numSyncerTypes=4 numTyphas=3 reason="fraction+20%" thread="k8s-poll"
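Relating the log fields above to the calculation: the following sketch reproduces the logged newLimit values if one assumes the "fraction+20%" limit spreads each node's syncer connections over all but one typha instance and adds 20% headroom. This is a reconstruction from the logged numbers, not a copy of the upstream code; the authoritative version is in the linked rebalance.go.

package main

import (
	"fmt"
	"math"
)

// connLimit is a paraphrase of the "fraction+20%" limit: per syncer type,
// spread the nodes over all but one typha (headroom for losing an instance),
// add 20%, then multiply by the number of syncer types.
func connLimit(numNodes, numTyphas, numSyncerTypes int) int {
	perSyncer := int(math.Ceil(float64(numNodes) * 1.2 / float64(numTyphas-1)))
	return perSyncer * numSyncerTypes
}

func main() {
	fmt.Println(connLimit(174, 3, 4)) // 420, matching the first group of log lines
	fmt.Println(connLimit(167, 3, 4)) // 404, matching the second group
}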

Is there more to do when it comes to this incident, or could this ticket be closed?