
2024-04-03 calico/typha down
Open, Needs TriagePublic

Description

General description: For approximately 20 minutes, from 13:17 to 13:37 UTC, uncached requests and edits could not be served from the eqiad data center, mainly affecting page edits and read requests in the eastern Americas, Europe, Africa and Asia.
The networking part of the eqiad wikikube control plane (calico) hit its resource limits and was OOM-killed, which increased load on the remaining calico-typha pods. After multiple automated restarts, all calico pods became healthy again.

Status page incident: https://www.wikimediastatus.net/incidents/7qq1gwnw71jy

Tracking task for 2024-04-03 calico/typha down

Google doc: https://docs.google.com/document/d/1_cgPVSajxMKcN66tCl8ldB38Utl4RORUU3u_SsKXcsI/edit

Event Timeline

jcrespo updated the task description.
fgiunchedi changed the visibility from "Custom Policy" to "All Users".
fgiunchedi changed the edit policy from "Custom Policy" to "All Users".

Does this need to be private?

Not really; it should be public now. If it is not, feel free to change it at will to make it public.

taavi changed the visibility from "All Users" to "Public (No Login Required)". Apr 4 2024, 8:18 AM

Change #1017316 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Bump calico-kube-controllers memory limit

https://gerrit.wikimedia.org/r/1017316

Change #1017316 merged by jenkins-bot:

[operations/deployment-charts@master] Bump calico-kube-controllers memory limit

https://gerrit.wikimedia.org/r/1017316

The investigation from last Friday showed the first failed probe for calico-typha-75d4649699-h7vgq was recorded at 13:10:31, which was probably a consequence of the process not being able to allocate additional memory (Apr 3 13:10:28 kubernetes1022 kernel: [18672041.401461] calico-typha invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=99). That typha instance had ~230 clients connected at the time (out of ~636), which must then have tried to re-establish connections with one of the remaining two typhas, pushing those over the memory limit threshold as well.
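To make the cascade effect concrete, here is a back-of-the-envelope sketch (Go, using the approximate client counts from above; purely illustrative):

package main

import "fmt"

func main() {
	// Rough numbers from the investigation above; approximate only.
	const clientsTotal = 636 // total typha clients in the cluster at the time
	const typhas = 3         // calico-typha replicas

	perTypha := clientsTotal / typhas           // ~212 clients each when balanced
	afterOneLost := clientsTotal / (typhas - 1) // ~318 each once one typha is OOM-killed

	fmt.Printf("per typha before: ~%d, after losing one: ~%d (+%d clients each)\n",
		perTypha, afterOneLost, afterOneLost-perTypha)
}

So each surviving instance picks up roughly 100 extra clients, i.e. about 50% more load, which fits the observation that they then hit their memory limits as well.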
After updating the Grafana dashboard a bit, it was clear that the client distribution among the three typha instances is naturally pretty uneven, and so is the memory usage. As the most loaded instance already reaches ~500MiB (out of its 600MiB limit), I will increase the limit to 1GiB per instance.

I will also bump the memory requests of typha and kube-controllers to match the limits. This should prevent the situation where one of those components needs to invoke the OOM killer to get hold of additional memory (memory above the requested amount), which I would assume can take too much time under load.
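For illustration, this is roughly what the intended typha container resources look like when expressed with the Kubernetes API types (a sketch only; the actual values live in the Helm charts in operations/deployment-charts): setting the memory request equal to the limit is what "make calico memory guaranteed" in the change below refers to.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Sketch of the intended calico-typha container resources: memory limit
	// raised to 1Gi and the request set equal to the limit, so the pod never
	// needs to obtain memory beyond what was accounted for at scheduling time.
	typhaResources := corev1.ResourceRequirements{
		Limits: corev1.ResourceList{
			corev1.ResourceMemory: resource.MustParse("1Gi"),
		},
		Requests: corev1.ResourceList{
			corev1.ResourceMemory: resource.MustParse("1Gi"),
		},
	}
	fmt.Printf("request=%s limit=%s\n",
		typhaResources.Requests.Memory(), typhaResources.Limits.Memory())
}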

Change #1017777 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] calico: Bump typha memory, make calico memory guaranteed

https://gerrit.wikimedia.org/r/1017777

Change #1017777 merged by jenkins-bot:

[operations/deployment-charts@master] calico: Bump typha memory, make calico memory guaranteed

https://gerrit.wikimedia.org/r/1017777

Regarding the number of connections per typha: there is logic in typha that balances the connections across instances by disconnecting the ones exceeding a dynamic threshold, which is calculated from the number of typhas and the number of nodes (see https://github.com/projectcalico/calico/blob/v3.23.3/typha/pkg/k8s/rebalance.go#L35).

Apr 8, 2024 @ 09:18:12.085 calico-typha-79c8968977-k8cm7 kubernetes1048 2024-04-08 09:17:46.583 [INFO][7] rebalance.go 77: Calculated new connection limit. newLimit=420 numNodes=174 numSyncerTypes=4 numTyphas=3 reason="fraction+20%" thread="k8s-poll"
Apr 8, 2024 @ 09:17:58.884 calico-typha-79c8968977-gcl74 mw1384 2024-04-08 09:17:58.184 [INFO][7] rebalance.go 77: Calculated new connection limit. newLimit=420 numNodes=174 numSyncerTypes=4 numTyphas=3 reason="fraction+20%" thread="k8s-poll"
Apr 8, 2024 @ 09:17:58.233 calico-typha-79c8968977-tw8fm mw1454 2024-04-08 09:17:52.492 [INFO][7] rebalance.go 77: Calculated new connection limit. newLimit=420 numNodes=174 numSyncerTypes=4 numTyphas=3 reason="fraction+20%" thread="k8s-poll"

Apr 8, 2024 @ 09:17:42.610 calico-typha-79c8968977-wk9fk mw2380 2024-04-08 09:17:09.989 [INFO][7] rebalance.go 77: Calculated new connection limit. newLimit=404 numNodes=167 numSyncerTypes=4 numTyphas=3 reason="fraction+20%" thread="k8s-poll"
Apr 8, 2024 @ 09:17:33.028 calico-typha-79c8968977-m5pbn mw2442 2024-04-08 09:17:04.129 [INFO][7] rebalance.go 77: Calculated new connection limit. newLimit=404 numNodes=167 numSyncerTypes=4 numTyphas=3 reason="fraction+20%" thread="k8s-poll"
Apr 8, 2024 @ 09:17:21.892 calico-typha-79c8968977-nql7x kubernetes2056 2024-04-08 09:16:59.310 [INFO][7] rebalance.go 77: Calculated new connection limit. newLimit=404 numNodes=167 numSyncerTypes=4 numTyphas=3 reason="fraction+20%" thread="k8s-poll"
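Relating the log fields above to the calculation: the following sketch reproduces the logged newLimit values if one assumes the "fraction+20%" limit spreads each node's syncer connections over all but one typha instance and adds 20% headroom. This is a reconstruction from the logged numbers, not a copy of the upstream code; the authoritative version is in the linked rebalance.go.

package main

import (
	"fmt"
	"math"
)

// connLimit is a paraphrase of the "fraction+20%" limit: per syncer type,
// spread the nodes over all but one typha (headroom for losing an instance),
// add 20%, then multiply by the number of syncer types.
func connLimit(numNodes, numTyphas, numSyncerTypes int) int {
	perSyncer := int(math.Ceil(float64(numNodes) * 1.2 / float64(numTyphas-1)))
	return perSyncer * numSyncerTypes
}

func main() {
	fmt.Println(connLimit(174, 3, 4)) // 420, matching the first group of log lines
	fmt.Println(connLimit(167, 3, 4)) // 404, matching the second group
}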

Is there more to do when it comes to this incident, or could this ticket be closed?