
Issues deploying calico to ml-staging-codfw and aux-k8s-eqiad
Closed, Resolved (Public)

Description

Deploying a calico update (T306649) was troublesome and needed manual intervention on ml-staging-codfw and aux-k8s-eqiad.

calico-node was deploying fine, but calico-typha had issues being scheduled:

ml-staging:

0/4 nodes are available: 2 node(s) didn't have free ports for the requested pod ports, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

aux:

0/4 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 1 node(s) didn't match pod anti-affinity rules, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

The wikikube-staging clusters also run only 2 workers and had no issues, so there is probably some difference in configuration.
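For context, the scheduler messages are consistent with calico-typha requesting a host port (so at most one typha pod can run per node) and not tolerating the control-plane taint. A minimal sketch of the relevant Deployment fragments, with illustrative values rather than the actual chart contents:

# Sketch only: illustrative calico-typha Deployment fragments, not copied
# from operations/deployment-charts.
spec:
  template:
    spec:
      containers:
        - name: calico-typha
          ports:
            - name: calico-typha
              containerPort: 5473
              hostPort: 5473  # host port: at most one typha pod per node
                              # ("didn't have free ports for the requested pod ports")
      # No toleration for the node-role.kubernetes.io/master taint, so the two
      # control-plane nodes are excluded as well, leaving only the two workers
      # as scheduling candidates.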

Event Timeline

JMeybohm created this task.
JMeybohm renamed this task from "Issues deploying calico to ml-staging-codfw and aux" to "Issues deploying calico to ml-staging-codfw and aux-k8s-eqiad". Mar 28 2023, 8:36 AM

Change 903616 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: lower the typha pods to 1 in ml-staging-codfw

https://gerrit.wikimedia.org/r/903616

Change 903616 merged by Elukey:

[operations/deployment-charts@master] admin_ng: lower the typha pods to 1 in ml-staging-codfw

https://gerrit.wikimedia.org/r/903616

Just deployed calico on ml-staging-codfw with typha replica count 1 and it worked nicely.
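For illustration, the change above amounts to an admin_ng values override for ml-staging-codfw along these lines (the key names are assumptions; the actual structure in deployment-charts may differ):

# Hypothetical values override for ml-staging-codfw; key names are
# illustrative, not the real deployment-charts layout.
calico:
  typha:
    replicas: 1  # with only two schedulable workers, a rolling update of two
                 # replicas finds no node with a free host port to surge onto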

Change 948994 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] aux: set calico typha to one replica

https://gerrit.wikimedia.org/r/948994

Change 948994 merged by Filippo Giunchedi:

[operations/deployment-charts@master] aux: set calico typha to one replica

https://gerrit.wikimedia.org/r/948994

Change 949002 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Remove podAntiAffinity for calico-typha on aux

https://gerrit.wikimedia.org/r/949002

Change 948597 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Revert "aux: set calico typha to one replica"

https://gerrit.wikimedia.org/r/948597

Change 949002 merged by Filippo Giunchedi:

[operations/deployment-charts@master] Remove podAntiAffinity for calico-typha on aux

https://gerrit.wikimedia.org/r/949002

Change 948597 merged by Filippo Giunchedi:

[operations/deployment-charts@master] Revert "aux: set calico typha to one replica"

https://gerrit.wikimedia.org/r/948597

In aux the calico deployment failed because the cluster is not row redundant and typha has a pod anti-affinity rule that prevents it from being scheduled twice in the same zone (row). To get out of that situation we had to remove the anti-affinity and, during deployment, delete the old pod as well as its ReplicaSet (kubectl -n kube-system delete rs/calico-typha-6658db8b77 pod/calico-typha-6658db8b77-kqpx7), because the already deployed anti-affinity rule still prevented typha from being scheduled.
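For reference, the rule that was removed is a required pod anti-affinity of roughly this shape (a sketch; the label selector and topology key are assumptions, not the exact chart contents):

# Sketch of a zone/row-scoped podAntiAffinity for calico-typha; label and
# topologyKey values are illustrative.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            k8s-app: calico-typha
        topologyKey: topology.kubernetes.io/zone  # one typha per zone (row);
                                                  # aux workers share a single row

Required anti-affinity is also enforced against already-running pods, which is why the existing typha pod and its ReplicaSet had to be deleted before the new pod could be scheduled.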

JMeybohm claimed this task.