
Investigate whether running >1 replicas of calico-typha is feasible and prudent
Open, Low, Public

Description

From 2021-09-29_eqiad-kubernetes#Actionables

We should investigate whether it is feasible and prudent to increase the availability of Typha by increasing the number of replicas. We started with 1 replica to keep the complexity of this new component low, but we now have enough experience (and an outage!) to warrant investigating this further.

Event Timeline

akosiaris triaged this task as Medium priority. Sep 29 2021, 2:57 PM

Docs say:

We recommend at least one replica for every 200 nodes, and no more than 20 replicas. In production, we recommend a minimum of three replicas to reduce the impact of rolling upgrades and failures. The number of replicas should always be less than the number of nodes, otherwise rolling upgrades will stall. In addition, Typha only helps with scale if there are fewer Typha instances than there are nodes.

So, 3 replicas it is!
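
The sizing guidance quoted above can be sketched as a small calculation. This is only an illustration of the quoted rules, not anything shipped with Calico: the function name `recommended_typha_replicas` and the `production` flag are hypothetical, and keeping 1 replica on a single-node cluster is an assumption the docs do not cover.

```python
import math


def recommended_typha_replicas(node_count: int, production: bool = True) -> int:
    """Suggest a Typha replica count from the quoted sizing guidance.

    Hypothetical helper for illustration only; not part of Calico.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")

    # At least one replica for every 200 nodes.
    replicas = math.ceil(node_count / 200)

    # In production, a minimum of three replicas.
    if production:
        replicas = max(replicas, 3)

    # No more than 20 replicas.
    replicas = min(replicas, 20)

    # Always fewer replicas than nodes, otherwise rolling upgrades stall.
    # Assumption: a single-node cluster still gets 1 replica.
    replicas = min(replicas, max(node_count - 1, 1))

    return replicas
```

For a 4-node cluster this yields 3, and for a 2-node cluster it falls back to 1, because the "fewer replicas than nodes" rule takes precedence over the production minimum.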

Change 724957 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] Calico: Increase replicaCount for typha

https://gerrit.wikimedia.org/r/724957

Change 724957 merged by Alexandros Kosiaris:

[operations/deployment-charts@master] Calico: Increase replicaCount for typha

https://gerrit.wikimedia.org/r/724957

akosiaris lowered the priority of this task from Medium to Low. Sep 30 2021, 1:41 PM
akosiaris added subscribers: klausman, elukey.

The services/eqiad and services/codfw clusters are now running 3 instances of calico typha. staging/eqiad and staging/codfw are running 1 instance each, as we only have 2 nodes there and the docs say that it is actively harmful if #typha_instances >= #nodes.

Adding @elukey and @klausman as they might want to bump the number of typha instances on the ml-serve clusters (they have 4 nodes, IIRC).

As far as the actionable from the aforementioned incident is concerned, this is done.

Change 725289 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] helmfile.d: increase Typha's replicas for ml-serve clusters

https://gerrit.wikimedia.org/r/725289

Change 725289 merged by Elukey:

[operations/deployment-charts@master] helmfile.d: increase Typha's replicas for ml-serve clusters

https://gerrit.wikimedia.org/r/725289