We should investigate whether it is feasible and prudent to increase the availability of Typha by increasing the number of replicas. We started with 1 to keep the complexity of this new component low, but now we have enough experience (and an outage!) to warrant investigating this further.
We recommend at least one replica for every 200 nodes, and no more than 20 replicas. In production, we recommend a minimum of three replicas to reduce the impact of rolling upgrades and failures. The number of replicas should always be less than the number of nodes, otherwise rolling upgrades will stall. In addition, Typha only helps with scale if there are fewer Typha instances than there are nodes.
So, 3 replicas it is!
The services/eqiad and services/codfw clusters are now running 3 instances of calico typha. staging/eqiad and staging/codfw are running 1 instance each, as we only have 2 nodes there and the docs say that it will be actively harmful if #typha_instances >= #nodes.
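For future cluster resizes, the sizing guidance quoted above can be sketched as a small helper. This is only a rough reading of that guidance, not anything shipped by Calico or used in our deployment charts: the function name, the `production` flag, and the order in which the limits are applied are assumptions made for illustration.

```python
import math


def recommended_typha_replicas(node_count: int, production: bool = True) -> int:
    """Rough replica count derived from the quoted Typha sizing guidance.

    Hypothetical helper: name, signature, and precedence of the limits
    are assumptions, not part of Calico.
    """
    if node_count < 2:
        # With fewer than 2 nodes we cannot run at least one replica
        # while keeping replicas strictly below the node count.
        raise ValueError("need at least 2 nodes to satisfy replicas < nodes")

    # At least one replica for every 200 nodes.
    replicas = math.ceil(node_count / 200)

    # In production, a minimum of three replicas.
    if production:
        replicas = max(replicas, 3)

    # No more than 20 replicas. Assumption: above 4000 nodes this cap
    # takes precedence over the per-200-nodes rule.
    replicas = min(replicas, 20)

    # Always fewer replicas than nodes, otherwise rolling upgrades stall.
    # Assumption: this takes precedence over the production minimum.
    replicas = min(replicas, node_count - 1)

    return replicas
```

With 2 nodes, as in staging, this yields 1 replica, which matches what we are running there. Any cluster with between 4 and 600 nodes yields 3.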
As far as the action item from the aforementioned incident is concerned, this is done.