Followup from https://wikitech.wikimedia.org/wiki/Incident_documentation/20200611-sessionstore%2Bkubernetes. During the incident we were unable to schedule more pods on the dedicated sessionstore kubernetes nodes due to CPU constraints, which needlessly prolonged the incident. Those nodes, while part of the main kubernetes cluster, are tainted so that they only accept the sessionstore pods. Currently there are 2 such VMs per DC.
There are 2 different paths that we can follow here:
- Increase the CPU available to the nodes
- Increase the number of nodes.
Both can be followed independently. In fact, given the crucial nature of sessionstore, it's prudent to follow both. The first will allow us to schedule more kask pods on the nodes; the latter will have the added benefit of covering more network availability zones, thereby increasing tolerance to network incidents.
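
To make the trade-off between the two paths concrete, here is a rough sketch of the capacity arithmetic. It assumes the scheduler places pods based on their CPU requests, which is how kubernetes scheduling works. All numbers (allocatable CPU per node, CPU request per kask pod) and the function names are hypothetical placeholders, not the real values from our cluster.

```python
def pods_per_node(allocatable_mcpu: int, pod_request_mcpu: int) -> int:
    """How many pods fit on one node, judged by CPU requests only (millicores)."""
    return allocatable_mcpu // pod_request_mcpu


def schedulable_pods(nodes: int, allocatable_mcpu: int, pod_request_mcpu: int) -> int:
    """Total pods that fit across all dedicated nodes in one DC."""
    return nodes * pods_per_node(allocatable_mcpu, pod_request_mcpu)


def pods_after_node_loss(nodes: int, allocatable_mcpu: int, pod_request_mcpu: int) -> int:
    """Capacity left if a single node (or its availability zone) goes away."""
    return max(nodes - 1, 0) * pods_per_node(allocatable_mcpu, pod_request_mcpu)


# Hypothetical figures: 2 nodes per DC, 4000m allocatable each, 1500m per kask pod.
NODES, ALLOCATABLE, REQUEST = 2, 4000, 1500

baseline = schedulable_pods(NODES, ALLOCATABLE, REQUEST)
more_cpu = schedulable_pods(NODES, ALLOCATABLE * 2, REQUEST)
more_nodes = schedulable_pods(NODES * 2, ALLOCATABLE, REQUEST)

print("baseline:", baseline)
print("double CPU per node:", more_cpu)
print("double node count:", more_nodes)
print("double node count, one node lost:",
      pods_after_node_loss(NODES * 2, ALLOCATABLE, REQUEST))
```

With these made-up numbers, more CPU per node raises the total pod count the most, while more nodes leaves more capacity standing after losing a single node, which is the reason for following both paths.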