
Increase capacity of the sessionstore dedicated kubernetes nodes
Closed, Resolved · Public

Description

Followup from https://wikitech.wikimedia.org/wiki/Incident_documentation/20200611-sessionstore%2Bkubernetes. During the incident we were unable to schedule more pods on the dedicated sessionstore kubernetes nodes due to CPU constraints, which needlessly prolonged the incident. Those nodes, while part of the main kubernetes cluster, are tainted so that they only accept the sessionstore pods. Currently there are 2 such VMs per DC.
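
For context, the taint/toleration mechanism that keeps these nodes dedicated looks roughly like the sketch below. The exact taint key and label used on our nodes are not stated in this task, so `dedicated=sessionstore` is purely an assumption for illustration.

```
# Sketch only: the real taint key/value on the dedicated nodes is an assumption.
# Node side (applied once per dedicated node):
#   kubectl taint nodes <node> dedicated=sessionstore:NoSchedule
# Pod side: the kask pods carry a matching toleration so they can land on the
# tainted nodes, plus a nodeSelector so they land only there.
spec:
  nodeSelector:
    dedicated: sessionstore
  tolerations:
    - key: dedicated
      operator: Equal
      value: sessionstore
      effect: NoSchedule
```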

There are 2 different paths that we can follow here:

  • Increase the CPU available to the nodes
  • Increase the number of nodes.

Both can be followed independently. In fact, given the crucial nature of sessionstore, it's prudent to follow both. The first allows more kask pods to be scheduled on each node; the latter has the added benefit of adding more network availability zones, thereby increasing tolerance to network incidents.
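
One way the second path pays off is by spreading the kask replicas across the now larger set of nodes, so that losing a single node (or its network zone) takes out fewer pods. Below is a hedged sketch of what that could look like in the pod spec; the actual sessionstore chart is not shown in this task and may achieve the spread differently, and the `app: kask` label is an assumption.

```
# Sketch only: prefer scheduling kask replicas onto distinct nodes.
# The label selector is assumed, not taken from the actual chart.
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app: kask
```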

Event Timeline

Currently, sessionstore sets a limit of 400Mi and 2.5 CPUs[1]. The nodes have 4GB of RAM and 6 vCPUs. The easy win here is to increase the number of vCPUs from 6 to 15. That should allow for a 200% increase in the number of pods each node can run (from 2 to 6). The memory available already supports 10 pods in theory; in practice, since the node itself also consumes some memory, it's more like ~8, which brings the two resources pretty close to each other.

[1] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/master/helmfile.d/services/eqiad/sessionstore/values.yaml#99
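
Expressed as a resources stanza, the figures above work out as follows. This is a sketch based on the numbers quoted in this comment, not a verbatim copy of the linked values.yaml.

```
# Per-pod limit cited above (sketch, not copied from the linked values.yaml).
resources:
  limits:
    cpu: "2.5"
    memory: 400Mi
# Per-node arithmetic:
#   CPU:    6 vCPUs / 2.5   = 2 pods today; 15 vCPUs / 2.5 = 6 pods
#   Memory: 4G RAM  / 400Mi = ~10 pods in theory, ~8 once node overhead is counted
```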

Mentioned in SAL (#wikimedia-operations) [2020-06-24T12:11:02Z] <akosiaris> depool kubernetes2005,kubernetes2006 for CPU capacity increase T256236

Mentioned in SAL (#wikimedia-operations) [2020-06-24T12:14:28Z] <akosiaris> reboot kubernetes2005,6 for CPU capacity increase T256236

Mentioned in SAL (#wikimedia-operations) [2020-06-24T12:17:53Z] <akosiaris> depool/drain/reboot/pool kubernetes1005,6 for CPU capacity increase T256236

Change 607495 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] Introduce kubernetes[12]01[56]

https://gerrit.wikimedia.org/r/607495

Change 607495 merged by Alexandros Kosiaris:
[operations/dns@master] Introduce kubernetes[12]01[56]

https://gerrit.wikimedia.org/r/607495

Change 607752 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Introduce kubernetes[12]01[56]

https://gerrit.wikimedia.org/r/607752

Change 607752 merged by Alexandros Kosiaris:
[operations/puppet@production] Introduce kubernetes[12]01[56]

https://gerrit.wikimedia.org/r/607752

Change 607754 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/homer/public@master] Add kubernetes[12]01[56]

https://gerrit.wikimedia.org/r/607754

Change 607754 merged by jenkins-bot:
[operations/homer/public@master] Add kubernetes[12]01[56]

https://gerrit.wikimedia.org/r/607754

Mentioned in SAL (#wikimedia-operations) [2020-06-26T08:04:28Z] <akosiaris> pool all new kubernetes nodes in LVS T252185 T256236

akosiaris added a subscriber: wkandek.

Both paths outlined in the description have been followed. We now have 4 dedicated sessionstore nodes in each DC, each with capacity for 6 kask pods, for a total of 24 pods. That is 6 times the capacity we had when the incident was triggered and 3 times the capacity we currently have allocated. So we have room to react to a sudden increase, hopefully in the future even automatically.