Reevaluate the requirement for dedicated sessionstore/kask nodes in wikikube clusters
Closed, Resolved (Public)

Description

We're currently dedicating 4 nodes (ganeti VMs with 15 vCPUs and 5GB memory each) per DC to run only sessionstore/kask.

AIUI, the initial reasoning was that the service processes PII and that, during the early days of Kubernetes adoption, we felt the need for increased isolation by not running it side by side with other containers.
With most of our services now running on Kubernetes, a higher level of confidence in it, and improved tooling around it, it might be time to revisit this decision. Removing this snowflake would reduce complexity and eliminate a point of failure (which has already caused an outage at least once).

I could not find documentation about the decision process to run on dedicated nodes. If anybody happens to recall details, please share.

Event Timeline

The best I could find is T220821 and T221986.

As far as I am concerned, the approach we took back then might have been too aggressive, driven by the fact that OCI container escapes were much more common and easier back then. This is no longer the case, and the setup has indeed caused at least a couple of incidents:

  1. https://wikitech.wikimedia.org/wiki/Incidents/2020-06-11_sessionstore%2Bkubernetes
  2. https://wikitech.wikimedia.org/wiki/Incidents/2022-12-13_sessionstore

as well as a number of alerts and some worry every now and then.

I think the approach isn't useful anymore; we can indeed ditch it.

Change #1097311 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/alerts@master] Remove alert for sessionstore not running on dedicated nodes

https://gerrit.wikimedia.org/r/1097311

Change #1097312 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Remove affinity and tolerations from sessionstore deployments

https://gerrit.wikimedia.org/r/1097312
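
For readers unfamiliar with how a workload gets pinned to dedicated nodes, here is a rough sketch of what "removing affinity and tolerations" amounts to at the pod-spec level. It is not the content of the patch above: the label/taint key `dedicated`, the value `kask`, the image name, and the helper `drop_node_pinning` are all hypothetical and chosen only for illustration. Only the Kubernetes field names (`affinity`, `nodeAffinity`, `tolerations`, and so on) are standard.

```python
import copy

# Hypothetical pod spec pinned to dedicated nodes. The key "dedicated",
# the value "kask", and the image name are assumptions, not taken from
# the actual deployment charts.
pinned_pod_spec = {
    "containers": [{"name": "kask", "image": "example/kask:latest"}],
    "affinity": {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {"key": "dedicated", "operator": "In", "values": ["kask"]}
                        ]
                    }
                ]
            }
        }
    },
    "tolerations": [
        {"key": "dedicated", "operator": "Equal", "value": "kask", "effect": "NoSchedule"}
    ],
}


def drop_node_pinning(pod_spec: dict) -> dict:
    """Return a copy of pod_spec without node affinity and tolerations.

    Without these fields the scheduler may place the pod on any node
    that is schedulable and untainted.
    """
    unpinned = copy.deepcopy(pod_spec)
    unpinned.pop("affinity", None)
    unpinned.pop("tolerations", None)
    return unpinned


unpinned_pod_spec = drop_node_pinning(pinned_pod_spec)
print(sorted(unpinned_pod_spec))
```

With both fields gone, nothing ties the pods to the dedicated nodes, and nothing lets them tolerate a taint that keeps other workloads off those nodes.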

Change #1097311 merged by jenkins-bot:

[operations/alerts@master] Remove alert for sessionstore not running on dedicated nodes

https://gerrit.wikimedia.org/r/1097311

Change #1097312 merged by jenkins-bot:

[operations/deployment-charts@master] Remove affinity and tolerations from sessionstore deployments

https://gerrit.wikimedia.org/r/1097312

Mentioned in SAL (#wikimedia-operations) [2024-11-25T13:43:20Z] <jayme> cordoned kubernetes[2005-2006,2015-2016].codfw.wmnet,kubernetes[1005-1006,1015-1016].eqiad.wmnet - T379599
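
The host expression in the log entry above uses a compact bracketed range notation. As a reading aid, here is a minimal sketch of how such an expression expands into individual hostnames. The function `expand_hosts` is a hypothetical helper written for this illustration; it is not the implementation used by the actual tooling and only handles the simple single-bracket form seen in this task.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand expressions like 'name[1-2,5].domain,other[7-8].domain'."""
    hosts = []
    # Split on commas that are not inside a [...] group.
    for part in re.split(r",(?![^\[]*\])", expr):
        match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", part)
        if match is None:
            # No bracket group: treat the part as a plain hostname.
            hosts.append(part)
            continue
        prefix, ranges, suffix = match.groups()
        for item in ranges.split(","):
            low, _, high = item.partition("-")
            high = high or low  # a single number is a range of one
            width = len(low)    # preserve zero padding, if any
            for number in range(int(low), int(high) + 1):
                hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


cordoned = expand_hosts(
    "kubernetes[2005-2006,2015-2016].codfw.wmnet,"
    "kubernetes[1005-1006,1015-1016].eqiad.wmnet"
)
for host in cordoned:
    print(host)
```

The expression expands to eight hosts, four per data center, which matches the "4 nodes per DC" in the task description.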

Mentioned in SAL (#wikimedia-operations) [2024-11-25T13:46:04Z] <jayme> deployed sessionstore to non-dedicated nodes - T379599

Change #1097442 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Decom kubernetes[12]0[01][56] dedicates sessionstore nodes

https://gerrit.wikimedia.org/r/1097442
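
The pattern `kubernetes[12]0[01][56]` in the change title above can be read as a set of character classes. Assuming it is interpreted that way (an assumption for this sketch; the commit title itself does not say how it is meant to be parsed), the following check shows that it describes exactly the eight dedicated nodes named elsewhere in this task.

```python
import itertools
import re

# Character-class pattern copied from the change title.
PATTERN = re.compile(r"kubernetes[12]0[01][56]")

# Short names of the dedicated nodes listed in this task.
dedicated_nodes = [
    "kubernetes1005", "kubernetes1006", "kubernetes1015", "kubernetes1016",
    "kubernetes2005", "kubernetes2006", "kubernetes2015", "kubernetes2016",
]

# Enumerate every name the pattern can produce: 2 * 2 * 2 = 8 combinations.
enumerated = sorted(
    f"kubernetes{site}0{tens}{unit}"
    for site, tens, unit in itertools.product("12", "01", "56")
)

all_match = all(PATTERN.fullmatch(name) for name in dedicated_nodes)
same_set = enumerated == sorted(dedicated_nodes)
print(all_match, same_set)
```

The first digit distinguishes the data center (1 for eqiad, 2 for codfw, going by the hostnames in the log), so one pattern covers both sites.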

depool host kubernetes[2005-2006,2015-2016].codfw.wmnet by jayme@cumin2002 with reason: None

Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin2002 depool for host kubernetes[2005-2006,2015-2016].codfw.wmnet completed:

  • kubernetes[2005-2006,2015-2016].codfw.wmnet (PASS)
    • Host kubernetes[2005-2006,2015-2016].codfw.wmnet depooled from wikikube-codfw

depool host kubernetes[1005-1006,1015-1016].eqiad.wmnet by jayme@cumin2002 with reason: None

Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin2002 depool for host kubernetes[1005-1006,1015-1016].eqiad.wmnet completed:

  • kubernetes[1005-1006,1015-1016].eqiad.wmnet (PASS)
    • Host kubernetes[1005-1006,1015-1016].eqiad.wmnet depooled from wikikube-eqiad

Change #1097442 merged by JMeybohm:

[operations/puppet@production] Decom kubernetes[12]0[01][56] dedicates sessionstore nodes

https://gerrit.wikimedia.org/r/1097442

cookbooks.sre.hosts.decommission executed by jayme@cumin2002 for hosts: kubernetes[2005-2006,2015-2016].codfw.wmnet,kubernetes[1005-1006,1015-1016].eqiad.wmnet

  • kubernetes2005.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
  • kubernetes2006.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
  • kubernetes2015.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
  • kubernetes2016.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
  • kubernetes1005.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
  • kubernetes1006.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
  • kubernetes1015.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
  • kubernetes1016.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
JMeybohm claimed this task.