Relax nodeAffinity of sessionstore
Closed, ResolvedPublic

Description

As a follow-up to T325056:

We currently pin sessionstore to dedicated nodes for isolation:

nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
      - matchExpressions:
          - key: dedicated
            operator: In
            values:
              - kask

This means that sessionstore pods will become unschedulable if there is no dedicated kask node available.

We should relax the affinity from requiredDuringSchedulingIgnoredDuringExecution to preferredDuringSchedulingIgnoredDuringExecution, so that we accept a temporary degradation of separation rather than an outage. This should be combined with a proper alerting rule that fires when sessionstore gets scheduled on non-dedicated nodes, so that a human can intervene and clear the situation right away, restoring separation.
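For illustration, a relaxed version of the block above might look roughly like the sketch below. It is not the merged chart change. Note that in the Kubernetes API the preferred form has a different shape from the required one: it is a list of weighted entries, each with a preference, rather than a nodeSelectorTerms list. The weight of 100 is an assumption chosen for this sketch.

```yaml
nodeAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
    # Each entry carries a weight (1-100) and a single preference term.
    # weight: 100 is an assumed value, not taken from the actual patch.
    - weight: 100
      preference:
        matchExpressions:
          - key: dedicated
            operator: In
            values:
              - kask
```

With this form the scheduler still favours nodes labelled dedicated=kask, but it falls back to other eligible nodes when none of them can take the pod.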

Event Timeline

JMeybohm triaged this task as Medium priority.Dec 14 2022, 9:59 AM
JMeybohm created this task.

Note that we also have taints on the nodes dedicated to sessionstore (albeit marked as kask), to keep other workloads from being scheduled on those nodes. This is only tangentially related and does not interfere with this task, but I am noting it here for completeness.

Change 901572 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/deployment-charts@master] Relax nodeAffinity of sessionstore pods

https://gerrit.wikimedia.org/r/901572

Change 902052 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/alerts@master] Alert on sessionstore scheduling on non-dedicated k8s hosts

https://gerrit.wikimedia.org/r/902052

Change 902052 merged by jenkins-bot:

[operations/alerts@master] Alert on sessionstore scheduling on non-dedicated k8s hosts

https://gerrit.wikimedia.org/r/902052

Change 901572 merged by EoghanGaffney:

[operations/deployment-charts@master] Relax nodeAffinity of sessionstore pods

https://gerrit.wikimedia.org/r/901572

Change 902114 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/deployment-charts@master] Fix preferredDuringScheduling[...] change for sessionstore

https://gerrit.wikimedia.org/r/902114

Change 902114 merged by EoghanGaffney:

[operations/deployment-charts@master] Fix preferredDuringScheduling[...] change for sessionstore

https://gerrit.wikimedia.org/r/902114

We've deployed the change to relax the nodeAffinity setting. Tomorrow morning we'll drain one of the nodes to test that the alert works correctly, and then we can close this ticket.

image.png (334×894 px, 73 KB)

We have an alert to catch the condition where a pod gets scheduled on a non-dedicated host. Future work would be to have this alert create a task rather than raise an alarm, but I think we can close this for now!
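As a rough idea of what such an alert could look like, here is a minimal Prometheus rule sketch. It is not the rule merged in operations/alerts. It assumes kube-state-metrics is deployed and configured to expose the dedicated node label as label_dedicated on kube_node_labels, and that the pods run in a namespace called sessionstore. The group name, alert name, duration and severity are all hypothetical.

```yaml
groups:
  - name: sessionstore_scheduling  # hypothetical group name
    rules:
      - alert: SessionstorePodOnNonDedicatedNode  # hypothetical alert name
        # Join each sessionstore pod to its node's labels and keep only
        # pods whose node is not labelled dedicated=kask (a missing label
        # also matches, since it compares as an empty string).
        expr: |
          kube_pod_info{namespace="sessionstore"}
            * on (node) group_left ()
          kube_node_labels{label_dedicated!="kask"}
        for: 5m  # assumed grace period
        labels:
          severity: warning  # assumed severity
        annotations:
          summary: "sessionstore pod {{ $labels.pod }} is running on non-dedicated node {{ $labels.node }}"
```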