Add missing failure domain labels to ml-serve-* clusters
Closed, Resolved, Public

Description

In the parent task it was discussed that the ml-serve clusters should have labels like:

failure-domain.beta.kubernetes.io/region: {eqiad,codfw}
failure-domain.beta.kubernetes.io/zone: row-{a,b,c,d,e1,e2,....}

The zone values will mix the "old" per-row redundancy scheme (row-{a,b,c,d}) with the new per-rack scheme used in rows E/F (row-{e1,e2,e3,...,f1,f2,...}). We cannot add the labels via puppet, because in k8s 1.16 node labels can be set only on the first run of the kubelet, so we'll have to add them manually. We should also add a comment in puppet as a reminder to set them properly when we migrate to the new k8s version (these labels are deprecated there, so we'll need a replacement).

Event Timeline

root@deploy1002:~# kubectl label nodes ml-serve1001.eqiad.wmnet failure-domain.beta.kubernetes.io/region=eqiad
node/ml-serve1001.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1002.eqiad.wmnet failure-domain.beta.kubernetes.io/region=eqiad
node/ml-serve1002.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1003.eqiad.wmnet failure-domain.beta.kubernetes.io/region=eqiad
node/ml-serve1003.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1004.eqiad.wmnet failure-domain.beta.kubernetes.io/region=eqiad
node/ml-serve1004.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1005.eqiad.wmnet failure-domain.beta.kubernetes.io/region=eqiad
node/ml-serve1005.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1006.eqiad.wmnet failure-domain.beta.kubernetes.io/region=eqiad
node/ml-serve1006.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1007.eqiad.wmnet failure-domain.beta.kubernetes.io/region=eqiad
node/ml-serve1007.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1008.eqiad.wmnet failure-domain.beta.kubernetes.io/region=eqiad
node/ml-serve1008.eqiad.wmnet labeled

root@deploy1002:~# kubectl label nodes ml-serve1001.eqiad.wmnet failure-domain.beta.kubernetes.io/zone=row-a
node/ml-serve1001.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1002.eqiad.wmnet failure-domain.beta.kubernetes.io/zone=row-b
node/ml-serve1002.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1003.eqiad.wmnet failure-domain.beta.kubernetes.io/zone=row-c
node/ml-serve1003.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1004.eqiad.wmnet failure-domain.beta.kubernetes.io/zone=row-d
node/ml-serve1004.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1005.eqiad.wmnet failure-domain.beta.kubernetes.io/zone=row-e2
node/ml-serve1005.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1006.eqiad.wmnet failure-domain.beta.kubernetes.io/zone=row-e3
node/ml-serve1006.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1007.eqiad.wmnet failure-domain.beta.kubernetes.io/zone=row-f2
node/ml-serve1007.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1008.eqiad.wmnet failure-domain.beta.kubernetes.io/zone=row-f3
node/ml-serve1008.eqiad.wmnet labeled
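
For reference, the per-node commands above could also be generated from a node-to-zone mapping instead of being typed one by one. The following is only a sketch: print_label_cmds is a hypothetical helper name (not an existing tool), the mapping is copied from the session above, and the function prints the commands (interleaving region and zone per node) rather than running them, so the output can be reviewed before being pasted into a shell.

```shell
#!/usr/bin/env bash
# Sketch only: print (do not execute) the kubectl label commands for one site.
# print_label_cmds is a hypothetical helper, not part of kubectl or puppet.
# Usage: print_label_cmds <site> <host>=<zone> [<host>=<zone> ...]
print_label_cmds() {
  local region="$1"
  shift
  local pair node zone
  for pair in "$@"; do
    node="${pair%%=*}"   # text before the first '=' (short hostname)
    zone="${pair#*=}"    # text after the first '=' (zone label value)
    echo "kubectl label nodes ${node}.${region}.wmnet failure-domain.beta.kubernetes.io/region=${region}"
    echo "kubectl label nodes ${node}.${region}.wmnet failure-domain.beta.kubernetes.io/zone=${zone}"
  done
}

# Mapping taken from the eqiad commands shown above.
print_label_cmds eqiad \
  ml-serve1001=row-a ml-serve1002=row-b ml-serve1003=row-c ml-serve1004=row-d \
  ml-serve1005=row-e2 ml-serve1006=row-e3 ml-serve1007=row-f2 ml-serve1008=row-f3
```

Printing instead of executing is a deliberate choice here: on k8s 1.16 these labels are managed by hand, so a dry-run list is easier to double check than a loop that mutates the cluster directly.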

I haven't added anything to the Ganeti nodes yet, since that needs to be discussed in the parent task first. Codfw is still missing too, but in my opinion we can do it as soon as we have a final, tested setup in ml-serve-eqiad.

Change 792232 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Reduce the scope of Calico's global BGP Peers for ml-serve-eqiad

https://gerrit.wikimedia.org/r/792232

Change 792232 merged by Elukey:

[operations/deployment-charts@master] Reduce the scope of Calico's global BGP Peers for ml-serve-eqiad

https://gerrit.wikimedia.org/r/792232

Change 792611 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Allow BGP from calico pods running on master nodes on ml-serve-eqiad

https://gerrit.wikimedia.org/r/792611

Change 792611 merged by Elukey:

[operations/deployment-charts@master] Allow BGP from calico pods running on master nodes on ml-serve-eqiad

https://gerrit.wikimedia.org/r/792611

root@deploy1002:~# kubectl label nodes ml-serve2001.codfw.wmnet failure-domain.beta.kubernetes.io/region=codfw
node/ml-serve2001.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2002.codfw.wmnet failure-domain.beta.kubernetes.io/region=codfw
node/ml-serve2002.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2003.codfw.wmnet failure-domain.beta.kubernetes.io/region=codfw
node/ml-serve2003.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2004.codfw.wmnet failure-domain.beta.kubernetes.io/region=codfw
node/ml-serve2004.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2005.codfw.wmnet failure-domain.beta.kubernetes.io/region=codfw
node/ml-serve2005.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2006.codfw.wmnet failure-domain.beta.kubernetes.io/region=codfw
node/ml-serve2006.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2007.codfw.wmnet failure-domain.beta.kubernetes.io/region=codfw
node/ml-serve2007.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2008.codfw.wmnet failure-domain.beta.kubernetes.io/region=codfw
node/ml-serve2008.codfw.wmnet labeled

root@deploy1002:~# kubectl label nodes ml-serve2001.codfw.wmnet failure-domain.beta.kubernetes.io/zone=row-a
node/ml-serve2001.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2002.codfw.wmnet failure-domain.beta.kubernetes.io/zone=row-b
node/ml-serve2002.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2003.codfw.wmnet failure-domain.beta.kubernetes.io/zone=row-c
node/ml-serve2003.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2004.codfw.wmnet failure-domain.beta.kubernetes.io/zone=row-d
node/ml-serve2004.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2005.codfw.wmnet failure-domain.beta.kubernetes.io/zone=row-a
node/ml-serve2005.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2006.codfw.wmnet failure-domain.beta.kubernetes.io/zone=row-b
node/ml-serve2006.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2007.codfw.wmnet failure-domain.beta.kubernetes.io/zone=row-c
node/ml-serve2007.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2008.codfw.wmnet failure-domain.beta.kubernetes.io/zone=row-d
node/ml-serve2008.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve-ctrl2001.codfw.wmnet node-role.kubernetes.io/master=""
node/ml-serve-ctrl2001.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve-ctrl2002.codfw.wmnet node-role.kubernetes.io/master=""
node/ml-serve-ctrl2002.codfw.wmnet labeled
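
Since the labels are applied manually, a quick check for nodes that were skipped may be useful. The sketch below is an assumption-laden example, not something that was run in this task: list_unlabeled is a hypothetical helper that reads "NAME REGION ZONE" rows on stdin and prints the nodes where either value is missing, and it assumes that kubectl's custom-columns output prints <none> for an absent label.

```shell
#!/usr/bin/env bash
# Sketch only: report nodes that lack a region or zone label.
# list_unlabeled is a hypothetical helper; it expects whitespace-separated
# "NAME REGION ZONE" rows on stdin, with <none> marking a missing value.
list_unlabeled() {
  awk '$2 == "<none>" || $3 == "<none>" { print $1 }'
}

# Possible usage on the deployment host (assumes custom-columns prints <none>
# for labels that are not set, and that dots in label keys are backslash-escaped):
#
#   kubectl get nodes --no-headers -o custom-columns='NAME:.metadata.name,REGION:.metadata.labels.failure-domain\.beta\.kubernetes\.io/region,ZONE:.metadata.labels.failure-domain\.beta\.kubernetes\.io/zone' | list_unlabeled
```

Control plane nodes that intentionally carry no failure domain labels would also show up in this list, so the output is meant as a prompt for review rather than as an error report.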

Change 792970 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add comments related to kubelet labels used in ml-serve clusters

https://gerrit.wikimedia.org/r/792970

Change 792970 merged by Elukey:

[operations/puppet@production] Add kubelet labels used in ml-serve clusters

https://gerrit.wikimedia.org/r/792970

Change 793059 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::ml_k8s::master: remove node labels for kubelet

https://gerrit.wikimedia.org/r/793059

Change 793059 merged by Elukey:

[operations/puppet@production] role::ml_k8s::master: remove node labels for kubelet

https://gerrit.wikimedia.org/r/793059