
Bring dse-k8s-worker2003.codfw.wmnet into production
Closed, Resolved · Public

Description

dse-k8s-worker2003.codfw.wmnet has joined the cluster, but it is still cordoned and therefore not yet schedulable:

kubectl get nodes
NAME                             STATUS                     ROLES           AGE   VERSION
dse-k8s-ctrl2001.codfw.wmnet     Ready                      control-plane   25d   v1.31.4
dse-k8s-ctrl2002.codfw.wmnet     Ready                      control-plane   25d   v1.31.4
dse-k8s-worker2001.codfw.wmnet   Ready                      <none>          25d   v1.31.4
dse-k8s-worker2002.codfw.wmnet   Ready                      <none>          25d   v1.31.4
dse-k8s-worker2003.codfw.wmnet   Ready,SchedulingDisabled   <none>          16h   v1.31.4
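
The SchedulingDisabled status means the node is still cordoned. A minimal way to confirm this (standard kubectl, nothing cluster-specific assumed):

kubectl describe node dse-k8s-worker2003.codfw.wmnet | grep -iE 'unschedulable|taints'

A cordoned node reports Unschedulable: true and carries the node.kubernetes.io/unschedulable:NoSchedule taint; both are cleared by kubectl uncordon.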

Creating this ticket to:

  • Bring dse-k8s-worker2003.codfw.wmnet into production
  • Verify operation

Event Timeline

BTullis triaged this task as Medium priority.

I have enabled the BGP flag for the host in Netbox, and the Homer diff for the top-of-rack switch looks good.

btullis@cumin1003:~$ sudo homer "lsw1-b5-codfw*" diff
WARNING:homer.capirca:Netbox capirca.GetHosts script is > 3 days old.
INFO:homer.devices:Initialized 109 devices
INFO:homer:Generating diff for query lsw1-b5-codfw*
INFO:homer:Gathering global Netbox data
INFO:homer.devices:Matched 1 device(s) for query 'lsw1-b5-codfw*'
INFO:homer:Generating configuration for lsw1-b5-codfw.mgmt.codfw.wmnet
INFO:homer.transports.junos:Running commit check on lsw1-b5-codfw.mgmt.codfw.wmnet
Changes for 1 devices: ['lsw1-b5-codfw.mgmt.codfw.wmnet']

[edit routing-instances PRODUCTION protocols bgp]
       group Kubemlserve6 { ... }
+      group Kubedse4 {
+          type external;
+          hold-time 30;
+          import kubedse_import;
+          family inet {
+              unicast {
+                  prefix-limit {
+                      maximum 50;
+                      teardown {
+                          80;
+                          idle-timeout forever;
+                      }
+                  }
+              }
+          }
+          /* T328523 */
+          export kubernetes_export;
+          peer-as 64613;
+          local-as 14907 loops 2 private no-prepend-global-as;
+          multipath;
+          neighbor 10.192.14.6 {
+              description dse-k8s-worker2003;
+          }
+      }
+      group Kubedse6 {
+          type external;
+          hold-time 30;
+          import kubedse_import;
+          family inet6 {
+              unicast {
+                  prefix-limit {
+                      maximum 50;
+                      teardown {
+                          80;
+                          idle-timeout forever;
+                      }
+                  }
+              }
+          }
+          /* T328523 */
+          export kubernetes_export;
+          peer-as 64613;
+          local-as 14907 loops 2 private no-prepend-global-as;
+          multipath;
+          neighbor 2620:0:860:10f:10:192:14:6 {
+              description dse-k8s-worker2003;
+          }
+      }

---------------
INFO:homer:Homer run completed successfully on 1 devices: ['lsw1-b5-codfw.mgmt.codfw.wmnet']

Committed the change.
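
For reference, committing a Homer diff uses the same device query plus a commit message; the message below is illustrative, not the one actually used:

btullis@cumin1003:~$ sudo homer "lsw1-b5-codfw*" commit "Add BGP groups for dse-k8s-worker2003"

Homer re-runs the commit check and asks for confirmation before committing on each matched device.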

Uncordoned the host.

root@deploy2002:/srv/deployment-charts/helmfile.d/admin_ng# kubectl uncordon dse-k8s-worker2003.codfw.wmnet
node/dse-k8s-worker2003.codfw.wmnet uncordoned
root@deploy2002:/srv/deployment-charts/helmfile.d/admin_ng# kubectl get nodes
NAME                             STATUS   ROLES           AGE   VERSION
dse-k8s-ctrl2001.codfw.wmnet     Ready    control-plane   25d   v1.31.4
dse-k8s-ctrl2002.codfw.wmnet     Ready    control-plane   25d   v1.31.4
dse-k8s-worker2001.codfw.wmnet   Ready    <none>          25d   v1.31.4
dse-k8s-worker2002.codfw.wmnet   Ready    <none>          25d   v1.31.4
dse-k8s-worker2003.codfw.wmnet   Ready    <none>          16h   v1.31.4

All of the kube-system pods are now running correctly. The calico-node pod was crashlooping until the BGP change was made to the switch (a way to verify the peering is sketched after the listing below).

root@deploy2002:/srv/deployment-charts/helmfile.d/admin_ng# kubectl get pods --all-namespaces -owide|grep dse-k8s-worker2003
kube-system             calico-node-v4bqn                                               1/1     Running   288 (7m11s ago)   17h    10.192.14.6     dse-k8s-worker2003.codfw.wmnet   <none>           <none>
kube-system             calico-typha-7c5b67b766-g4fbr                                   1/1     Running   0                 25d    10.192.14.6     dse-k8s-worker2003.codfw.wmnet   <none>           <none>
kube-system             ceph-csi-cephfs-nodeplugin-km8gf                                2/2     Running   0                 17h    10.192.14.6     dse-k8s-worker2003.codfw.wmnet   <none>           <none>
kube-system             ceph-csi-cephfs-provisioner-5f8874f66-hdd8j                     4/4     Running   0                 21d    10.192.102.64   dse-k8s-worker2003.codfw.wmnet   <none>           <none>
kube-system             ceph-csi-rbd-nodeplugin-kptqg                                   2/2     Running   0                 17h    10.192.14.6     dse-k8s-worker2003.codfw.wmnet   <none>           <none>
kube-system             ceph-csi-rbd-provisioner-69c74d89cc-4v4q7                       6/6     Running   4 (118s ago)      21d    10.192.102.65   dse-k8s-worker2003.codfw.wmnet   <none>           <none>
root@deploy2002:/srv/deployment-charts/helmfile.d/admin_ng#
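
With the restart counter no longer climbing, the BGP peering itself can be checked from the worker. This assumes calicoctl is installed on the host; the command is standard Calico, not specific to this setup:

btullis@dse-k8s-worker2003:~$ sudo calicoctl node status

Established sessions in both the IPv4 and IPv6 BGP status tables confirm the peerings created by the new Kubedse4 and Kubedse6 groups on the top-of-rack switch.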

Roll-reboot of nodes in dse-codfw cluster started by btullis:

  • dse-k8s-worker[2001-2003].codfw.wmnet
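
Per node, the roll-reboot boils down to roughly the following sequence; this is a sketch of the usual drain/reboot cycle, not the cookbook's exact implementation:

kubectl drain dse-k8s-worker2003.codfw.wmnet --ignore-daemonsets --delete-emptydir-data
# reboot the host, wait for it to return and report Ready
kubectl uncordon dse-k8s-worker2003.codfw.wmnet
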
BTullis updated the task description.

It's a bit difficult to validate that everything is working when we don't yet have any workload on the cluster, but it all looks OK.
I will resolve this for now and revisit if we have any issues with it down the line.
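
In the absence of a real workload, one low-effort smoke test is to pin a throwaway pod to the new node and check that it schedules and runs. The image name below is an assumption; substitute any image pullable in this environment:

kubectl run smoke-worker2003 --restart=Never \
  --image=docker-registry.wikimedia.org/bookworm:latest \
  --overrides='{"apiVersion": "v1", "spec": {"nodeName": "dse-k8s-worker2003.codfw.wmnet"}}' \
  -- sleep 30
kubectl get pod smoke-worker2003 -o wide
kubectl delete pod smoke-worker2003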

Change #1198329 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add dse-k8s-worker2003 to the kubesvc pool

https://gerrit.wikimedia.org/r/1198329

Change #1198329 merged by Btullis:

[operations/puppet@production] Add dse-k8s-worker2003 to the kubesvc pool

https://gerrit.wikimedia.org/r/1198329
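
Adding the worker to the kubesvc pool makes it eligible to receive LVS service traffic. Once the patch is merged and Puppet has run, the node typically still needs to be pooled explicitly; a standard confctl invocation (the exact selector here is an assumption) looks like:

sudo confctl select name=dse-k8s-worker2003.codfw.wmnet set/pooled=yes
sudo confctl select name=dse-k8s-worker2003.codfw.wmnet get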