
Bootstrap the dse-k8s-codfw cluster
Closed, Resolved (Public)

Event Timeline

Gehel triaged this task as High priority.Jun 20 2025, 8:15 AM
Gehel moved this task from Incoming to Scratch on the Data-Platform-SRE board.

Change #1172619 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Add dse-k8s-codfw etcd configuration

https://gerrit.wikimedia.org/r/1172619

Change #1172619 merged by Bking:

[operations/puppet@production] dse-k8s: Add dse-k8s-codfw k8s configuration

https://gerrit.wikimedia.org/r/1172619

Change #1173914 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] dse-k8s: Add dse-k8s-codfw etcd cluster configuration

https://gerrit.wikimedia.org/r/1173914

Change #1173914 merged by Stevemunene:

[operations/puppet@production] dse-k8s: Add dse-k8s-codfw etcd cluster configuration

https://gerrit.wikimedia.org/r/1173914

Change #1179652 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Update firewall rules to add dse-k8s-codfw

https://gerrit.wikimedia.org/r/1179652

Change #1179652 merged by Stevemunene:

[operations/puppet@production] Update firewall rules to add dse-k8s-codfw

https://gerrit.wikimedia.org/r/1179652

Change #1180512 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] dse-k8s: Add dse-k8s-codfw ctrl and worker nodes to role

https://gerrit.wikimedia.org/r/1180512

Change #1180512 merged by Stevemunene:

[operations/puppet@production] dse-k8s: Add dse-k8s-codfw ctrl and worker nodes to role

https://gerrit.wikimedia.org/r/1180512

For the bootstrapping, I added the hosts to conftool data by merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1178834 (dse-k8s: add dse-k8s-codfw hosts to LVS), then pooled the kubemasters.

stevemunene@puppetserver1001:~$ sudo confctl select 'service=(kubemaster),name=dse-k8s-ctrl2001.codfw.wmnet' set/pooled=yes:weight=1
codfw/dse-k8s/kubemaster/dse-k8s-ctrl2001.codfw.wmnet: pooled changed inactive => yes
codfw/dse-k8s/kubemaster/dse-k8s-ctrl2001.codfw.wmnet: weight changed 0 => 1
WARNING:conftool.announce:conftool action : set/pooled=yes:weight=1; selector: service=(kubemaster),name=dse-k8s-ctrl2001.codfw.wmnet
stevemunene@puppetserver1001:~$ sudo confctl select 'service=(kubemaster),name=dse-k8s-ctrl2002.codfw.wmnet' set/pooled=yes:weight=1
codfw/dse-k8s/kubemaster/dse-k8s-ctrl2002.codfw.wmnet: pooled changed inactive => yes
codfw/dse-k8s/kubemaster/dse-k8s-ctrl2002.codfw.wmnet: weight changed 0 => 1
WARNING:conftool.announce:conftool action : set/pooled=yes:weight=1; selector: service=(kubemaster),name=dse-k8s-ctrl2002.codfw.wmnet
stevemunene@puppetserver1001:~$

Mentioned in SAL (#wikimedia-operations) [2025-08-21T10:37:58Z] <stevemunene@cumin1003> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs[2013-2014].codfw.wmnet} and A:lvs (T397301)

Mentioned in SAL (#wikimedia-operations) [2025-08-21T10:49:58Z] <stevemunene@cumin1003> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs[2013-2014].codfw.wmnet} and A:lvs (T397301)

Did the same for the kubesvc hosts

stevemunene@puppetserver1001:~$ sudo confctl select 'service=(kubesvc),name=dse-k8s-worker2001.codfw.wmnet' set/pooled=yes:weight=1
codfw/dse-k8s/kubesvc/dse-k8s-worker2001.codfw.wmnet: pooled changed inactive => yes
codfw/dse-k8s/kubesvc/dse-k8s-worker2001.codfw.wmnet: weight changed 0 => 1
WARNING:conftool.announce:conftool action : set/pooled=yes:weight=1; selector: service=(kubesvc),name=dse-k8s-worker2001.codfw.wmnet
stevemunene@puppetserver1001:~$ sudo confctl select 'service=(kubesvc),name=dse-k8s-worker2002.codfw.wmnet' set/pooled=yes:weight=1
codfw/dse-k8s/kubesvc/dse-k8s-worker2002.codfw.wmnet: pooled changed inactive => yes
codfw/dse-k8s/kubesvc/dse-k8s-worker2002.codfw.wmnet: weight changed 0 => 1
WARNING:conftool.announce:conftool action : set/pooled=yes:weight=1; selector: service=(kubesvc),name=dse-k8s-worker2002.codfw.wmnet
stevemunene@puppetserver1001:~$ sudo confctl select 'service=(kubesvc),name=dse-k8s-worker2003.codfw.wmnet' set/pooled=yes:weight=1
WARNING:conftool.announce:conftool action : set/pooled=yes:weight=1; selector: service=(kubesvc),name=dse-k8s-worker2003.codfw.wmnet

Verified with

stevemunene@puppetserver1001:~$ sudo -i confctl --quiet select 'cluster=dse-k8s,dc=codfw' get
{"dse-k8s-ctrl2001.codfw.wmnet": {"weight": 1, "pooled": "yes"}, "tags": "dc=codfw,cluster=dse-k8s,service=kubemaster"}
{"dse-k8s-ctrl2002.codfw.wmnet": {"weight": 1, "pooled": "yes"}, "tags": "dc=codfw,cluster=dse-k8s,service=kubemaster"}
{"dse-k8s-worker2002.codfw.wmnet": {"weight": 1, "pooled": "yes"}, "tags": "dc=codfw,cluster=dse-k8s,service=kubesvc"}
{"dse-k8s-worker2001.codfw.wmnet": {"weight": 1, "pooled": "yes"}, "tags": "dc=codfw,cluster=dse-k8s,service=kubesvc"}

Added the services to LVS as per https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service
stevemunene@cumin1003:~$ sudo cookbook sre.loadbalancer.restart-pybal -r "adding new service dse-k8s-codfw" -t T397301 --query P{lvs[2013-2014].codfw.wmnet}

Labelled the control plane hosts

root@deploy1003:~# kube_env admin dse-k8s-codfw
root@deploy1003:~# kubectl label nodes dse-k8s-ctrl2001.codfw.wmnet node-role.kubernetes.io/control-plane=""
node/dse-k8s-ctrl2001.codfw.wmnet labeled
root@deploy1003:~# kubectl label nodes dse-k8s-ctrl2002.codfw.wmnet node-role.kubernetes.io/control-plane=""
node/dse-k8s-ctrl2002.codfw.wmnet labeled

Networking, cluster configuration and basic services

root@deploy1003:/srv/deployment-charts/helmfile.d/admin_ng# helmfile -e dse-k8s-codfw sync
Listing releases matching ^istio-proxy-settings$
Listing releases matching ^istio-gateways-networkpolicies$
Listing releases matching ^istio-gateways-envoyfilters$
Listing releases matching ^knative-serving-crds$
Listing releases matching ^flink-operator-crds$
Listing releases matching ^ceph-csi-rbd$
Listing releases matching ^ceph-csi-cephfs$
Listing releases matching ^cloudnative-pg-crds$
Listing releases matching ^knative-serving$
Listing releases matching ^kserve$
Listing releases matching ^flink-operator$
Listing releases matching ^spark-operator$
Listing releases matching ^k8s-controller-sidecars$
Listing releases matching ^main-opentelemetry-collector$
Listing releases matching ^cloudnative-pg$
Upgrading release=rbac-rules, chart=wmf-stable/raw, namespace=kube-system
Upgrading release=pod-security-policies, chart=wmf-stable/raw, namespace=kube-system
Upgrading release=calico-crds, chart=wmf-stable/calico-crds, namespace=kube-system
Upgrading release=namespaces, chart=wmf-stable/raw, namespace=kube-system
Upgrading release=priority-classes, chart=wmf-stable/raw, namespace=
Release "priority-classes" does not exist. Installing it now.
NAME: priority-classes
LAST DEPLOYED: Thu Aug 21 11:33:22 2025
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

Listing releases matching ^priority-classes$
priority-classes	default  	1       	2025-08-21 11:33:22.052465236 +0000 UTC	deployed	raw-0.3.0	0.2.3      

Release "pod-security-policies" does not exist. Installing it now.
NAME: pod-security-policies
LAST DEPLOYED: Thu Aug 21 11:33:21 2025
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None

Listing releases matching ^pod-security-policies$
pod-security-policies	kube-system	1       	2025-08-21 11:33:21.915096804 +0000 UTC	deployed	raw-0.3.0	0.2.3      

Release "rbac-rules" does not exist. Installing it now.
NAME: rbac-rules
LAST DEPLOYED: Thu Aug 21 11:33:22 2025
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None

Listing releases matching ^rbac-rules$
rbac-rules	kube-system	1       	2025-08-21 11:33:22.007265886 +0000 UTC	deployed	raw-0.3.0	0.2.3      

Release "calico-crds" does not exist. Installing it now.
NAME: calico-crds
LAST DEPLOYED: Thu Aug 21 11:33:22 2025
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None

Listing releases matching ^calico-crds$
calico-crds	kube-system	1       	2025-08-21 11:33:22.077036057 +0000 UTC	deployed	calico-crds-0.2.0	3.23.3     

Release "namespaces" does not exist. Installing it now.
NAME: namespaces
LAST DEPLOYED: Thu Aug 21 11:33:22 2025
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None

Listing releases matching ^namespaces$
namespaces	kube-system	1       	2025-08-21 11:33:22.119332257 +0000 UTC	deployed	raw-0.3.0	0.2.3      

skipping missing values file matching "calico/values.yaml"
skipping missing values file matching "/etc/helmfile-defaults/private/admin/calico/dse-k8s-codfw.yaml"
skipping missing values file matching "external-services/values.yaml"
skipping missing values file matching "values/dse-k8s-codfw/external-services-values.yaml"
skipping missing values file matching "/etc/helmfile-defaults/private/admin/external-services/dse-k8s-codfw.yaml"
Upgrading release=cfssl-issuer-crds, chart=wmf-stable/cfssl-issuer-crds, namespace=cert-manager
Upgrading release=calico, chart=wmf-stable/calico, namespace=kube-system
Upgrading release=external-services, chart=wmf-stable/external-services, namespace=external-services
Upgrading release=cert-manager-networkpolicies, chart=wmf-stable/raw, namespace=cert-manager
Release "cert-manager-networkpolicies" does not exist. Installing it now.
NAME: cert-manager-networkpolicies
LAST DEPLOYED: Thu Aug 21 11:33:36 2025
NAMESPACE: cert-manager
STATUS: deployed
REVISION: 1
TEST SUITE: None

Listing releases matching ^cert-manager-networkpolicies$
cert-manager-networkpolicies	cert-manager	1       	2025-08-21 11:33:36.943408568 +0000 UTC	deployed	raw-0.3.0	0.2.3      

Release "cfssl-issuer-crds" does not exist. Installing it now.
NAME: cfssl-issuer-crds
LAST DEPLOYED: Thu Aug 21 11:33:36 2025
NAMESPACE: cert-manager
STATUS: deployed
REVISION: 1
TEST SUITE: None

Listing releases matching ^cfssl-issuer-crds$
cfssl-issuer-crds	cert-manager	1       	2025-08-21 11:33:36.940871497 +0000 UTC	deployed	cfssl-issuer-crds-0.4.0	0.4.0-1    

Release "external-services" does not exist. Installing it now.
NAME: external-services
LAST DEPLOYED: Thu Aug 21 11:33:36 2025
NAMESPACE: external-services
STATUS: deployed
REVISION: 1
TEST SUITE: None

Listing releases matching ^external-services$
external-services	external-services	1       	2025-08-21 11:33:36.960068306 +0000 UTC	deployed	external-services-0.0.3	           

UPDATED RELEASES:
NAME                           NAMESPACE           CHART                          VERSION   DURATION
priority-classes                                   wmf-stable/raw                 0.3.0           2s
pod-security-policies          kube-system         wmf-stable/raw                 0.3.0           3s
rbac-rules                     kube-system         wmf-stable/raw                 0.3.0           3s
calico-crds                    kube-system         wmf-stable/calico-crds         0.2.0           7s
namespaces                     kube-system         wmf-stable/raw                 0.3.0          15s
cert-manager-networkpolicies   cert-manager        wmf-stable/raw                 0.3.0           3s
cfssl-issuer-crds              cert-manager        wmf-stable/cfssl-issuer-crds   0.4.0           5s
external-services              external-services   wmf-stable/external-services   0.0.3          14s

Verify

root@deploy1003:~# kubectl -n kube-system get deployment,daemonset
NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/calico-kube-controllers   0/1     1            0           21s
deployment.apps/calico-typha              0/3     3            0           21s

NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/calico-node   4         4         0       4            0           kubernetes.io/os=linux   22s
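
To make the "nothing is ready yet" reading of this output explicit, here is a small Python sketch that parses the text of `kubectl get deployment,daemonset` and lists the workloads whose READY count is below the desired count. It is an illustration only: the function name is hypothetical, and it assumes the default column layout shown above (a `READY` column in `have/want` form for Deployments, and separate `DESIRED` and `READY` columns for DaemonSets).

```python
def not_ready(kubectl_output: str) -> list:
    """Return names of workloads whose ready count is below the desired count."""
    pending = []
    header = None
    for line in kubectl_output.splitlines():
        fields = line.split()
        if not fields:
            header = None  # a blank line separates the per-kind tables
            continue
        if fields[0] == "NAME":
            header = fields
            continue
        if header is None:
            continue
        ready = fields[header.index("READY")]
        if "/" in ready:
            # Deployment style: READY is "have/want"
            have, want = (int(part) for part in ready.split("/"))
        else:
            # DaemonSet style: READY is a bare count, compared with DESIRED
            have = int(ready)
            want = int(fields[header.index("DESIRED")])
        if have < want:
            pending.append(fields[0])
    return pending

# Output copied from the verification step above.
SAMPLE = """\
NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/calico-kube-controllers   0/1     1            0           21s
deployment.apps/calico-typha              0/3     3            0           21s

NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/calico-node   4         4         0       4            0           kubernetes.io/os=linux   22s
"""

if __name__ == "__main__":
    for name in not_ready(SAMPLE):
        print(name)
```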

coredns

The sync seems to be stuck at this point, with only the following releases done:
NAME: priority-classes (done)
NAME: pod-security-policies (done)
NAME: rbac-rules (done)
NAME: calico-crds (done)
NAME: namespaces (done)
NAME: cert-manager-networkpolicies (done)
NAME: cfssl-issuer-crds (done)
NAME: external-services (done)

Set the BGP value to true for the dse-k8s-codfw ctrl and worker hosts then ran homer from a cumin host

stevemunene@cumin1003:~$ sudo homer "cr*codfw*" diff
INFO:homer.devices:Initialized 105 devices
INFO:homer:Generating diff for query cr*codfw*
INFO:homer.devices:Matched 2 device(s) for query 'cr*codfw*'
INFO:homer:Generating configuration for cr1-codfw.wikimedia.org
INFO:homer.transports.junos:Running commit check on cr1-codfw.wikimedia.org
INFO:homer:Generating configuration for cr2-codfw.wikimedia.org
INFO:homer.transports.junos:Running commit check on cr2-codfw.wikimedia.org
Changes for 1 devices: ['cr1-codfw.wikimedia.org']

[edit policy-options]
+   prefix-list kubedse-pod-ips4 {
+       10.192.96.0/21;
+   }
+   prefix-list kubedse-pod-ips6 {
+       2620:0:860:308::/64;
+   }
[edit policy-options]
+   policy-statement kubedse_import {
+       term pod_ips4 {
+           from {
+               family inet;
+               protocol bgp;
+               prefix-list-filter kubedse-pod-ips4 longer;
+           }
+           then accept;
+       }
+       term pod_ips6 {
+           from {
+               family inet6;
+               protocol bgp;
+               prefix-list-filter kubedse-pod-ips6 longer;
+           }
+           then accept;
+       }
+       then reject;
+   }
[edit protocols bgp]
     group k8s-aux-ipv6 { ... }
+    group Kubedse4 {
+        type external;
+        multihop {
+            /* T328523 */
+            no-nexthop-change;
+        }
+        local-address 208.80.153.192;
+        hold-time 30;
+        /* T328523 */
+        advertise-peer-as;
+        import kubedse_import;
+        family inet {
+            unicast {
+                prefix-limit {
+                    maximum 50;
+                    teardown {
+                        80;
+                        idle-timeout forever;
+                    }
+                }
+            }
+        }
+        /* T328523 */
+        export kubernetes_export;
+        peer-as 64613;
+        multipath;
+        neighbor 10.192.32.6 {
+            description dse-k8s-ctrl2001;
+        }
+        neighbor 10.192.48.13 {
+            description dse-k8s-ctrl2002;
+        }
+    }
+    group Kubedse6 {
+        type external;
+        multihop {
+            /* T328523 */
+            no-nexthop-change;
+        }
+        local-address 2620:0:860:ffff::1;
+        hold-time 30;
+        /* T328523 */
+        advertise-peer-as;
+        import kubedse_import;
+        family inet6 {
+            unicast {
+                prefix-limit {
+                    maximum 50;
+                    teardown {
+                        80;
+                        idle-timeout forever;
+                    }
+                }
+            }
+        }
+        /* T328523 */
+        export kubernetes_export;
+        peer-as 64613;
+        multipath;
+        neighbor 2620:0:860:103:10:192:32:6 {
+            description dse-k8s-ctrl2001;
+        }
+        neighbor 2620:0:860:104:10:192:48:13 {
+            description dse-k8s-ctrl2002;
+        }
+    }

---------------
Changes for 1 devices: ['cr2-codfw.wikimedia.org']

[edit policy-options]
+   prefix-list kubedse-pod-ips4 {
+       10.192.96.0/21;
+   }
+   prefix-list kubedse-pod-ips6 {
+       2620:0:860:308::/64;
+   }
[edit policy-options]
+   policy-statement kubedse_import {
+       term pod_ips4 {
+           from {
+               family inet;
+               protocol bgp;
+               prefix-list-filter kubedse-pod-ips4 longer;
+           }
+           then accept;
+       }
+       term pod_ips6 {
+           from {
+               family inet6;
+               protocol bgp;
+               prefix-list-filter kubedse-pod-ips6 longer;
+           }
+           then accept;
+       }
+       then reject;
+   }
[edit protocols bgp]
     group k8s-aux-ipv6 { ... }
+    group Kubedse4 {
+        type external;
+        multihop {
+            /* T328523 */
+            no-nexthop-change;
+        }
+        local-address 208.80.153.193;
+        hold-time 30;
+        /* T328523 */
+        advertise-peer-as;
+        import kubedse_import;
+        family inet {
+            unicast {
+                prefix-limit {
+                    maximum 50;
+                    teardown {
+                        80;
+                        idle-timeout forever;
+                    }
+                }
+            }
+        }
+        /* T328523 */
+        export kubernetes_export;
+        peer-as 64613;
+        multipath;
+        neighbor 10.192.32.6 {
+            description dse-k8s-ctrl2001;
+        }
+        neighbor 10.192.48.13 {
+            description dse-k8s-ctrl2002;
+        }
+    }
+    group Kubedse6 {
+        type external;
+        multihop {
+            /* T328523 */
+            no-nexthop-change;
+        }
+        local-address 2620:0:860:ffff::2;
+        hold-time 30;
+        /* T328523 */
+        advertise-peer-as;
+        import kubedse_import;
+        family inet6 {
+            unicast {
+                prefix-limit {
+                    maximum 50;
+                    teardown {
+                        80;
+                        idle-timeout forever;
+                    }
+                }
+            }
+        }
+        /* T328523 */
+        export kubernetes_export;
+        peer-as 64613;
+        multipath;
+        neighbor 2620:0:860:103:10:192:32:6 {
+            description dse-k8s-ctrl2001;
+        }
+        neighbor 2620:0:860:104:10:192:48:13 {
+            description dse-k8s-ctrl2002;
+        }
+    }

---------------
INFO:homer:Homer run completed successfully on 2 devices: ['cr1-codfw.wikimedia.org', 'cr2-codfw.wikimedia.org']
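
The `kubedse_import` policy in the diff accepts BGP routes that match the pod prefix lists with the `longer` match type, that is, routes that fall inside the listed prefix and have a strictly greater prefix length, and rejects everything else. The sketch below reproduces that match rule in Python so candidate routes can be checked against the two prefixes from the diff. The function name and the example routes are hypothetical; only the two prefix-list values come from the diff.

```python
import ipaddress

# Prefix-list values taken from the homer diff above.
POD_PREFIX_V4 = "10.192.96.0/21"
POD_PREFIX_V6 = "2620:0:860:308::/64"

def matches_longer(route: str, prefix: str) -> bool:
    """Emulate a `longer` match: the route lies inside the prefix and is more specific."""
    r = ipaddress.ip_network(route)
    p = ipaddress.ip_network(prefix)
    # Check the address family first: subnet_of() raises TypeError on mixed versions.
    return r.version == p.version and r.prefixlen > p.prefixlen and r.subnet_of(p)

if __name__ == "__main__":
    # Hypothetical routes, chosen only to exercise the rule.
    for route in ("10.192.96.64/26", "10.192.96.0/21", "10.192.104.0/26"):
        print(route, matches_longer(route, POD_PREFIX_V4))
```

Under this rule, a route equal to the prefix itself would not match (that would need `orlonger`), and a route outside 10.192.96.0 to 10.192.103.255 would fall through to the final `then reject`.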

Retried the sync but we still seem to be timing out during the calico sync

root@deploy1003:/srv/deployment-charts/helmfile.d/admin_ng# kubectl -n kube-system get deployment,daemonset
NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/calico-kube-controllers   0/1     1            0           20m
deployment.apps/calico-typha              0/3     3            0           20m

NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/calico-node   4         4         0       4            0           kubernetes.io/os=linux   20m
root@deploy1003:/srv/deployment-charts/helmfile.d/admin_ng#
UPDATED RELEASES:
NAME                           NAMESPACE           CHART                          VERSION   DURATION
priority-classes                                   wmf-stable/raw                 0.3.0           2s
pod-security-policies          kube-system         wmf-stable/raw                 0.3.0           3s
rbac-rules                     kube-system         wmf-stable/raw                 0.3.0           4s
calico-crds                    kube-system         wmf-stable/calico-crds         0.2.0           6s
namespaces                     kube-system         wmf-stable/raw                 0.3.0          38s
cert-manager-networkpolicies   cert-manager        wmf-stable/raw                 0.3.0           2s
cfssl-issuer-crds              cert-manager        wmf-stable/cfssl-issuer-crds   0.4.0           2s
external-services              external-services   wmf-stable/external-services   0.0.3          29s


FAILED RELEASES:
NAME     NAMESPACE     CHART               VERSION   DURATION
calico   kube-system   wmf-stable/calico              1h0m10s

in ./helmfile.yaml: failed processing release calico: command "/usr/bin/helm3.11" exited with non-zero status:

PATH:
  /usr/bin/helm3.11

ARGS:
  0: helm3.11 (8 bytes)
  1: upgrade (7 bytes)
  2: --install (9 bytes)
  3: calico (6 bytes)
  4: wmf-stable/calico (17 bytes)
  5: --version (9 bytes)
  6: 0.2.10 (6 bytes)
  7: --timeout (9 bytes)
  8: 3600s (5 bytes)
  9: --atomic (8 bytes)
  10: --namespace (11 bytes)
  11: kube-system (11 bytes)
  12: --values (8 bytes)
  13: /tmp/helmfile660272390/kube-system-calico-values-86678f6766 (59 bytes)
  14: --values (8 bytes)
  15: /tmp/helmfile2795767946/kube-system-calico-values-58df9c9d49 (60 bytes)
  16: --values (8 bytes)
  17: /tmp/helmfile4155081487/kube-system-calico-values-7d747b5df4 (60 bytes)
  18: --values (8 bytes)
  19: /tmp/helmfile3135445841/kube-system-calico-values-5d6b7986b5 (60 bytes)
  20: --values (8 bytes)
  21: /tmp/helmfile4267648165/kube-system-calico-values-6bd75857dd (60 bytes)
  22: --reset-values (14 bytes)
  23: --history-max (13 bytes)
  24: 10 (2 bytes)
  25: --kubeconfig=/etc/kubernetes/admin-dse-k8s-codfw.config (55 bytes)

ERROR:
  exit status 1

EXIT STATUS
  1

STDERR:
  Error: release calico failed, and has been uninstalled due to atomic being set: timed out waiting for the condition

COMBINED OUTPUT:
  Release "calico" does not exist. Installing it now.
  Error: release calico failed, and has been uninstalled due to atomic being set: timed out waiting for the condition

Icinga downtime and Alertmanager silence (ID=c6eb0c23-f05e-45b4-a6a9-b2e46c3fb650) set by stevemunene@cumin1003 for 5 days, 0:00:00 on 1 host(s) and their services with reason: Bootstrapping new dse-k8s-codfw-cluster

dse-k8s-ctrl2001.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=bc6601ce-2c4a-41f7-9f7c-f1a1e791510a) set by stevemunene@cumin1003 for 5 days, 0:00:00 on 1 host(s) and their services with reason: Bootstrapping new dse-k8s-codfw-cluster

dse-k8s-ctrl2002.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=01fe34a9-8971-4bec-88c1-896cf83621fb) set by stevemunene@cumin1003 for 5 days, 0:00:00 on 1 host(s) and their services with reason: Bootstrapping new dse-k8s-codfw-cluster

dse-k8s-worker2002.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=ff14ab31-b158-4556-89d2-84238a73e064) set by stevemunene@cumin1003 for 5 days, 0:00:00 on 1 host(s) and their services with reason: Bootstrapping new dse-k8s-codfw-cluster

dse-k8s-worker2001.codfw.wmnet

Change #1183691 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] dse-k8s: disable cluster_dns to allow core-dns deploy.

https://gerrit.wikimedia.org/r/1183691

Checking the logs with kubectl logs -n kube-system -l k8s-app=calico-node --tail=100 --all-containers=true, we seem to be missing some bird config:

2025-09-02 07:56:18.178 [FATAL][770] confd/startsyncerclient.go 48: Typha discovery enabled but discovery failed. error=Kubernetes service missing IP or port
bird: Unable to open configuration file /etc/calico/confd/config/bird.cfg: No such file or directory
bird: Unable to open configuration file /etc/calico/confd/config/bird6.cfg: No such file or directory
2025-09-02 07:56:18.623 [ERROR][58] felix/discovery.go 174: Didn't find any ready Typha instances.
2025-09-02 07:56:18.624 [ERROR][58] felix/daemon.go 336: Typha discovery enabled but discovery failed. error=Kubernetes service missing IP or port
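
When tailing all containers at once, the FATAL and ERROR lines are easy to lose among the rest. The following Python sketch filters them out of the combined log text. The regular expression is an assumption based only on the line format visible in the excerpt above (timestamp, `[LEVEL][pid]`, source file and line number, then the message); lines that do not fit it, such as the plain `bird:` messages, are skipped.

```python
import re

# Assumed format, inferred from the excerpt:
#   2025-09-02 07:56:18.178 [FATAL][770] confd/startsyncerclient.go 48: message
LOG_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"\[(?P<level>[A-Z]+)\]\[(?P<pid>\d+)\] "
    r"(?P<source>\S+ \d+): (?P<message>.*)$"
)

def severe_entries(log_text: str) -> list:
    """Return (level, source, message) tuples for FATAL and ERROR lines."""
    entries = []
    for line in log_text.splitlines():
        match = LOG_RE.match(line.strip())
        if match and match.group("level") in ("FATAL", "ERROR"):
            entries.append(
                (match.group("level"), match.group("source"), match.group("message"))
            )
    return entries

# Log lines copied from the excerpt above.
SAMPLE = """\
2025-09-02 07:56:18.178 [FATAL][770] confd/startsyncerclient.go 48: Typha discovery enabled but discovery failed. error=Kubernetes service missing IP or port
bird: Unable to open configuration file /etc/calico/confd/config/bird.cfg: No such file or directory
bird: Unable to open configuration file /etc/calico/confd/config/bird6.cfg: No such file or directory
2025-09-02 07:56:18.623 [ERROR][58] felix/discovery.go 174: Didn't find any ready Typha instances.
2025-09-02 07:56:18.624 [ERROR][58] felix/daemon.go 336: Typha discovery enabled but discovery failed. error=Kubernetes service missing IP or port
"""

if __name__ == "__main__":
    for level, source, message in severe_entries(SAMPLE):
        print(level, source, message)
```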

Change #1184059 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/deployment-charts@master] dse-k8s: Upgrade dse-k8s-codfw to v1.31 unpin charts

https://gerrit.wikimedia.org/r/1184059

Change #1184060 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/deployment-charts@master] dse-k8s:Upgrade dse-k8s-codfw to v1.31 deploy latestcoredns

https://gerrit.wikimedia.org/r/1184060

Change #1184061 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/deployment-charts@master] dse-k8s:Upgrade dse-k8s-codfw to v1.31 update certmanager

https://gerrit.wikimedia.org/r/1184061

Change #1184059 merged by jenkins-bot:

[operations/deployment-charts@master] dse-k8s: Upgrade dse-k8s-codfw to v1.31 unpin charts

https://gerrit.wikimedia.org/r/1184059

Change #1184512 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add cumin aliases for dse-k8s in both eqiad and codfw

https://gerrit.wikimedia.org/r/1184512

Change #1184060 merged by jenkins-bot:

[operations/deployment-charts@master] dse-k8s:Upgrade dse-k8s-codfw to v1.31 deploy latestcoredns

https://gerrit.wikimedia.org/r/1184060

Change #1184061 merged by jenkins-bot:

[operations/deployment-charts@master] dse-k8s:Upgrade dse-k8s-codfw to v1.31 update certmanager

https://gerrit.wikimedia.org/r/1184061

Change #1184512 merged by Btullis:

[operations/puppet@production] Add cumin aliases for dse-k8s in both eqiad and codfw

https://gerrit.wikimedia.org/r/1184512

Change #1184520 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/cookbooks@master] Add the dse-k8s-codfw cluster to the ks8 cookbooks

https://gerrit.wikimedia.org/r/1184520

Change #1184520 merged by jenkins-bot:

[operations/cookbooks@master] Add the dse-k8s-codfw cluster to the k8s cookbooks

https://gerrit.wikimedia.org/r/1184520

Change #1185056 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove a stray reference to PSP in dse-k8s-codfw

https://gerrit.wikimedia.org/r/1185056

Change #1185056 merged by Btullis:

[operations/puppet@production] Remove a stray reference to PSP in dse-k8s-codfw

https://gerrit.wikimedia.org/r/1185056

We have upgraded the cluster to version 1.31.4 but we are still seeing an error initializing the calico-node pods. This seems to be BGP related.

2m17s       Normal    SuccessfulCreate         replicaset/calico-typha-7c5b67b766              Created pod: calico-typha-7c5b67b766-q8n2n
2m17s       Normal    SuccessfulCreate         replicaset/calico-typha-7c5b67b766              Created pod: calico-typha-7c5b67b766-2twkn
2m17s       Normal    SuccessfulCreate         replicaset/calico-typha-7c5b67b766              Created pod: calico-typha-7c5b67b766-wvsvw
2m17s       Normal    NoPods                   poddisruptionbudget/calico-typha                No matching pods found
2m17s       Normal    ScalingReplicaSet        deployment/calico-typha                         Scaled up replica set calico-typha-7c5b67b766 to 3
8m36s       Normal    LeaderElection           lease/kube-controller-manager                   dse-k8s-ctrl2001_c2f69a55-c1d6-4798-b5a0-3983ef3cf3d1 became leader
7m14s       Normal    LeaderElection           lease/kube-controller-manager                   dse-k8s-ctrl2001_f227488f-0ba1-4faa-bce4-fac94c35d2f5 became leader
8m37s       Normal    LeaderElection           lease/kube-scheduler                            dse-k8s-ctrl2001_941937fa-a630-4707-a06c-fb166e0eed45 became leader
7m30s       Normal    LeaderElection           lease/kube-scheduler                            dse-k8s-ctrl2001_abe3b3ff-c978-4379-a8d6-109d5c229933 became leader
0s          Warning   Unhealthy                pod/calico-node-lpb56                           (combined from similar events): Readiness probe failed: 2025-09-05 09:48:34.831 [INFO][1236] node/health.go 202: Number of node(s) with BGP peering established = 0...
0s          Warning   Unhealthy                pod/calico-node-mr98p                           (combined from similar events): Readiness probe failed: 2025-09-05 09:48:34.842 [INFO][1246] node/health.go 202: Number of node(s) with BGP peering established = 0...

Change #1185061 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Correct the ASN for the dse-k8s-codfw cluster

https://gerrit.wikimedia.org/r/1185061

Change #1183691 abandoned by Btullis:

[operations/puppet@production] dse-k8s: disable cluster_dns to allow core-dns deploy.

Reason:

Not needed now.

https://gerrit.wikimedia.org/r/1183691

Change #1185061 merged by jenkins-bot:

[operations/deployment-charts@master] Correct the ASN for the dse-k8s-codfw cluster

https://gerrit.wikimedia.org/r/1185061

Change #1185066 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the ASN value for the dse-k8s-codfw cluster

https://gerrit.wikimedia.org/r/1185066

Change #1185066 merged by jenkins-bot:

[operations/deployment-charts@master] Update the ASN value for the dse-k8s-codfw cluster

https://gerrit.wikimedia.org/r/1185066

It's all installed and upgraded to version 1.31.4.

root@deploy1003:/srv/deployment-charts/helmfile.d/admin_ng# helm list -A
NAME                         	NAMESPACE        	REVISION	UPDATED                                	STATUS  	CHART                              	APP VERSION
calico                       	kube-system      	2       	2025-09-05 11:32:39.731474635 +0000 UTC	deployed	calico-0.4.0                       	3.29.1     
calico-crds                  	kube-system      	4       	2025-09-05 11:31:56.196732418 +0000 UTC	deployed	calico-crds-0.3.0                  	3.29.1     
cert-manager                 	cert-manager     	2       	2025-09-05 11:33:20.357349399 +0000 UTC	deployed	cert-manager-1.16.4                	1.16.3-1   
cert-manager-networkpolicies 	cert-manager     	4       	2025-09-05 11:32:39.694306569 +0000 UTC	deployed	raw-0.3.0                          	0.2.3      
cfssl-issuer                 	cert-manager     	1       	2025-09-05 11:33:32.7674171 +0000 UTC  	deployed	cfssl-issuer-0.4.4                 	0.4.0-1    
cfssl-issuer-crds            	cert-manager     	4       	2025-09-05 11:32:39.733692847 +0000 UTC	deployed	cfssl-issuer-crds-0.4.0            	0.4.0-1    
coredns                      	kube-system      	2       	2025-09-05 11:33:16.989275765 +0000 UTC	deployed	coredns-0.5.0                      	1.11.3     
eventrouter                  	kube-system      	2       	2025-09-05 11:33:19.967151192 +0000 UTC	deployed	eventrouter-0.4.4                  	0.4        
external-services            	external-services	4       	2025-09-05 11:32:39.780736807 +0000 UTC	deployed	external-services-0.0.3            	           
helm-state-metrics           	kube-system      	2       	2025-09-05 11:33:20.050494995 +0000 UTC	deployed	helm-state-metrics-0.2.2           	v0.2.0     
kube-state-metrics           	kube-system      	2       	2025-09-05 11:33:20.094009192 +0000 UTC	deployed	kube-state-metrics-5.10.3          	           
namespace-certificates       	istio-system     	2       	2025-09-05 11:33:32.792246078 +0000 UTC	deployed	raw-0.3.0                          	0.2.3      
namespaces                   	kube-system      	4       	2025-09-05 11:31:56.164724977 +0000 UTC	deployed	raw-0.3.0                          	0.2.3      
priority-classes             	default          	4       	2025-09-05 11:31:55.968242685 +0000 UTC	deployed	raw-0.3.0                          	0.2.3      
rbac-rules                   	kube-system      	4       	2025-09-05 11:31:56.012729696 +0000 UTC	deployed	raw-0.3.0                          	0.2.3      
validating-admission-policies	kube-system      	4       	2025-09-05 11:31:56.048074424 +0000 UTC	deployed	validating-admission-policies-0.3.0	0.1.0
root@deploy1003:/srv/deployment-charts/helmfile.d/admin_ng# kubectl get nodes
NAME                             STATUS   ROLES           AGE    VERSION
dse-k8s-ctrl2001.codfw.wmnet     Ready    control-plane   121m   v1.31.4
dse-k8s-ctrl2002.codfw.wmnet     Ready    control-plane   120m   v1.31.4
dse-k8s-worker2001.codfw.wmnet   Ready    <none>          120m   v1.31.4
dse-k8s-worker2002.codfw.wmnet   Ready    <none>          120m   v1.31.4
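
As a quick sanity check on the node listing above, this Python sketch parses the text of `kubectl get nodes` and returns any nodes that are not `Ready` together with the set of kubelet versions seen. It is illustrative only: the function name is hypothetical, and it assumes the default column layout shown above.

```python
def summarize_nodes(kubectl_output: str):
    """Return (nodes whose STATUS is not exactly 'Ready', set of reported versions)."""
    rows = [line.split() for line in kubectl_output.splitlines() if line.strip()]
    header, nodes = rows[0], rows[1:]
    status_col = header.index("STATUS")
    version_col = header.index("VERSION")
    not_ready = [node[0] for node in nodes if node[status_col] != "Ready"]
    versions = {node[version_col] for node in nodes}
    return not_ready, versions

# Output copied from the listing above.
SAMPLE = """\
NAME                             STATUS   ROLES           AGE    VERSION
dse-k8s-ctrl2001.codfw.wmnet     Ready    control-plane   121m   v1.31.4
dse-k8s-ctrl2002.codfw.wmnet     Ready    control-plane   120m   v1.31.4
dse-k8s-worker2001.codfw.wmnet   Ready    <none>          120m   v1.31.4
dse-k8s-worker2002.codfw.wmnet   Ready    <none>          120m   v1.31.4
"""

if __name__ == "__main__":
    not_ready, versions = summarize_nodes(SAMPLE)
    print("not ready:", not_ready)
    print("versions:", sorted(versions))
```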

Change #1185074 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Enabled prometheus support for dse-k8s-codfw

https://gerrit.wikimedia.org/r/1185074

Change #1185703 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Add prometheus hosts to scrape dse-k8s-codfw cluster

https://gerrit.wikimedia.org/r/1185703

Change #1185704 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] grafana: Add dse-k8s-codfw prometheus data source

https://gerrit.wikimedia.org/r/1185704

Change #1185703 abandoned by Stevemunene:

[operations/puppet@production] Add prometheus hosts to scrape dse-k8s-codfw cluster

Reason:

Abandoned in favour of I9a73be9790eb1fb5b3b0fb18d36c1e715aaf386f done earlier

https://gerrit.wikimedia.org/r/1185703

Change #1185704 abandoned by Stevemunene:

[operations/puppet@production] grafana: Add dse-k8s-codfw prometheus data source

Reason:

Abandoned in favour of I9a73be9790eb1fb5b3b0fb18d36c1e715aaf386f done earlier

https://gerrit.wikimedia.org/r/1185704

Change #1185074 merged by Btullis:

[operations/puppet@production] Enable prometheus support for dse-k8s-codfw

https://gerrit.wikimedia.org/r/1185074

The relevant dse-k8s-codfw dashboards are now visible on Grafana, so we can close the bootstrap task as done per the criteria.
The next steps are splitting the namespaces between eqiad and codfw, and the Ceph integration.

https://grafana.wikimedia.org/goto/aU9IBwrNR?orgId=1

Change #1188296 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove some services from dse-k8s-codfw that should not be deployed

https://gerrit.wikimedia.org/r/1188296

Change #1188296 abandoned by Btullis:

[operations/puppet@production] Remove some services from dse-k8s-codfw that should not be deployed

Reason:

Achieved in a different manner

https://gerrit.wikimedia.org/r/1188296