aux-k8s-codfw cluster setup
Closed, Resolved, Public

Description

High-level checklist for the aux k8s codfw cluster setup (based on https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New)

Details

Related Changes in Gerrit:
Repo                          Branch       Lines +/-
operations/puppet             production   +2 -1
operations/puppet             production   +4 -2
operations/dns                master       +5 -0
operations/puppet             production   +4 -0
operations/dns                master       +3 -0
operations/puppet             production   +4 -35
operations/puppet             production   +11 -0
operations/deployment-charts  master       +2 -2
operations/deployment-charts  master       +5 -0
operations/deployment-charts  master       +14 -0
operations/dns                master       +27 -4
operations/puppet             production   +31 -1
operations/puppet             production   +4 -35
operations/puppet             production   +4 -45
operations/dns                master       +4 -3
operations/alerts             master       +4 -2
operations/puppet             production   +90 -2
operations/puppet             production   +0 -3
operations/puppet             production   +3 -0
operations/dns                master       +8 -0
operations/puppet             production   +2 -1
operations/deployment-charts  master       +21 -0

Event Timeline

Change #1100153 had a related patch set uploaded (by Herron; author: Herron):

[operations/deployment-charts@master] add aux-k8s-codfw cluster

https://gerrit.wikimedia.org/r/1100153

CDanis triaged this task as Medium priority. Dec 16 2024, 3:20 PM

Change #1100153 merged by jenkins-bot:

[operations/deployment-charts@master] add aux-k8s-codfw cluster

https://gerrit.wikimedia.org/r/1100153

Change #1116825 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] aux_k8s: apply etcd_aux_k8s role to aux-k8s-etcd200[345] nodes

https://gerrit.wikimedia.org/r/1116825

Change #1116825 merged by Herron:

[operations/puppet@production] aux_k8s: apply etcd_aux_k8s role to aux-k8s-etcd200[345] nodes

https://gerrit.wikimedia.org/r/1116825

I missed adding the new SRV records before merging the puppet patch, so etcd fails SRV discovery on startup (log below). Will work on that next.

Feb 03 18:47:22 aux-k8s-etcd2003 etcd[653149]: couldn't resolve during SRV discovery (error querying DNS SRV records for _etcd-server-ssl lookup _etcd-server-ssl._tcp.aux-k8s-etcd.codfw.wmnet on 10.3.0.1:53: no such host)
Feb 03 18:47:22 aux-k8s-etcd2003 etcd[653149]: error setting up initial cluster: error querying DNS SRV records for _etcd-server-ssl lookup _etcd-server-ssl._tcp.aux-k8s-etcd.codfw.wmnet on 10.3.0.1:53: no such host
Feb 03 18:47:22 aux-k8s-etcd2003 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Feb 03 18:47:22 aux-k8s-etcd2003 systemd[1]: etcd.service: Failed with result 'exit-code'.
Feb 03 18:47:22 aux-k8s-etcd2003 systemd[1]: Failed to start etcd.service - etcd - highly-available key value store.
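A pre-flight check along these lines could catch the missing records before the role is applied. This is only a sketch: `peer_srv_name` and `srv_ready` are hypothetical helper names, and the `lookup` callable is an assumption standing in for whatever resolver is available. The SRV name format is taken from the log above.

```python
from typing import Callable, Sequence


def peer_srv_name(discovery_domain: str) -> str:
    """Build the TLS peer-discovery SRV name that etcd queried in the log above."""
    return f"_etcd-server-ssl._tcp.{discovery_domain.strip('.')}"


def srv_ready(discovery_domain: str,
              lookup: Callable[[str], Sequence[str]]) -> bool:
    """Return True if the SRV name resolves to at least one target.

    `lookup` is a hypothetical callable (an assumption, not part of any
    existing tooling) that takes an SRV name and returns its targets,
    or an empty sequence when the name does not exist.
    """
    return len(lookup(peer_srv_name(discovery_domain))) > 0


if __name__ == "__main__":
    # Fake resolver with no records, mirroring the "no such host" state.
    empty_resolver = lambda name: []
    domain = "aux-k8s-etcd.codfw.wmnet"
    if not srv_ready(domain, empty_resolver):
        print(f"missing SRV records for {peer_srv_name(domain)}")
```

Injecting the resolver keeps the check testable without touching real DNS; in practice `lookup` would wrap an actual SRV query.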

Change #1116867 had a related patch set uploaded (by Herron; author: Herron):

[operations/dns@master] wmnet: add codfw aux-k8s-etcd SRV records

https://gerrit.wikimedia.org/r/1116867

Change #1116867 merged by Herron:

[operations/dns@master] wmnet: add codfw aux-k8s-etcd SRV records

https://gerrit.wikimedia.org/r/1116867

Change #1117219 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] aux-k8s-etcd: bootstrap codfw cluster

https://gerrit.wikimedia.org/r/1117219

Change #1117219 merged by Herron:

[operations/puppet@production] aux-k8s-etcd: bootstrap codfw cluster

https://gerrit.wikimedia.org/r/1117219

Change #1117235 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] aux-k8s-etcd: set bootstrap false

https://gerrit.wikimedia.org/r/1117235

Change #1117235 merged by Herron:

[operations/puppet@production] aux-k8s-etcd: set bootstrap false

https://gerrit.wikimedia.org/r/1117235

Change #1122170 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] aux-k8s-ctrl codfw: apply role

https://gerrit.wikimedia.org/r/1122170

Change #1122170 merged by Herron:

[operations/puppet@production] aux-k8s-ctrl codfw: apply role

https://gerrit.wikimedia.org/r/1122170

Change #1123426 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] aux-k8s-ctrl codfw: enable lvs

https://gerrit.wikimedia.org/r/1123426

Change #1123434 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] aux-k8s-worker: deploy role to codfw workers

https://gerrit.wikimedia.org/r/1123434

Change #1124179 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] aux-k8s codfw: enable worker ingress

https://gerrit.wikimedia.org/r/1124179

Change #1123434 merged by Herron:

[operations/puppet@production] aux-k8s-worker: deploy role to codfw workers

https://gerrit.wikimedia.org/r/1123434

Change #1124453 had a related patch set uploaded (by Herron; author: Herron):

[operations/alerts@master] KubernetesRsyslogDown: bump threshold to 15m

https://gerrit.wikimedia.org/r/1124453

Change #1124453 abandoned by Herron:

[operations/alerts@master] KubernetesRsyslogDown: alert only if logs were sent before

Reason:

going with doc/process update instead

https://gerrit.wikimedia.org/r/1124453

Change #1126093 had a related patch set uploaded (by Herron; author: Herron):

[operations/dns@master] dns: add aux-k8s ingress/ctrl vips

https://gerrit.wikimedia.org/r/1126093

Change #1126093 merged by Herron:

[operations/dns@master] dns: add aux-k8s ingress/ctrl vips

https://gerrit.wikimedia.org/r/1126093

Change #1123426 merged by Herron:

[operations/puppet@production] aux-k8s-ctrl codfw: enable lvs

https://gerrit.wikimedia.org/r/1123426

Mentioned in SAL (#wikimedia-operations) [2025-03-10T17:45:55Z] <sukhe> sudo cumin 'A:lvs-codfw' 'disable-puppet "adding k8s-ingress-aux codfw"' T381417

Change #1124179 merged by Herron:

[operations/puppet@production] aux-k8s codfw: enable worker ingress

https://gerrit.wikimedia.org/r/1124179

Quick status update: the aux-k8s codfw cluster is running and the aux-k8s-ctrl.svc.codfw.wmnet VIP is live.

deploy1003:~# kubectl get nodes
NAME                             STATUS   ROLES           AGE   VERSION
aux-k8s-ctrl2002.codfw.wmnet     Ready    control-plane   21h   v1.23.14
aux-k8s-ctrl2003.codfw.wmnet     Ready    control-plane   21h   v1.23.14
aux-k8s-worker2002.codfw.wmnet   Ready    <none>          21h   v1.23.14
aux-k8s-worker2003.codfw.wmnet   Ready    <none>          21h   v1.23.14
aux-k8s-worker2004.codfw.wmnet   Ready    <none>          21h   v1.23.14
aux-k8s-worker2005.codfw.wmnet   Ready    <none>          21h   v1.23.14
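For a scripted version of the same check, the node table can be parsed to flag anything that is not Ready. This is a minimal sketch, assuming the default `kubectl get nodes` column layout shown above; `not_ready_nodes` is a hypothetical helper, not existing tooling.

```python
from typing import List

# Two rows copied from the output above, used as sample input.
SAMPLE = """\
NAME                             STATUS   ROLES           AGE   VERSION
aux-k8s-ctrl2002.codfw.wmnet     Ready    control-plane   21h   v1.23.14
aux-k8s-worker2002.codfw.wmnet   Ready    <none>          21h   v1.23.14
"""


def not_ready_nodes(kubectl_output: str) -> List[str]:
    """Return the names of nodes whose STATUS does not include Ready."""
    lines = [ln for ln in kubectl_output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")
    bad = []
    for ln in lines[1:]:
        fields = ln.split()
        # STATUS can be a comma-joined list such as "Ready,SchedulingDisabled".
        if "Ready" not in fields[status_idx].split(","):
            bad.append(fields[name_idx])
    return bad


if __name__ == "__main__":
    print(not_ready_nodes(SAMPLE))
```

A cordoned node still counts as Ready here; treating `SchedulingDisabled` as a failure would be a separate policy choice.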

Had a false start with https://gerrit.wikimedia.org/r/1124179, which has been reverted, as I now realize more prep is needed before activating ingress LVS for codfw. Thinking out loud: this seems like a step worth decoupling from the worker role application, since it's a challenge to bring everything up quickly enough to avoid triggering LVS alerts.

At any rate, up next is standing up the aux-k8s codfw ingress, to be followed by a retry of the reverted ingress LVS patch.

Change #1126568 had a related patch set uploaded (by Herron; author: Herron):

[operations/deployment-charts@master] add aux-k8s-codfw to environment

https://gerrit.wikimedia.org/r/1126568

Change #1127151 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Add delegations for aux-k8s POD ranges in codfw and missing v6 ones

https://gerrit.wikimedia.org/r/1127151

Change #1126182 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] create k8s-ingress-aux -ro and -rw discovery records, metafo/geodns

https://gerrit.wikimedia.org/r/1126182

Change #1127151 merged by Cathal Mooney:

[operations/dns@master] Add delegations for aux-k8s POD ranges in codfw

https://gerrit.wikimedia.org/r/1127151

Change #1126568 merged by jenkins-bot:

[operations/deployment-charts@master] add aux-k8s-codfw to environment

https://gerrit.wikimedia.org/r/1126568

Change #1128868 had a related patch set uploaded (by Herron; author: Herron):

[operations/deployment-charts@master] jaeger: add aux-k8s-codfw environment

https://gerrit.wikimedia.org/r/1128868

Change #1128869 had a related patch set uploaded (by Herron; author: Herron):

[operations/deployment-charts@master] jaeger: hooks: fix typo

https://gerrit.wikimedia.org/r/1128869

Change #1128868 merged by jenkins-bot:

[operations/deployment-charts@master] jaeger: add aux-k8s-codfw environment

https://gerrit.wikimedia.org/r/1128868

Change #1128869 merged by jenkins-bot:

[operations/deployment-charts@master] jaeger: hooks: fix typo

https://gerrit.wikimedia.org/r/1128869

Change #1126180 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] add k8s ingress service aliases for jaeger in codfw

https://gerrit.wikimedia.org/r/1126180

Change #1128926 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] aux-k8s-codfw: populate network subnet and constants

https://gerrit.wikimedia.org/r/1128926

Change #1128926 merged by Herron:

[operations/puppet@production] aux-k8s-codfw: populate network subnet and constants

https://gerrit.wikimedia.org/r/1128926

Change #1128937 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] aux-k8s codfw: enable worker ingress

https://gerrit.wikimedia.org/r/1128937

Change #1128937 merged by Ssingh:

[operations/puppet@production] aux-k8s codfw: enable worker ingress

https://gerrit.wikimedia.org/r/1128937

herron claimed this task.
herron added a subscriber: ssingh.

Thanks to @ssingh the k8s-ingress-aux.svc.codfw.wmnet LVS is alive!

And with the checklist in the task description now complete, I think we're in good shape to optimistically resolve this setup task.

Change #1126180 merged by Dzahn:

[operations/dns@master] add k8s ingress service aliases for jaeger in codfw

https://gerrit.wikimedia.org/r/1126180

Change #1132587 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] hiera: add aux-k8s-codfw to deployment_server

https://gerrit.wikimedia.org/r/1132587

Change #1132587 merged by Kamila Součková:

[operations/puppet@production] hiera: add aux-k8s-codfw to deployment_server

https://gerrit.wikimedia.org/r/1132587

Change #1126182 merged by Dzahn:

[operations/dns@master] create k8s-ingress-aux -ro and -rw discovery records, metafo/geodns

https://gerrit.wikimedia.org/r/1126182

Change #1133176 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] service: add k8s-ingress-aux-(ro|rw) discovery entries

https://gerrit.wikimedia.org/r/1133176

Change #1133176 merged by Herron:

[operations/puppet@production] service: add k8s-ingress-aux-(ro|rw) discovery entries

https://gerrit.wikimedia.org/r/1133176

Change #1133195 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] conftool-data: update k8s-ingress-aux

https://gerrit.wikimedia.org/r/1133195

Change #1133195 merged by Herron:

[operations/puppet@production] conftool-data: update k8s-ingress-aux

https://gerrit.wikimedia.org/r/1133195