
Toolforge: cleanup legacy kubernetes cluster
Closed, Resolved (Public)

Description

Things to clean up, specifically in the legacy Kubernetes cluster:

  • make sure T246519: "paws-public" tool running 2 custom pods on legacy Kubernetes cluster is solved.
  • tools-proxy nodes usage of the cluster
  • bunch of empty worker nodes.
  • etcd servers (and make sure nothing in paws is using it!)
  • flannel stuff, no longer in use by kubernetes.
  • the entire toollabs module (once the 'master' and workers are down) - they should be the last things actually using it
  • monitoring / prometheus metrics
  • grafana dashboards that only work with the old cluster (they won't get anything from prometheus anymore)
  • any other mechanism interacting with the legacy k8s cluster
  • get rid of old non-sssd docker containers in the registry

Details

Repo                                  Branch      Lines +/-
operations/puppet                     production  +0 -44
operations/puppet                     production  +4 -8
operations/puppet                     production  +0 -80
operations/software/tools-webservice  master      +73 -176
operations/puppet                     production  +0 -12
operations/puppet                     production  +15 -2
operations/puppet                     production  +8 -13
operations/puppet                     production  +3 -1
operations/puppet                     production  +0 -3 K
operations/puppet                     production  +0 -6
operations/puppet                     production  +11 -14
operations/puppet                     production  +0 -23
operations/puppet                     production  +49 -5
operations/puppet                     production  +0 -76
operations/puppet                     production  +1 -38
operations/puppet                     production  +0 -1 K
operations/puppet                     production  +15 -29

Event Timeline

bd808 triaged this task as High priority. Mar 2 2020, 5:39 PM

Mentioned in SAL (#wikimedia-cloud) [2020-03-03T10:51:04Z] <arturo> cordoned/drained all legacy k8s worker nodes except 1001/1002 (T246689)

Mentioned in SAL (#wikimedia-cloud) [2020-03-03T10:54:53Z] <arturo> deleted VMs tools-worker-[1003-1020] (legacy k8s cluster) (T246689)
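For readers unfamiliar with the cordon/drain step logged above, here is a minimal sketch of what it could look like for the deleted range. It only prints the commands; it is not the procedure that was actually run. The helper names are hypothetical, and it assumes the Kubernetes node names match the short VM names, which may not hold (nodes are often registered under their FQDN).

```python
import re


def expand_range(pattern):
    """Expand a 'prefix[lo-hi]' host pattern into a list of host names."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\]", pattern)
    if match is None:
        # No bracketed range: treat the pattern as a single host name.
        return [pattern]
    prefix, low, high = match.group(1), match.group(2), match.group(3)
    width = len(low)  # keep any zero padding used in the pattern
    return [f"{prefix}{n:0{width}d}" for n in range(int(low), int(high) + 1)]


def drain_commands(nodes):
    """Build cordon + drain command lines for each node (not executed here)."""
    commands = []
    for node in nodes:
        # cordon: mark the node unschedulable so no new pods land on it
        commands.append(["kubectl", "cordon", node])
        # drain: evict the pods that are still running there
        commands.append(["kubectl", "drain", node, "--ignore-daemonsets"])
    return commands


nodes = expand_range("tools-worker-[1003-1020]")
for command in drain_commands(nodes):
    print(" ".join(command))
```

Printing the commands instead of running them keeps the sketch safe to try; piping the output to a shell, or swapping `print` for `subprocess.run`, would be the obvious next step once the node names are confirmed.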

Change 576469 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Bstorm):
[operations/puppet@production] toolforge: remove special configuration for kubernetes on proxy servers

https://gerrit.wikimedia.org/r/576469

Change 576469 merged by Bstorm:
[operations/puppet@production] toolforge: remove special configuration for kubernetes on proxy servers

https://gerrit.wikimedia.org/r/576469

Change 576911 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: clean up maintain_kubeusers and legacy proxy puppet code.

https://gerrit.wikimedia.org/r/576911

Change 576911 merged by Bstorm:
[operations/puppet@production] toolforge: clean up maintain_kubeusers and legacy proxy puppet code.

https://gerrit.wikimedia.org/r/576911

Ruled out gridengine as a possible vector for flannel: flannel was only on the proxies and k8s, and it has now been removed from the proxies.

Change 576992 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: remove monitoring for old k8s cluster nodes and flannel etcd

https://gerrit.wikimedia.org/r/576992

Change 576995 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: remove old k8s client material for Jessie

https://gerrit.wikimedia.org/r/576995

Change 576992 merged by Bstorm:
[operations/puppet@production] toolforge: remove monitoring for old k8s cluster nodes and flannel etcd

https://gerrit.wikimedia.org/r/576992

Change 577261 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] tools-prometheus: removing material related to the legacy k8s cluster

https://gerrit.wikimedia.org/r/577261

paws etcd is entirely in a container on the "master" node:

root      1158  1118 17 Jan09 ?        9-17:09:11 kube-apiserver --kubelet-client-key=/etc/kubernetes/pki/apiserver-kubelet-client.key --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --tls-private-key-file=/etc/kubernetes/pki/apiserver.key --advertise-address=172.16.2.205 --service-account-key-file=/etc/kubernetes/pki/sa.pub --tls-cert-file=/etc/kubernetes/pki/apiserver.crt --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt --insecure-port=0 --requestheader-group-headers=X-Remote-Group --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-allowed-names=front-proxy-client --admission-control=Initializers,NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,NodeRestriction,ResourceQuota --requestheader-username-headers=X-Remote-User --service-cluster-ip-range=10.96.0.0/12 --client-ca-file=/etc/kubernetes/pki/ca.crt --kubelet-client-certificate=/etc/kubernetes/pki/apiserver-kubelet-client.crt --secure-port=6443 --proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.crt --proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client.key --enable-bootstrap-token-auth=true --allow-privileged=true --authorization-mode=Node,RBAC --etcd-servers=http://127.0.0.1:2379

So no worries there.
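The same check can be scripted. The sketch below pulls the `--etcd-servers` flag out of a kube-apiserver command line and reports whether every endpoint is a loopback address. The function names are hypothetical and the command line is shortened from the process listing above.

```python
import ipaddress
from urllib.parse import urlparse


def etcd_endpoints(cmdline):
    """Return the endpoints listed in --etcd-servers, or [] if the flag is absent."""
    for arg in cmdline.split():
        if arg.startswith("--etcd-servers="):
            return arg.split("=", 1)[1].split(",")
    return []


def is_loopback(endpoint):
    """True if the endpoint's host is localhost or a loopback IP address."""
    host = urlparse(endpoint).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # A non-IP hostname: it cannot be proven local from the string alone.
        return False


def uses_only_local_etcd(cmdline):
    """True only if the flag is present and every endpoint is loopback."""
    endpoints = etcd_endpoints(cmdline)
    return bool(endpoints) and all(is_loopback(e) for e in endpoints)


# Shortened from the kube-apiserver process listing in this task.
cmdline = (
    "kube-apiserver --secure-port=6443 --authorization-mode=Node,RBAC "
    "--etcd-servers=http://127.0.0.1:2379"
)
print(etcd_endpoints(cmdline), uses_only_local_etcd(cmdline))
# prints: ['http://127.0.0.1:2379'] True
```

A result of `True` only shows that the apiserver talks to an etcd on its own host. It does not by itself confirm that etcd runs in a container, which the comment above establishes separately.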

Change 577279 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge-clush: correct the classifications and remove legacy k8s

https://gerrit.wikimedia.org/r/577279

Change 577261 merged by Bstorm:
[operations/puppet@production] tools-prometheus: removing material related to the legacy k8s cluster

https://gerrit.wikimedia.org/r/577261

That removed prometheus metrics.

Change 577341 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolschecker: Update to monitor the new etcd cluster

https://gerrit.wikimedia.org/r/577341

Change 577341 merged by Bstorm:
[operations/puppet@production] toolschecker: Update to monitor the new etcd cluster

https://gerrit.wikimedia.org/r/577341

Change 576995 merged by Bstorm:
[operations/puppet@production] toolforge: remove old k8s client material for Jessie

https://gerrit.wikimedia.org/r/576995

Change 578408 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/software/tools-webservice@master] Remove temporary code from 2020 Kubernetes migration

https://gerrit.wikimedia.org/r/578408

Change 577279 merged by Bstorm:
[operations/puppet@production] toolforge-clush: correct the classifications and remove legacy k8s

https://gerrit.wikimedia.org/r/577279

Mentioned in SAL (#wikimedia-cloud) [2020-03-16T19:06:10Z] <bstorm_> shutting down toolsbeta-worker-1001, toolsbeta-k8s-master and toolsbeta-k8s-etcd T246689

Mentioned in SAL (#wikimedia-cloud) [2020-03-16T19:07:50Z] <bstorm_> shutting down toolsbeta-flannel-etcd-01 T246689

Mentioned in SAL (#wikimedia-cloud) [2020-03-16T19:45:01Z] <bstorm_> deleting toolsbeta-worker-1001, toolsbeta-k8s-master, toolsbeta-flannel-etcd-01 and toolsbeta-k8s-etcd-01 T246689

Mentioned in SAL (#wikimedia-cloud) [2020-03-16T21:38:28Z] <bstorm_> removed lots of hiera related to the legacy k8s cluster T246689

Mentioned in SAL (#wikimedia-cloud) [2020-03-16T21:59:42Z] <bstorm_> shut down tools-worker-1001 and tools-worker-1002 T246689

Mentioned in SAL (#wikimedia-cloud) [2020-03-16T22:00:23Z] <bstorm_> shut off tools-k8s-master-01 T246689

Mentioned in SAL (#wikimedia-cloud) [2020-03-16T22:01:24Z] <bstorm_> shut off tools-k8s-etcd-01/02/03 T246689

Change 580134 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge-k8s: remove old legacy-cluster code from bastion

https://gerrit.wikimedia.org/r/580134

Change 580134 merged by Bstorm:
[operations/puppet@production] toolforge-k8s: remove old legacy-cluster code from bastion

https://gerrit.wikimedia.org/r/580134

Mentioned in SAL (#wikimedia-cloud) [2020-03-17T00:08:05Z] <bstorm_> shut off tools-flannel-etcd-01/02/03 T246689

Ok, the entire legacy cluster in tools, including flannel, is shut down. I'll leave it overnight and delete more things in the morning. In toolsbeta, things are deleted and hiera values removed.

PAWS is still up and working, so that's good :)

Mentioned in SAL (#wikimedia-cloud) [2020-03-18T16:59:40Z] <bstorm_> deleting "tools-worker-1002", "tools-worker-1001", "tools-k8s-master-01", "tools-flannel-etcd-03", "tools-k8s-etcd-03", "tools-flannel-etcd-02", "tools-k8s-etcd-02", "tools-flannel-etcd-01", "tools-k8s-etcd-01" T246689

By "flannel stuff", I'm going to assume we mean the puppet code.

Mentioned in SAL (#wikimedia-cloud) [2020-03-18T17:36:01Z] <bstorm_> removed lots of deprecated hiera keys from horizon for the old cluster T246689

While testing puppet after stripping out a lot of old hiera, I noticed that the ToolsDB cluster wasn't sending prometheus metrics because its security group needed an update. Fixed that.

Mentioned in SAL (#wikimedia-cloud) [2020-03-18T17:57:39Z] <bstorm_> removed puppet prefix tools-k8s-master T246689

Mentioned in SAL (#wikimedia-cloud) [2020-03-18T17:58:07Z] <bstorm_> removed puppet prefix tools-worker T246689

Mentioned in SAL (#wikimedia-cloud) [2020-03-18T18:04:37Z] <bstorm_> removed puppet prefix tools-flannel-etcd T246689

Change 581056 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: remove the entire toollabs module and all related roles

https://gerrit.wikimedia.org/r/581056

Change 581056 merged by Bstorm:
[operations/puppet@production] toolforge: remove almost entire toollabs module and related roles

https://gerrit.wikimedia.org/r/581056

Change 581647 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: refactor docker builder to remove toollabs module

https://gerrit.wikimedia.org/r/581647

Change 581654 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolserver: refactor into profile and move off "toollabs" name

https://gerrit.wikimedia.org/r/581654

Change 581673 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: fix file location for grid override.my.cnf

https://gerrit.wikimedia.org/r/581673

Change 581673 merged by Bstorm:
[operations/puppet@production] toolforge: fix file location for grid override.my.cnf

https://gerrit.wikimedia.org/r/581673

Change 581647 merged by Bstorm:
[operations/puppet@production] toolforge: refactor docker builder to remove toollabs module

https://gerrit.wikimedia.org/r/581647

Change 581654 merged by Bstorm:
[operations/puppet@production] toolserver: refactor into profile and move off "toollabs" name

https://gerrit.wikimedia.org/r/581654

Change 582065 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: remove the last toollabs role

https://gerrit.wikimedia.org/r/582065

Change 582065 merged by Bstorm:
[operations/puppet@production] toolforge: remove the last toollabs role

https://gerrit.wikimedia.org/r/582065

Change 582090 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] k8s: purge flannel from the environment

https://gerrit.wikimedia.org/r/582090

I also shored up the etcd health dashboard.

It seems like the Kubernetes Cluster dash is totally useless. We have the Toolforge K8s dashboard, which is very nice so far (and newer!).
I think it's just an old one that we can delete https://grafana-labs.wikimedia.org/d/000000001/kubernetes-cluster?orgId=1
@aborrero does that sound right?

Change 578408 merged by jenkins-bot:
[operations/software/tools-webservice@master] Remove temporary code from 2020 Kubernetes migration

https://gerrit.wikimedia.org/r/578408

> It seems like the Kubernetes Cluster dash is totally useless. We have the Toolforge K8s dashboard, which is very nice so far (and newer!).
> I think it's just an old one that we can delete https://grafana-labs.wikimedia.org/d/000000001/kubernetes-cluster?orgId=1
> @aborrero does that sound right?

sounds right!

Change 582090 merged by Bstorm:
[operations/puppet@production] k8s: purge flannel from the environment

https://gerrit.wikimedia.org/r/582090

Change 583165 merged by Bstorm:
[operations/puppet@production] toolforge: clean up the now-unneeded ferm_handlers

https://gerrit.wikimedia.org/r/583165

Change 583170 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge cleanup: remove the ferm_handlers profile

https://gerrit.wikimedia.org/r/583170

Change 583170 merged by Bstorm:
[operations/puppet@production] toolforge cleanup: remove the ferm_handlers profile

https://gerrit.wikimedia.org/r/583170

Bstorm lowered the priority of this task from High to Medium. Mar 25 2020, 5:36 PM
Bstorm updated the task description.

All high-priority portions of this task are done. The rest is just cleanup that doesn't block anything else.

I've deleted the Kubernetes Cluster dashboard (redundant at best, but also broken).

Mentioned in SAL (#wikimedia-cloud) [2020-03-30T23:42:21Z] <bstorm_> deleted "Kubernetes Cluster" and "Kubernetes Performance" dashboards T246689

Apparently, to remove docker tags/images from Docker Hub, which seems to use a very similar API to our registry, you'd just literally send DELETE requests against those URLs. Sounds like time to experiment!
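As a starting point for that experiment, here is a minimal sketch against the Docker Registry HTTP API v2, assuming our registry implements it. In that API a manifest is deleted by digest, not by tag, so the tag is first resolved to a digest through the `Docker-Content-Digest` response header. The registry URL and repository name below are placeholders, the function names are hypothetical, and the registry must have deletion enabled in its storage configuration or it will refuse the DELETE.

```python
import urllib.request

# Without this Accept header the registry may return a digest for a
# different manifest schema than the one stored.
MANIFEST_V2 = "application/vnd.docker.distribution.manifest.v2+json"


def digest_request(registry, repo, tag):
    """Build the HEAD request that resolves a tag to its manifest digest."""
    url = f"{registry}/v2/{repo}/manifests/{tag}"
    return urllib.request.Request(url, method="HEAD", headers={"Accept": MANIFEST_V2})


def delete_request(registry, repo, digest):
    """Build the DELETE request for a manifest, addressed by digest."""
    url = f"{registry}/v2/{repo}/manifests/{digest}"
    return urllib.request.Request(url, method="DELETE")


def delete_tag(registry, repo, tag):
    """Resolve the tag, then delete its manifest. Performs real network calls."""
    with urllib.request.urlopen(digest_request(registry, repo, tag)) as response:
        digest = response.headers["Docker-Content-Digest"]
    with urllib.request.urlopen(delete_request(registry, repo, digest)) as response:
        return response.status


# Dry run only: build the requests and show them without contacting anything.
# Both the host and the repository name are placeholders.
head = digest_request("https://registry.example.org", "example-image", "latest")
print(head.get_method(), head.full_url)
```

Deleting a manifest only removes the reference. Reclaiming disk space is a separate garbage-collection step on the registry host, so the experiment should check both.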