
PAWS: Rebuild and upgrade Kubernetes
Closed, Resolved (Public)


PAWS Kubernetes needs to be upgraded as it is well outside of supported releases at this point.

Co-opting this task to introduce a full redesign of the PAWS Kubernetes layer so that it works similarly to the Toolforge Kubernetes build.
Following that model, the resulting cluster will be:

  • Highly available
  • Debian Buster
  • Much more secure
  • Hopefully use a normal k8s ingress like Toolforge does
  • Be understood and supportable for WMCS
  • Largely puppetized T188912
  • kubeadm controlled

Scope notes



Event Timeline


Mentioned in SAL (#wikimedia-cloud) [2020-05-26T22:05:51Z] <bstorm_> temporarily deleted the deployment for maintain-kubeusers pending patch to fix context creation for new admin accounts T211096 T246059

Change 598863 merged by jenkins-bot:
[labs/tools/maintain-kubeusers@master] contexts: context should be correct for project

Mentioned in SAL (#wikimedia-cloud) [2020-05-26T22:34:39Z] <bstorm_> restored the deployment for maintain-kubeusers so anyone added to the paws.admin group will have admin on the cluster now that the bug is fixed T211096 T246059

I'm putting up a series of test images in to validate settings and changes in the new cluster. I'm hoping not to push into the main repo until we are ready for a PR. Then I'll do a manual tag push in order to allow the deploy-hook to use cached versions of the images (so it isn't compiling openresty).

Mentioned in SAL (#wikimedia-cloud) [2020-06-12T18:49:03Z] <bstorm_> deployed a test of paws chart in the new cluster T211096

I did the deploy as root on paws-k8s-control-1 with the command:

helm install paws --namespace prod ./paws -f paws/secrets.yaml --set=dbProxy.image.tag=test --set=deployHook.image.tag=test --set=jupyterhub.hub.db.url="sqlite://" --set=jupyterhub.hub.image.tag=test --set=jupyterhub.hub.db.type=sqlite --set=jupyterhub.singleuser.image.tag=test

The overrides ensure that only the test images are used and that the deploy doesn't touch the database of the running PAWS instance. This is effectively what the script does with a lot more control.
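As a rough sketch, the overrides in the command above could be generated from a single tag so every image is pinned consistently. The function name `paws_test_install_cmd` and its tag argument are hypothetical and not part of the PAWS repo; the function only prints the command and never runs helm.

```shell
# Hypothetical helper: print (not run) the helm install command with every
# image pinned to one tag and the hub pointed at a throwaway sqlite database.
paws_test_install_cmd() {
    tag="${1:-test}"
    printf 'helm install paws --namespace prod ./paws -f paws/secrets.yaml'
    # Image tag overrides, using the same chart keys as the command in the task.
    for key in dbProxy deployHook jupyterhub.hub jupyterhub.singleuser; do
        printf ' --set=%s.image.tag=%s' "$key" "$tag"
    done
    # Keep the hub away from the database of the running instance.
    printf ' --set=jupyterhub.hub.db.url="sqlite://" --set=jupyterhub.hub.db.type=sqlite\n'
}

paws_test_install_cmd test
```

The flags come out in a different order than in the hand-typed command, which should not matter because each `--set` targets a distinct key.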

bstorm@paws-k8s-control-1:~/src/paws$ kubectl -n prod get pods
NAME                              READY   STATUS    RESTARTS   AGE
continuous-image-puller-6kmb7     1/1     Running   0          2m54s
continuous-image-puller-d4tn5     1/1     Running   0          2m54s
continuous-image-puller-llh74     1/1     Running   0          2m54s
continuous-image-puller-skw95     1/1     Running   0          2m54s
db-proxy-8559ddc847-wzl5d         1/1     Running   0          2m54s
deploy-hook-7867c755bc-97cc9      1/1     Running   0          2m54s
proxy-5cbf487898-rlzrv            1/1     Running   0          2m54s
user-scheduler-59b557759f-65lq5   1/1     Running   0          2m54s
user-scheduler-59b557759f-jpg7j   1/1     Running   0          2m54s

The deploy is happy so far. This will give the ingress something to target in T195217: Simplify ingress methods for PAWS.
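For repeat deploys, a check like the following could replace eyeballing the pod list. It is only a sketch: the function name `unhealthy_pods` is hypothetical, and it assumes the default column layout of `kubectl get pods` shown above (NAME, READY, STATUS in the first three columns).

```shell
# Hypothetical check: read `kubectl get pods` output on stdin and print the
# name of any pod that is not Running or whose READY count is incomplete.
unhealthy_pods() {
    awk 'NR > 1 {
        split($2, ready, "/")
        if ($3 != "Running" || ready[1] != ready[2]) print $1
    }'
}

# Sample rows copied from the output above; on the control node this would be
# `kubectl -n prod get pods | unhealthy_pods` instead.
sample='NAME                              READY   STATUS    RESTARTS   AGE
db-proxy-8559ddc847-wzl5d         1/1     Running   0          2m54s
proxy-5cbf487898-rlzrv            1/1     Running   0          2m54s'

bad="$(printf '%s\n' "$sample" | unhealthy_pods)"
if [ -z "$bad" ]; then
    echo "all pods healthy"
else
    echo "unhealthy pods: $bad"
fi
```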

I redeployed it quickly to make sure it had the new volume changes right. The volumes in the current PAWS deployment won't work for dumps anymore at all.

If the whole project drags on too long I'll do a separate PR from a feature branch in the toolforge/paws repo to fix the volumes on the live setup.

We have also now fixed a permissions quirk for the homes that was previously corrected by an init container. Fixing it in the hub values means the containers don't need root.

Helm 3 seems to "just work" ™
That said, I think we should leave that box unchecked until we are deploying via a deploy-hook instead of by hand :)

Bstorm updated the task description. (Show Details)

We should put this into tools-prometheus:

I suggest we use the metricsinfra prometheus instead to reduce coupling with the tools project.

I just created T256361: PAWS: get new service and cluster metrics into prometheus

Mentioned in SAL (#wikimedia-cloud) [2020-06-25T16:39:53Z] <bstorm> deleted the deployhook from the in-progress new cluster for now just in case T211096

Mentioned in SAL (#wikimedia-cloud) [2020-06-25T22:43:44Z] <bstorm> bumped quota up to 24 instances, 128 GB RAM and 56 cores T211096

Mentioned in SAL (#wikimedia-cloud) [2020-06-25T22:52:04Z] <bstorm> created paws-k8s-worker-5/6/7 as x-large nodes to bring the cluster up to roughly the same capacity as the existing one using soft anti-affinity T211096 T253267

I've added 3 x-large nodes to the cluster to see how that works out. They should work better than 6 large nodes (which would be a replacement for what's in tools-paws) because the larger nodes will handle the overhead better.

bstorm@paws-k8s-control-1:~$ kubectl --as admin --as-group system:masters get nodes
NAME                 STATUS   ROLES     AGE     VERSION
paws-k8s-control-1   Ready    master    30d     v1.16.10
paws-k8s-control-2   Ready    master    30d     v1.16.10
paws-k8s-control-3   Ready    master    30d     v1.16.10
paws-k8s-ingress-1   Ready    ingress   21d     v1.16.10
paws-k8s-ingress-2   Ready    ingress   21d     v1.16.10
paws-k8s-worker-1    Ready    <none>    30d     v1.16.10
paws-k8s-worker-2    Ready    <none>    30d     v1.16.10
paws-k8s-worker-3    Ready    <none>    30d     v1.16.10
paws-k8s-worker-4    Ready    <none>    30d     v1.16.10
paws-k8s-worker-5    Ready    <none>    15m     v1.16.10
paws-k8s-worker-6    Ready    <none>    6m23s   v1.16.10
paws-k8s-worker-7    Ready    <none>    29s     v1.16.10

Mentioned in SAL (#wikimedia-cloud) [2020-07-23T21:06:44Z] <bstorm> pushing dbproxy docker image for new cluster into main repo T211096

Mentioned in SAL (#wikimedia-cloud) [2020-07-23T21:08:28Z] <bstorm> pushing to the main repo T211096

Mentioned in SAL (#wikimedia-cloud) [2020-07-23T21:09:53Z] <bstorm> pushing to the main repo T211096

Mentioned in SAL (#wikimedia-cloud) [2020-07-23T21:11:44Z] <bstorm> pushing to main repo T211096

Mentioned in SAL (#wikimedia-cloud) [2020-07-23T21:14:36Z] <bstorm> pushing to main repo T211096

Trimmed the deploy command to helm upgrade paws --namespace prod ./paws -f paws/secrets.yaml --set=jupyterhub.hub.db.url="sqlite://" --set=jupyterhub.hub.db.type=sqlite

That can remain permanent after we start adding the latest tag to images (though the deploy-hook should be more specific). I'm just moving toward sane defaults for development and simplification purposes. On switch-over, I will be dropping the sqlite params and it will *just* be: helm upgrade paws --namespace prod ./paws -f paws/secrets.yaml. At that point, we are helming! The deploy-hook should specify tags, and image dev would override the images as well, but we shouldn't require all that just to basically run helm and have it not explode. This is especially true because I'm pushing off deploy-hook updates until after the cutover.
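The two forms of the upgrade command described here (with sqlite overrides before the switch-over, bare afterwards) could be captured in one small wrapper. This is a sketch only: `paws_upgrade_cmd` and the `pre-cutover` argument are hypothetical names, and the function prints the command rather than running helm.

```shell
# Hypothetical wrapper: print the helm upgrade command for the new cluster.
# Pass "pre-cutover" to keep the hub on sqlite; with any other argument (or
# none) it prints the bare command intended for use after the switch-over.
paws_upgrade_cmd() {
    cmd='helm upgrade paws --namespace prod ./paws -f paws/secrets.yaml'
    if [ "${1:-}" = "pre-cutover" ]; then
        cmd="$cmd --set=jupyterhub.hub.db.url=\"sqlite://\" --set=jupyterhub.hub.db.type=sqlite"
    fi
    printf '%s\n' "$cmd"
}

paws_upgrade_cmd pre-cutover
paws_upgrade_cmd
```

Keeping the sqlite flags behind an explicit argument means the default invocation matches the post-cutover command, so nothing has to be remembered or removed on the day of the switch.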

Bstorm updated the task description. (Show Details)

Mentioned in SAL (#wikimedia-cloud) [2020-07-23T22:48:50Z] <bstorm> tagged the newbuild tags with "latest" to set sane defaults for all images in the helm chart T211096

Mentioned in SAL (#wikimedia-cloud) [2020-07-23T22:51:38Z] <bstorm> deploying via the default 'latest' tag in the new cluster T211096

Updated the command used to deploy and upgrade here

Thanks for taking over this! I promised many times that I would do it, and then never did it for real....

I saw the diagram and left some comments in the talk page, here:

One thing: I believe we don't need the DNS wildcard for * and instead we could just hardcode the few actual domains we use:

  • A

Not having the wildcard may reduce confusion, for example with the `k8s.svc.paws.wmcloud.org` address, which is not an actual service address for anything.

I will introduce this DNS change now, feel free to revert if you think this is wrong!

Mentioned in SAL (#wikimedia-cloud) [2020-07-24T09:39:46Z] <arturo> dropped the DNS wildcard record * IN A and created concrete CNAME records for the FQDNs we actually use (T211096)

I've reviewed the plan, and I like it! +1, thanks!

Not having the wildcard may reduce confusion, for example with the `k8s.svc.paws.wmcloud.org` address, which is not an actual service address for anything.

Oops! I messed up the service address in my diagram. It is supposed to be :)

My diagram also could confuse a person into thinking that CoreDNS labels these services with these addresses (which it doesn't, they have very different addresses in CoreDNS). It's just to show the general flows, really.

@aborrero I think the specific records are more correct, in general, so that's cool. The wildcard was just the super-easy thing to do while banging on it :)

One thing I will say is that the k8s API is built to be public-facing when RBAC and auth are correctly implemented with a load balancer out front. There isn't a very strong reason we keep it restricted to local shell access. Because of the general security model here, I don't intend to make it public right now--just a comment. I'd prefer if Kubernetes x509-based auth had a clear way of generating revocation lists before doing that, as well.

The PR is up. Please, nobody merge it until after we move DNS (which implies we've done a final NFS sync and actually announced the change first).

Mentioned in SAL (#wikimedia-cloud) [2020-08-07T15:53:18Z] <bstorm> downtiming alerts in case they need changes (seems likely) T211096

Mentioned in SAL (#wikimedia-cloud) [2020-08-07T15:58:10Z] <bstorm> switching old cluster to sqlite T211096

Mentioned in SAL (#wikimedia-cloud) [2020-08-07T16:02:15Z] <bstorm> switching old cluster to toolsdb T211096

Mentioned in SAL (#wikimedia-cloud) [2020-08-07T16:02:34Z] <bstorm> LAST MESSAGE WRONG: switching NEW cluster to toolsdb T211096

Mentioned in SAL (#wikimedia-cloud) [2020-08-07T16:08:06Z] <bstorm> changing to point at the new cluster ip T211096

Mentioned in SAL (#wikimedia-cloud) [2020-08-07T17:05:45Z] <bstorm> running the final rsync to the new cluster's nfs T211096

Change 619019 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] paws: monitor the new URLs instead of the deprecated ones

Mentioned in SAL (#wikimedia-cloud) [2020-08-07T17:49:52Z] <bstorm> shutting down the entire old cluster T211096

Mentioned in SAL (#wikimedia-cloud) [2020-08-07T18:01:02Z] <bstorm> shutting down paws-proxy-02 T211096

Change 619019 merged by Bstorm:
[operations/puppet@production] paws: monitor the new URLs instead of the deprecated ones

Mentioned in SAL (#wikimedia-cloud) [2020-08-07T22:30:53Z] <bstorm> removing downtime for paws and front page monitor T211096

Mentioned in SAL (#wikimedia-cloud) [2020-08-14T17:04:42Z] <bstorm> deleting instances "tools-paws-master-01", "tools-paws-worker-1005", "tools-paws-worker-1006", "tools-paws-worker-1003", "tools-paws-worker-1002", "tools-paws-worker-1001", "tools-paws-worker-1007", "tools-paws-worker-1013", "tools-paws-worker-1016", "tools-paws-worker-1017", "tools-paws-worker-1010", "tools-paws-worker-1019" T211096

Mentioned in SAL (#wikimedia-cloud) [2020-08-14T17:09:38Z] <bstorm> backing up the old proxy config to NFS and deleting paws-proxy-02 T211096

Bstorm claimed this task.
Bstorm updated the task description. (Show Details)

First phase of productionalizing and enabling paws is done.