
Investigate moving PAWS to magnum
Closed, Resolved · Public

Description

Right now there are some custom bits of PAWS tied to our particular k8s setup, which keeps us stuck on our current deploy of k8s. Ideally we would be able to deploy PAWS to any vanilla k8s cluster. Investigate that.

Event Timeline

Chicocvenancio subscribed.

I think volume persistence is the biggest blocker here, right?

Yes, I believe that would be the largest obstacle, and it is the bulk of the code I was working on in
https://github.com/toolforge/paws/compare/T308873
The basic idea is that we would still use NFS, but k8s would do the mounting of it rather than the worker nodes. That makes us more k8s native, and we could, in concept, lift PAWS up and move it to a different k8s cluster. I got this working in a devstack deployment, though that doesn't really test the main detail: does this actually mount our Cloud VPS NFS? (I made a local NFS server, separate from k8s, to test against.) We're starting to tinker with Magnum in codfw1dev, but it is not working yet to test there.
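
For illustration only, the plain k8s-native way to have the cluster mount NFS (rather than the worker nodes mounting it) is an NFS-backed PersistentVolume plus a claim; the server address, export path, and sizes below are placeholders, not necessarily what the branch above actually uses:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: paws-nfs-home
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: nfs.svc.example.org   # placeholder for the Cloud VPS NFS server
    path: /srv/paws               # placeholder export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: paws-nfs-home
  namespace: prod
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  volumeName: paws-nfs-home
  resources:
    requests:
      storage: 100Gi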

There is a functional prototype in codfw1dev. You need to hack on your hosts file to get it working, though.

The prototype runs in Magnum, in parallel with production PAWS. The Magnum cluster was deployed with:
openstack coe cluster create rook1 --cluster-template core-34-k8s21-100g --master-count 1 --node-count 1 --floating-ip-disabled
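
Cluster creation takes a while; one way to watch for it to reach CREATE_COMPLETE before generating the kubeconfig (cluster name as above):

openstack coe cluster show rook1 -c status
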
kube config file generated with:
openstack coe cluster config rook1 --dir /tmp/
Access is limited to clients that can reach the cluster directly (or that pass --insecure-skip-tls-verify to kubectl). That is probably not a problem, since this can be managed from the bastion, and can certainly be managed from another node in the project.
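
Assuming the command above wrote the kubeconfig to /tmp/config, usage looks something like:

export KUBECONFIG=/tmp/config
kubectl get nodes
kubectl --insecure-skip-tls-verify get nodes   # only if the cert doesn't match the address in use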

nginx ingress is needed, as described in https://kubernetes.github.io/ingress-nginx/deploy/. That guide deploys a LoadBalancer-type service, which you can still use by putting the port in haproxy, though this isn't ideal; additional work is needed to identify how to move this to a native NodePort like the current cluster uses.
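
As a sketch of that remaining work (the manifest version pinned here is only an example; the linked guide has the current one), the LoadBalancer service the guide creates can be switched to NodePort after install:

kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.5.1/deploy/static/provider/cloud/deploy.yaml
kubectl -n ingress-nginx patch svc ingress-nginx-controller -p '{"spec": {"type": "NodePort"}}'
kubectl -n ingress-nginx get svc ingress-nginx-controller   # note the assigned node ports to plug into haproxy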

We will need to add the following annotation on various Ingress entries:
kubernetes.io/ingress.class: "nginx"
We might also need:

spec:
  ingressClassName: nginx
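
Put together, an Ingress entry would look roughly like this (the object and backend service names here are illustrative, not necessarily what the PAWS chart produces):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: paws-hub
  namespace: prod
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  ingressClassName: nginx
  rules:
    - host: hub.paws.wmcloud.org
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: proxy-public   # illustrative backend service name
                port:
                  number: 80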

At this point I pointed the extra floating IP in PAWS at a test haproxy instance, and the haproxy at the Magnum cluster. This seems to work (though it could easily be knocked over, it is very small) after editing the hosts file:

185.15.56.58 hub.paws.wmcloud.org
185.15.56.58 paws.wmcloud.org
185.15.56.58 paws.wmflabs.org
185.15.56.58 public.paws.wmcloud.org
185.15.56.58 paws-public.wmflabs.org

(Don't rely on this working; I remove it regularly for tinkering.)
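
As an alternative to editing the hosts file, a single request can be spot-checked with curl's --resolve flag, e.g.:

curl -sI --resolve hub.paws.wmcloud.org:443:185.15.56.58 https://hub.paws.wmcloud.org/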

The remaining notes are old but were used; additional research is needed on which of these steps are still required. Some parts are elaborated further in https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Devstack_magnum/PAWS_dev_devstack

kubectl config set-context --current --namespace=prod
kubectl rollout restart -n kube-system daemonset.apps/kube-flannel-ds # not sure if this is needed any longer

cd paws
cd cloud-provider-openstack
kubectl create -f manifests/cinder-csi-plugin/csi-secret-cinderplugin.yaml # not sure if this is needed any longer
kubectl apply -f manifests/cinder-csi-plugin/  # not sure if this is needed any longer

cd ..
kubectl apply -f sc.yaml # not sure if this is needed any longer
cd <paws git directory>
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm dep up paws/
kubectl create namespace prod
helm install paws --namespace prod ./paws -f paws/secrets.yaml -f paws/production.yaml --timeout=50m
kubectl apply -f manifests/psp.yaml
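
A quick sanity check after the install (not part of the original notes) is just to confirm the pods and Ingress objects come up:

kubectl --namespace prod get pods
kubectl --namespace prod get ingress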

So long as we like this solution as is, we would need to test a Magnum upgrade, for which we will need to be on OpenStack Yoga, as Xena appears to be limited to k8s 1.21.

cluster template created with:

openstack coe cluster template create core-34-k8s21-100g --image magnum-fedora-coreos-34 --external-network wan-transport-eqiad --fixed-network lan-flat-cloudinstances2b --fixed-subnet cloud-instances2-b-eqiad --dns-nameserver 8.8.8.8 --network-driver flannel --docker-storage-driver overlay2 --docker-volume-size 100 --master-flavor g3.cores1.ram2.disk20 --flavor g3.cores1.ram2.disk20 --coe kubernetes --labels kube_tag=v1.21.8-rancher1-linux-amd64,hyperkube_prefix=docker.io/rancher/,cloud_provider_enabled=true --public

Upgrading the cluster

openstack coe cluster upgrade rook3 core-34-k8s22-100g
This updates the control node but not the worker; the cluster status is marked UPDATE_FAILED, and retrying the upgrade command gives:
Upgrading a cluster when status is "UPDATE_FAILED" is not supported (HTTP 400) (Request-ID: req-5f1a6d9f-6331-4d12-8be9-d379e6cec237)
The following clears UPDATE_FAILED and returns the status to UPDATE_IN_PROGRESS:
openstack coe cluster resize <cluster name> <current number of worker nodes>
Eventually it is marked "UPDATE_COMPLETE", but only the control node has been updated, not the worker. Running the upgrade again doesn't seem to do anything. Scaling the cluster to 0 workers, then back to one, seems to get the whole cluster upgraded, though the haproxy then needs to be updated to point at the new worker nodes.
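
Putting that recovery procedure together as commands (cluster name and worker count are the ones from these notes; adjust to the cluster at hand):

openstack coe cluster show rook3 -c status    # confirm UPDATE_FAILED / UPDATE_IN_PROGRESS / UPDATE_COMPLETE
openstack coe cluster resize rook3 1          # clears UPDATE_FAILED, back to UPDATE_IN_PROGRESS
openstack coe cluster resize rook3 0          # drop the worker...
openstack coe cluster resize rook3 1          # ...and bring back an upgraded one, then repoint haproxy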

The value of this upgrade method is somewhat questionable compared to deploying a new cluster and redeploying PAWS to it, as the latter is fairly trivial. The new-cluster method also has no downtime (though in PAWS's case it may confuse existing auth sessions, requiring users to log out and back in), and it tests disaster recovery on each upgrade.

rook renamed this task from "Investigate moving PAWS to magnum or other k8s" to "Investigate moving PAWS to magnum". Jan 10 2023, 3:48 PM