
"paws-public" tool running 2 custom pods on legacy Kubernetes cluster
Closed, ResolvedPublic


The paws-public tool is running these objects on the legacy kubernetes cluster:

$ /usr/local/bin/kubectl get deployment,rs,po,svc
NAME                READY     STATUS    RESTARTS   AGE
po/nbserve-t5pht    1/1       Running   2          129d
po/renderer-h8f9t   1/1       Running   2          142d
NAME              CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
svc/paws-public   <none>        8000/TCP   3y
svc/renderer   <none>        8000/TCP   3y

The tool's $HOME has no documentation or YAML files from which to recreate these objects on the 2020 Kubernetes cluster. We need to figure out how to migrate them.

NOTE: Interacting with the legacy Kubernetes cluster needs an older version of kubectl. Use /data/project/paws-public/bin/kubectl.
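Assuming the old cluster's API is still reachable with that pinned kubectl, one way to get a starting point for the migration is to dump the live objects to YAML (a sketch; the output file name is illustrative):

```
# Dump the legacy pods and services to YAML so their definitions can be
# translated by hand for the 2020 cluster. Uses the pinned 1.4-era kubectl.
/data/project/paws-public/bin/kubectl get po,svc -o yaml > legacy-objects.yaml
```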

Event Timeline

Looks like the pods come from ReplicationControllers instead of ReplicaSets, interesting.

I have a hunch that the nbserve container could be completely replaced with an Ingress object using various annotations to tune its behavior.

There are some special ingress concerns apparently for this setup. This is all a big nbconvert machine to render notebooks into HTML.
I highly recommend against changing the nbserve container to an ingress object just now because I have a task out there for the internships related to it 😉

I've made sure that there is kubectl v1.4.12 in /data/project/paws-public/bin/ so that we can still interact with the old cluster to move it. I don't know if this is deployed using travis at all like the rest of paws, but I suspect it is not.

Bstorm triaged this task as High priority.Mar 2 2020, 3:41 PM

Setting to high priority since this is the final blocker to turning off the old cluster.

tools.paws-public@tools-sgebastion-08:~$ bin/kubectl get all
NAME          DESIRED   CURRENT   READY     AGE
rc/nbserve    1         1         1         3y
rc/renderer   1         1         1         3y
NAME              CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
svc/paws-public   <none>        8000/TCP   4y
svc/renderer   <none>        8000/TCP   3y
NAME                READY     STATUS    RESTARTS   AGE
po/nbserve-t5pht    1/1       Running   2          131d
po/renderer-h8f9t   1/1       Running   2          145d

There are some serious rabbit holes here. I suspect this was deployed by hand. Shell history suggests that it hasn't been touched since it was deployed, except for a few bits. With a quota change and a couple of YAML files, these could easily be simple Deployments and Services, but I suspect that something from PAWS is also involved to know where things are... but maybe not! In a custom deploy, the pods need the label `toolforge: tool` to get all the environment and volume-mount setup automatically.
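A minimal sketch of what that label looks like in a pod template (the `name` value here is a placeholder, not necessarily the tool's real label):

```yaml
# Hypothetical pod template fragment; the toolforge: tool label is what
# triggers the automatic environment and volume-mount injection.
spec:
  template:
    metadata:
      labels:
        name: nbserve
        toolforge: tool
```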

I highly recommend against changing the nbserve container to an ingress object just now because I have a task out there for the internships related to it 😉

Note: This is just because it could obscure parts of the task. I could totally live with having to update the task or even take it off the roster if needed lol.

That's also where the Dockerfiles are kept. I haven't checked for nslcd vs sssd issues or anything like that. They may just work as-is in our registry (though they are jessie images, IIRC).

It shouldn't be too hard to translate the config over to the new cluster by hand if I up the services quota for the tool to 2. I can do that today, and we could sort out collapsing nbserve into its ingress object after it's up and working (if @bd808 isn't already doing that).

Should be easy to translate the RCs into deployments.
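The mechanical translation is roughly: keep the pod template, change the apiVersion/kind, and turn the bare RC selector into `matchLabels`. A hedged sketch (image name and labels are placeholders, not the tool's real values):

```yaml
apiVersion: apps/v1            # was v1 / kind: ReplicationController
kind: Deployment
metadata:
  name: renderer
spec:
  replicas: 1
  selector:
    matchLabels:               # RCs use bare key/value selectors;
      name: renderer           # apps/v1 Deployments require matchLabels
  template:
    metadata:
      labels:
        name: renderer
        toolforge: tool        # gets the automatic env/mount injection
    spec:
      containers:
      - name: renderer
        image: example/renderer:latest   # placeholder image
        ports:
        - containerPort: 8000
```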

Found an interesting blocker. This mounts /data/project/paws/userhomes
That's not going to be allowed by the podsecuritypolicy without changes. I'm going to have to special-case it, most likely, by adding an additional PSP for this tool. PSPs are additive, so it should be fairly clear that there are two policies this tool can use.
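Since PSPs are additive, a second, narrower policy could grant just the extra path without loosening the default one. A sketch, assuming the tool-specific policy only needs to add the shared userhomes hostPath (the policy name is hypothetical):

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: paws-public-userhomes    # hypothetical name
spec:
  privileged: false
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
  - hostPath
  allowedHostPaths:
  - pathPrefix: /data/project/paws/userhomes   # the mount in question
```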

Creating deploy.yaml to house the new objects this will require. So far, just a couple services and deployments (since I can, I'm using the apps/v1 deployment object instead of the deprecated one used by webservice).
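A sketch of what one of those services could look like in deploy.yaml (the selector label is an assumption):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: renderer
spec:
  type: ClusterIP
  selector:
    name: renderer        # assumed pod label
  ports:
  - port: 8000
    targetPort: 8000
```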

Trying to find what sets: os.environ['RENDERER_PORT_8000_TCP_ADDR']

I can request a specific address, but I'm hoping this is set up to not need me to.

Trying to find what sets: os.environ['RENDERER_PORT_8000_TCP_ADDR']

Duh, that's Kubernetes doing that. They just didn't have DNS working yet in the old cluster.
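For a Service named `renderer` with port 8000, Kubernetes injects docker-link-style variables like `RENDERER_PORT_8000_TCP_ADDR` into pods started after the Service exists. A sketch of a lookup that prefers the injected variable but falls back to the cluster DNS name (the DNS fallback is an assumption about how the new cluster resolves services):

```python
import os

def renderer_address():
    """Resolve the renderer backend: injected env var first, DNS name second."""
    addr = os.environ.get('RENDERER_PORT_8000_TCP_ADDR')
    if addr:
        return f'{addr}:8000'
    # On clusters with working service DNS, the Service name resolves directly.
    return 'renderer:8000'

print(renderer_address())
```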

That's fine and will be supported going forward, so leaving it.

So: I add a standard Ingress plus the quota/PSP change, and this is done.

I'll start by deploying just the services. That way we can verify it is up and working. Then when I deploy the ingress, this will be on the new cluster.
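A sketch of the standard Ingress (the hostname and the API version are assumptions based on how Toolforge tools are typically exposed on the 2020 cluster):

```yaml
apiVersion: networking.k8s.io/v1beta1   # assumed API version on the new cluster
kind: Ingress
metadata:
  name: paws-public
spec:
  rules:
  - host: paws-public.toolforge.org     # assumed hostname
    http:
      paths:
      - path: /
        backend:
          serviceName: paws-public
          servicePort: 8000
```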

Mentioned in SAL (#wikimedia-cloud) [2020-03-03T16:38:14Z] <bstorm_> increased the services quota to 2 in the 2020 k8s cluster T246519

Ok, I do NOT need a special PSP because the path prefix we have set should cover it. I'm deploying the services to see how it goes, without removing anything from the old cluster.

Mentioned in SAL (#wikimedia-cloud) [2020-03-03T16:47:55Z] <bstorm_> deploy services and deployments on the 2020 k8s cluster via kubectl apply -f deploy.yaml T246519

Ok, so far, I have the test services up on the new cluster, and I have not broken the tool yet on the old cluster:

tools.paws-public@tools-sgebastion-08:~$ kubectl get all
NAME                            READY   STATUS    RESTARTS   AGE
pod/nbserve-cf9f5f6bd-mv9fl     1/1     Running   0          3m57s
pod/renderer-54c84f9b5d-qkls2   1/1     Running   0          3m57s

NAME                  TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/paws-public   ClusterIP   <none>        8000/TCP   3m58s
service/renderer      ClusterIP   <none>        8000/TCP   3m58s

NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/nbserve    1/1     1            1           3m58s
deployment.apps/renderer   1/1     1            1           3m58s

NAME                                  DESIRED   CURRENT   READY   AGE
replicaset.apps/nbserve-cf9f5f6bd     1         1         1       3m58s
replicaset.apps/renderer-54c84f9b5d   1         1         1       3m58s

I'll test that service from inside a pod and see if I get sane replies, then sort out the ingress.
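From a shell inside one of the new pods, the in-cluster check could look like this (the path is illustrative):

```
# Hit the renderer service by its DNS name from inside a pod on the
# new cluster; an HTTP response of any kind means the Service wiring works.
curl -i http://renderer:8000/
```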

This seems to work!

/app # curl

Gonna try rendering a notebook, then I'll put up the ingress.

That came up as 404. Checking what it is doing with file perms. Good chance the old ones worked with a different UID in a weird way.

It definitely answered the request, so the perms issue is not it, yay!

[pid: 11|app: 0|req: 1/1] () {32 vars in 469 bytes} [Tue Mar  3 17:09:30 2020] GET /paws-public/0036878/CoupledOscillators.ipynb => generated 9 bytes in 4068 msecs (HTTP/1.1 404) 2 headers in 86 bytes (1 switches on core 0)

That's from the renderer pod.

The index of the user's dir doesn't do the rewrite. Waaaaaiiiiit. The base image these use could cause issues.


I think I need to build new paws-public images for the new cluster, which is easy enough....and long overdue anyway.

Ok, so the nginx image for this resists upgrade because of:
2020/03/03 19:46:13 [emerg] 6#6: unknown directive "lua_shared_dict" in /tmp/nginx.conf:69
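`lua_shared_dict` comes from the third-party ngx_http_lua_module (OpenResty), not stock nginx, so a plain upgraded nginx image will reject the config. The offending directive looks roughly like:

```nginx
# Requires nginx built with ngx_http_lua_module (e.g. via the Debian
# libnginx-mod-http-lua package or an OpenResty base image).
lua_shared_dict cache 16m;   # the dict name and size here are illustrative
```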

I'm not going to troubleshoot that right now unless necessary. The renderer pod is very happy running buster.

Mentioned in SAL (#wikimedia-cloud) [2020-03-03T21:55:49Z] <bstorm_> launched ingress on the new cluster, removing the service object on the old cluster T246519

Mentioned in SAL (#wikimedia-cloud) [2020-03-03T21:57:51Z] <bstorm_> recreated the service on the old cluster because it didn't work right away? T246519

Mentioned in SAL (#wikimedia-cloud) [2020-03-03T22:07:35Z] <bstorm_> removing the service object on the old cluster again T246519

Mentioned in SAL (#wikimedia-cloud) [2020-03-03T22:20:13Z] <bstorm_> created ingress for and deleted old service object T246519

Mentioned in SAL (#wikimedia-cloud) [2020-03-03T22:23:08Z] <bstorm_> deleted all resources on the old cluster T246519

Bstorm claimed this task.