
"teg" tool needs a higher Services quota to migrate to 2020 Kubernetes cluster
Closed, Resolved, Public

Description

https://tools.wmflabs.org/admin/tool/teg appears to use a frontend managed by webservice and a backend that runs a Java service in a custom Deployment.

The webservice migrate process moved the frontend to the 2020 Kubernetes cluster with no issues. Manually attempting to move the custom deployment as found in the tool's $HOME/backend.yml file showed a partial failure due to the quota for Service objects:

$ /usr/bin/kubectl create --validate=true -f backend.yml
deployment.extensions/teg-backend created
Error from server (Forbidden): error when creating "backend.yml": services "teg-backend" is forbidden: exceeded quota: tool-teg, requested: services=1, used: services=1, limited: services=1

The tool seems to have a number of other issues as well, including a non-functional $HOME/.lighttpd.conf:

$ kubectl logs -f teg-6b4f8669c8-b5ndj
Undefined env variable: TEG_BACKEND_SERVICE_HOST
2020-02-29 22:24:28: (configfile.c.1154) source: /var/run/lighttpd/teg line: 615 pos: 19 parser failed somehow near here: (COMMA)

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2020-02-29T22:32:23Z] <wm-bot> <root> Stopped php7.2 webservice stuck in CrashLoopBackOff due to a syntaxically invalid /data/project/teg/.lighttpd.conf file (T246553)

Mentioned in SAL (#wikimedia-cloud) [2020-02-29T22:34:01Z] <wm-bot> <root> Deleted partially applied /data/project/teg/backend.yml Kubernetes deployment on 2020 Kubernetes cluster (T246553)

Mentioned in SAL (#wikimedia-cloud) [2020-02-29T22:35:26Z] <wm-bot> <root> Deleted /data/project/teg/backend.yml Kubernetes deployment on legacy cluster (T246553)

For what it's worth, $HOME/.lighttpd.conf works as expected when the environment variables TEG_BACKEND_SERVICE_HOST and TEG_BACKEND_SERVICE_PORT are set. Until now, these were set automatically from the backend deployment, so this is not a separate issue but a consequence of the backend failure.
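For context, Kubernetes injects `<NAME>_SERVICE_HOST` and `<NAME>_SERVICE_PORT` variables into a pod for each Service that already exists when the pod starts, with the Service name upper-cased and dashes turned into underscores. A minimal sketch of that naming rule follows; the helper `service_env_name` is hypothetical and only illustrates the mapping, it is not part of the tool.

```shell
#!/bin/sh
# Derive the environment variable name Kubernetes injects for a Service.
# Usage: service_env_name <service-name> <HOST|PORT>
# (Hypothetical helper, for illustration only.)
service_env_name() {
    # Upper-case the Service name, then replace dashes with underscores.
    prefix=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]' | tr '-' '_')
    printf '%s_SERVICE_%s\n' "$prefix" "$2"
}

service_env_name teg-backend HOST
service_env_name teg-backend PORT
```

Because the variables are only injected at pod start, a frontend pod created before the teg-backend Service exists would likely need a restart to see them.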

@Bstorm I remember talking with you about these resource limits and how we would provide per-tool quota changes, but I don't remember how to actually do it. :)

I didn't document it, unfortunately! You can edit the quota for the namespace as cluster admin; see https://kubernetes.io/docs/concepts/policy/resource-quotas/#object-count-quota

It'd probably work fine to just run kubectl edit resourcequota -n tool-$tool $quotaname. You may not even need to look up the quota name, since there's only going to be one in each namespace.

The quota is named after the namespace (in this case tool-teg).
kubectl get resourcequota -n tool-teg -o yaml will show it to you. I'll change it now.

root@tools-k8s-control-1:~# kubectl edit resourcequota -n tool-teg tool-teg
resourcequota/tool-teg edited

That worked.
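The interactive edit above could probably also be done non-interactively with `kubectl patch`. The sketch below assumes the quota key is `services` (as shown in the "exceeded quota" error) and that the new value is 2; the helpers `quota_patch` and `raise_quota` are hypothetical names, and `raise_quota` is defined but deliberately not called here because it needs cluster-admin credentials.

```shell
#!/bin/sh
# Build a JSON merge patch that sets one hard limit on a ResourceQuota.
# Usage: quota_patch <resource> <value>
quota_patch() {
    printf '{"spec":{"hard":{"%s":"%s"}}}\n' "$1" "$2"
}

# Apply the patch to the quota named after the tool's namespace.
# Usage: raise_quota <tool> <resource> <value>
# Assumption: run as cluster admin on a control node; not invoked below.
raise_quota() {
    kubectl patch resourcequota "tool-$1" -n "tool-$1" \
        --type merge -p "$(quota_patch "$2" "$3")"
}

# Print the patch that would be sent for this task.
quota_patch services 2
```

Running `raise_quota teg services 2` would then be the one-line equivalent of the manual edit, under the assumptions above.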

Mentioned in SAL (#wikimedia-cloud) [2020-03-01T20:45:52Z] <bstorm_> increased services quota to 2 for k8s T246553

Mentioned in SAL (#wikimedia-cloud) [2020-03-01T20:48:38Z] <bstorm_> running kubectl apply -f backend.yml T246553

Mentioned in SAL (#wikimedia-cloud) [2020-03-01T20:49:37Z] <bstorm_> starting php7.2 webservice T246553

Ok, I did those in the wrong order, clearly.

51s         Warning   FailedCreate        replicaset/teg-backend-649f697c88   Error creating: pods "teg-backend-649f697c88-2kpkk" is forbidden: maximum cpu usage per Container is 1, but limit is 2

Ah, no, the problem is that the backend is greedier than I thought.

I have reduced the number of CPUs requested to 1, and everything is working again now.
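The FailedCreate event above compares the container's CPU limit with the per-container maximum for the namespace. As a rough illustration of that comparison, the sketch below converts Kubernetes CPU quantities such as `2`, `2.5`, or `500m` to millicores and checks a limit against a maximum; `cpu_millis` and `cpu_fits` are hypothetical helpers, not anything the cluster or the tool provides.

```shell
#!/bin/sh
# Convert a Kubernetes CPU quantity ("2", "2.5", "500m") to millicores.
cpu_millis() {
    awk -v q="$1" 'BEGIN {
        if (q ~ /m$/) { sub(/m$/, "", q); printf "%.0f\n", q }
        else          { printf "%.0f\n", q * 1000 }
    }'
}

# Succeed when the container limit ($1) does not exceed the maximum ($2).
cpu_fits() {
    [ "$(cpu_millis "$1")" -le "$(cpu_millis "$2")" ]
}

cpu_fits 2 1 || echo "limit 2 exceeds the per-container maximum of 1"
cpu_fits 1 1 && echo "limit 1 is accepted"
```

With the values from the event, a limit of 2 fails the check and a limit of 1 passes, which matches the fix of requesting a single CPU.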

Mentioned in SAL (#wikimedia-cloud) [2020-03-01T20:58:35Z] <bstorm_> set namespace resourcequota for cpu to 2.5 T246553

Ah, I also increased your quota a bit. Either way, that works!

tools.teg@tools-sgebastion-08:~$ kubectl get all
NAME                               READY   STATUS    RESTARTS   AGE
pod/teg-6b4f8669c8-rq8mr           1/1     Running   0          3m38s
pod/teg-backend-854584765b-kkjpv   1/1     Running   0          3m57s


NAME                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/teg           ClusterIP   10.102.199.138   <none>        8000/TCP   7m
service/teg-backend   ClusterIP   10.109.158.132   <none>        4223/TCP   6m38s


NAME                          READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/teg           1/1     1            1           7m
deployment.apps/teg-backend   1/1     1            1           7m15s

NAME                                     DESIRED   CURRENT   READY   AGE
replicaset.apps/teg-6b4f8669c8           1         1         1       7m
replicaset.apps/teg-backend-854584765b   1         1         1       3m57s

Mmarx claimed this task.

Thanks!