
Copyvios tool webservice failed to start on new Kubernetes cluster
Closed, Resolved · Public · BUG REPORT

Description

This ticket will unfortunately be a bit devoid of information, because I wasn't sure how to debug the problem and I didn't want to leave the tool broken.

I followed the migration instructions (https://wikitech.wikimedia.org/wiki/News/2020_Kubernetes_cluster_migration#Manually_migrate_a_webservice_to_the_new_cluster), first for "earwigbot" as a test, which worked well, then for "copyvios", which failed.

After running webservice start, webservice status reported "Your webservice of type python2 is running", but https://tools.wmflabs.org/copyvios/ returned 503s. kubectl get pods reported "No resources found", as did k8s-status, which showed the service and deployment but with no pods and with "None/None" under Ready for the deployment. I gave it several minutes (~8) in case it was struggling to allocate resources, but nothing changed. I then tried webservice restart, which informed me "Your job is not running, starting" (contrary to webservice status), but this did not change anything.
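(For reference, a minimal sketch of kubectl commands that could surface why a deployment has no pods; the deployment name copyvios is taken from later in this task, and whether these would have shown the actual cause here is an assumption.)

$ kubectl describe deployment copyvios   # conditions such as ReplicaFailure point at quota or admission problems
$ kubectl get replicasets                # a ReplicaSet stuck at DESIRED 1 / CURRENT 0 narrows it down further
$ kubectl get events --sort-by=.metadata.creationTimestamp   # scheduler and quota errors are recorded as events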

I had originally attempted webservice start with overridden limits (--cpu 1 --mem 4) to see how it would perform, but, thinking these might be the problem, I tried again without them and it still did not work.

I've now migrated back to the old cluster and the tool is back to working.

Event Timeline

bd808 subscribed.

@Earwig What help would you like debugging this? I can try the migration myself, but it would be good to know what tolerance you and your users have to downtime before I start messing around.

@bd808 I don't mind 15-20 minutes of downtime if you would like to try yourself (especially now when activity should be lower).

bd808 moved this task from Inbox to Clinic Duty on the cloud-services-team (Kanban) board.

I will try the migration and see if I either magically get different results (possible if it was some resource contention issue) or can find some more descriptive symptoms of failure.

Mentioned in SAL (#wikimedia-cloud) [2020-02-03T21:29:09Z] <bd808> Attempting migration to 2020 Kubernetes cluster (T244107)

$ cat service.manifest
# This file is used by toollabs infrastructure.
# Please do not edit manually at this time.
backend: kubernetes
distribution: debian
version: 2
web: python2
$ webservice status
Your webservice of type python2 is running
$ webservice stop
Stopping webservice
$ kubectl get pods
No resources found.
$ kubectl config use-context toolforge
Switched to context "toolforge".
$ alias kubectl=/usr/bin/kubectl
$ echo "alias kubectl=/usr/bin/kubectl" >> $HOME/.profile
$ webservice --backend=kubernetes python2 start
Starting webservice...........
$ kubectl get all
NAME                            READY   STATUS    RESTARTS   AGE
pod/copyvios-6dfd899c7f-9pxtt   1/1     Running   0          29s


NAME               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/copyvios   ClusterIP   10.100.195.222   <none>        8000/TCP   29s


NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/copyvios   1/1     1            1           29s

NAME                                  DESIRED   CURRENT   READY   AGE
replicaset.apps/copyvios-6dfd899c7f   1         1         1       29s

Looks like it worked.

Mentioned in SAL (#wikimedia-cloud) [2020-02-03T21:34:06Z] <bd808> Now running on 2020 Kubernetes cluster (T244107)

@Earwig if you see anything I did in T244107#5846217 that did not match what you tried to do when following https://wikitech.wikimedia.org/wiki/News/2020_Kubernetes_cluster_migration#Manually_migrate_a_webservice_to_the_new_cluster please let me know so I can fix the docs or add some other advice.

Thanks for the help! The steps you followed seem to match what I tried, so my only theory now is that trying to override the memory/CPU caused it to fail. Unfortunately, now that it's up it seems the default memory limit is too low as my workers are getting SIGKILL'd frequently. Could you help me figure out how to raise that properly? Running webservice restart with -m 4 -c 1 doesn't seem to change anything.

--mem 4Gi might work. A bare 4 will not give you the results you expect for memory; as a memory quantity, that may literally mean 4 MB.

Additionally, you must webservice stop and then webservice start --backend=... etc. to change parameters; restart will just restart the service exactly as it is.
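As a sketch (using the flag syntax that appears later in this task, and the 4Gi value suggested above), the full sequence would look something like:

$ webservice stop
$ webservice --backend=kubernetes --mem 4Gi --cpu 1 python2 start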


I spent some time looking at the configuration for the tool and found that it has a $HOME/www/python/uwsgi.ini file which overrides the default --workers 4 setting with a processes = 6 setting (uwsgi has too many configuration synonyms for my taste). This higher-than-normal concurrency setting plus the lower default memory cap on the 2020 Kubernetes cluster may be causing the OOM problems. There are several things we could try here, but I think the most straightforward will be to bump the memory limits as @Earwig attempted previously. Rather than jumping to the full 4Gi hard limit, I will start with 2Gi:

$ webservice stop
$ webservice --backend=kubernetes --mem 2Gi --cpu 1 python2 start

https://tools.wmflabs.org/k8s-status/namespaces/tool-copyvios/pods/copyvios-64899b5cbd-swfqs/
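For context, an illustrative sketch of the kind of uwsgi override described above; only the processes = 6 setting is confirmed in this task, and the rest of the tool's actual $HOME/www/python/uwsgi.ini is not shown here:

[uwsgi]
# illustrative sketch; uwsgi treats "processes" and "workers" as synonyms,
# so this overrides webservice's default of 4 workers
processes = 6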

It was interesting to note that the uwsgi container was recording SIGKILL for worker processes ("DAMN ! worker 3 (pid: 4846) died, killed by signal 9 :( trying respawn ...") rather than Kubernetes recording full Pod shutdowns as I saw with the autodesc tool yesterday. Does this mean that uwsgi is actually pruning processes to stay within memory limits rather than Kubernetes or the kernel doing the pruning?

Does this mean that uwsgi is actually pruning processes to stay within memory limits rather than Kubernetes or the kernel doing the pruning?

No, I think this is still the kernel OOM-killer. It just happens that the workers are getting killed, not the uwsgi master process (pid 1 in the container), so the master is able to restart them. I'm not sure how the autodesc tool is set up, but it's possible it has a single pid 1 worker process (maybe multithreaded), so when the kernel kills that, the container has no choice but to shut down? Either way, thanks again for the help.
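A sketch of commands that could distinguish the two cases, using the pod name from the k8s-status link above; the uwsgi log path and the availability of kubectl top (metrics-server) are assumptions:

$ kubectl describe pod copyvios-64899b5cbd-swfqs   # "Last State: Terminated, Reason: OOMKilled" means the container itself was killed
$ grep "killed by signal 9" $HOME/uwsgi.log        # worker-level kills land in uwsgi's own log (path is an assumption)
$ kubectl top pod copyvios-64899b5cbd-swfqs        # current usage vs. the 2Gi limit (requires metrics-server)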