
Scholia on Toolforge is not responsive and does not start
Closed, Resolved · Public · BUG REPORT

Description

Steps to Reproduce:

Go to https://tools.wmflabs.org/scholia

Actual Results:

No response from the webservice, or after a long time: "502 Bad Gateway
openresty/1.15.8.1"

Expected Results:

The webservice should respond and display a webpage.


The Scholia Toolforge webservice is operated by me. The webservice was restarted with webservice --backend=kubernetes python3.7 restart as usual, but it now does not come back up correctly.
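For reference, this is roughly the sequence used from the Toolforge bastion to restart the webservice and watch it come up (a sketch; it assumes the uwsgi.log sits in the tool's home directory, which is where the webservice tooling normally writes it):

# become the tool account on a Toolforge bastion
become scholia

# restart the Kubernetes-backed webservice (the command mentioned above)
webservice --backend=kubernetes python3.7 restart

# check the reported state and follow the uwsgi log during startup
webservice status
tail -f ~/uwsgi.log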

The uwsgi.log reads:

*** Operational MODE: preforking ***
WSGI app 0 (mountpoint='') ready in 424 seconds on interpreter 0x5574d4daf090 pid: 1 (default app)
mounting /data/project/scholia/www/python/src/app.py on /scholia
*** Starting uWSGI 2.0.18-debian (64bit) on [Sat Feb 29 08:41:55 2020] ***
compiled with version: 8.2.0 on 10 February 2019 02:42:46
os: Linux-4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26)
nodename: scholia-5bbd659585-6nhcn
machine: x86_64
clock source: unix
pcre jit disabled
detected number of CPU cores: 4
current working directory: /data/project/scholia
detected binary path: /usr/bin/uwsgi-core
chdir() to /data/project/scholia/www/python/src
your memory page size is 4096 bytes
detected max file descriptor number: 1048576
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uwsgi socket 0 bound to TCP address :8000 fd 4
Python version: 3.7.3 (default, Dec 20 2019, 18:57:59)  [GCC 8.3.0]
PEP 405 virtualenv detected: /data/project/scholia/www/python/venv
Set PythonHome to /data/project/scholia/www/python/venv
*** Python threads support is disabled. You can enable it with --enable-threads ***
Python main interpreter initialized at 0x55aecfe07090
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 364600 bytes (356 KB) for 4 cores
*** Operational MODE: preforking ***

Earlier this week, there was a similar problem with the webservice, resulting in a 50x error and similar uwsgi.log output. WMF staff were around at the same time, working on scheduled maintenance, and the day after I found the webservice to be responsive again. Now, a few days later, the problem has come back.

The webservice has recently been updated to Python 3.7, and it is possible that the problem stems from this upgrade or from the migration to the new Kubernetes cluster. The webservice can be started locally with python3.7 runserver.py with no problem.
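For completeness, the local run that works is along these lines (a sketch; the requirements.txt file name and the use of a local virtualenv are assumptions about the repository layout):

# set up a local Python 3.7 environment and install the dependencies
python3.7 -m venv venv
venv/bin/pip install -r requirements.txt

# start the development server, as mentioned above
venv/bin/python runserver.py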

I suppose it could be a resource problem, specifically memory.

https://tools.wmflabs.org/k8s-status/namespaces/tool-scholia/ claims that the webservice is running.
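As a cross-check of the k8s-status page, the pod state can also be queried directly from the tool account (a sketch; on the new cluster the tool's kubeconfig should already point at the tool-scholia namespace, and the pod name below is the one reported in the uwsgi.log):

# list the tool's pods with phase and restart counts
kubectl get pods

# show events and container state for the pod seen in the log above
kubectl describe pod scholia-5bbd659585-6nhcn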

I am not familiar with reading Grafana, but I do not see any issue at https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kubernetes-namespace-resources?orgId=1&var-namespace=tool-scholia&refresh=5m.

I note that another one of my tools, the wembedder Toolforge webservice, seems to have a similar problem.

The issue is also tracked at Scholia issues: https://github.com/fnielsen/scholia/issues/1062

Event Timeline

So I looked into this a bit and found a strange thing. The service is running on the new cluster and appears to be doing what it should. However, I can see that there is also a pod left behind on the old cluster, somehow: no deployment or service, just a pod and a replicaset.

I deleted that to clean up (but you might be interested to note that, @bd808).

Then I took a look at the node your app was running on (found with kubectl get pods -o wide). It had this in the kernel log:

[85767.722980] Memory cgroup out of memory: Kill process 16320 (uwsgi) score 1978 or sacrifice child
[85767.753808] Killed process 16320 (uwsgi) total-vm:616152kB, anon-rss:514900kB, file-rss:13344kB, shmem-rss:656kB
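The investigation above boils down to roughly the following (a sketch with a placeholder node name; it assumes admin access to the worker node, which normal tool accounts do not have):

# find which worker node the pod is scheduled on
kubectl get pods -o wide -n tool-scholia

# on that node, look for cgroup OOM kills in the kernel log
# (NODE is a placeholder for the node name reported above)
ssh NODE
sudo dmesg -T | grep -i -A1 'out of memory'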

So, I gave you more memory and am trying that, in case it was your uwsgi process getting killed: webservice --backend=kubernetes -m 1Gi python3.7 start

It took nearly a minute before I saw it finally go live (which is a function of how long it takes for the Kubernetes ingress to reload...which is quite a while at this point), but it worked.
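If you want to confirm that the larger reservation took effect, the limit can be read back from the tool account; something like the following (the jsonpath expression is just one way to pull it out):

# show the memory limit set on the webservice container(s)
kubectl get pods -o jsonpath='{.items[*].spec.containers[*].resources.limits.memory}{"\n"}'

# or read it from the deployment description
kubectl describe deployment scholia | grep -i memory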

Bstorm claimed this task.
Bstorm triaged this task as High priority.

The way webservice currently works, webservice restart will work fine without arguments, but if you run webservice stop you will want to add the --mem / -m argument when you start it again, like this: webservice --backend=kubernetes -m 1Gi python3.7 start. The app needs more memory reserved than the minimum.
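In other words, the full stop/start cycle with the larger reservation looks roughly like this (a sketch; 1Gi matches the value used above):

# a plain restart keeps the existing settings
webservice --backend=kubernetes python3.7 restart

# after a stop, the memory argument has to be given again on start
webservice --backend=kubernetes python3.7 stop
webservice --backend=kubernetes -m 1Gi python3.7 start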

I'm going to make a subtask about liveness checks for webservice pods. This pod should not have stayed "stable" when the uwsgi process had been killed. It should have gone into a CrashLoopBackOff, which would have made it easier to see that there was a problem.
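For illustration, the condition a liveness check would catch can be seen manually from the tool account (a sketch; the pod name is the one from the log above, and the availability of pgrep inside the container image is an assumption):

# check whether the uwsgi master process is still alive inside the pod;
# a liveness probe would restart the container once this starts failing
kubectl exec scholia-5bbd659585-6nhcn -- pgrep uwsgi || echo "uwsgi is not running"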

Thanks for the explanation and the restart.

I am wondering whether @Bstorm could clarify if -m 1Gi is a typo and should have been -m 1G?

Gi is the proper binary prefix for 1024³ (2³⁰) bytes, so -m 1Gi is not a typo. The enwiki article on binary prefixes has a lot of good information about the common imprecise usage of the decimal prefixes. Internally, Kubernetes supports both, but properly treats 1G as 1000³ bytes and 1Gi as 1024³ bytes.
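The difference between the two is easy to see with a bit of shell arithmetic (purely illustrative):

# decimal prefix: 1G = 1000^3 bytes
echo $(( 1000 ** 3 ))   # 1000000000

# binary prefix: 1Gi = 1024^3 = 2^30 bytes
echo $(( 1024 ** 3 ))   # 1073741824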