Page MenuHomePhabricator

lexeme-forms tool intermittently slow since move to Stretch bastion
Closed, ResolvedPublic

Description

A few days ago (January 14th), I attempted to move the lexeme-forms toolforge tool from Ubuntu Trusty to Debian Stretch (cf. documentation), by stopping the webservice from the Trusty bastion and then rebuilding the Python venv and starting the webservice from the Stretch bastion. Since the tool uses the Kubernetes backend, I’m not sure if this actually makes a difference, but since then it seems to be plagued by intermittent bursts of very slow response time.

To confirm that this is not just a vague feeling, I checked the timings in the access log with the following command:

sed -E -n 's/^.*\[\w+ (\w+ [0-9]+ [0-9]+:[0-9]+:[0-9]+ [0-9]+)\] [A-Z]+ .* generated [0-9]+ bytes in ([0-9]+) msecs.*$/\2\t\1/p' uwsgi.log

The full output is available at F27930388; a brief check with awk '$1 >= 5000' shows, for example, that there are about as many requests taking over five seconds from January 14th to now as there were from June 2018 (when the tool was first deployed) to January 13th, so it seems clear to me that this is a new problem, not one I just didn’t happen to notice earlier.

Some of these requests are ones where the server has to do some work, including MediaWiki API requests; however, there are also some simple requests, such as the index page or static CSS/JS files (served by Flask’s default static endpoint, which uses WSGI X-Sendfile / Linux sendfile(2)), taking tens of seconds.

Event Timeline

Since the tool uses the Kubernetes backend, I’m not sure if this actually makes a difference,

It does not, UNLESS you actually built a new virtualenv from the Stretch bastion instead of from inside a Python Kubernetes container. If you did that then you may be having some unspecified issues from a python version mismatch. The python2 and python3 Kubernetes containers are currently based on Jessie rather than Stretch.

If mixed python version venvs is not the problem, it could be something systemic on the Kubernetes grid. We have had end-user reports of a few other slow to respond Kubernetes tools, but every time I have actually looked myself the problem magically disappeared.

I think I built the venv from inside a Kubernetes shell, but I’m not quite sure. I could always rebuild it just to be sure…

Alright, just to be sure, I rebuilt the venv.

tools.lexeme-forms@tools-sgebastion-06:~$ webservice --backend=kubernetes python shell
Defaulting container name to interactive.
Use 'kubectl describe pod/interactive -n lexeme-forms' to see all of the containers in this pod.
If you don't see a command prompt, try pressing enter.
tools.lexeme-forms@interactive:~$ python3 -m venv ~/www/python/venv-new
tools.lexeme-forms@interactive:~$ source ~/www/python/venv-new/bin/activate
VENV:tools.lexeme-forms@interactive:~$ pip install --upgrade pip
Downloading/unpacking pip from https://files.pythonhosted.org/packages/46/dc/7fd5df840efb3e56c8b4f768793a237ec4ee59891959d6a215d63f727023/pip-19.0.1-py2.py3-none-any.whl#sha256=aae79c7afe895fb986ec751564f24d97df1331bb99cdfec6f70dada2f40c0044
  Downloading pip-19.0.1-py2.py3-none-any.whl (1.4MB): 1.4MB downloaded
Installing collected packages: pip
  Found existing installation: pip 1.5.6
    Uninstalling pip:
      Successfully uninstalled pip
Successfully installed pip
Cleaning up...
VENV:tools.lexeme-forms@interactive:~$ pip install -r ~/www/python/src/requirements.txt 
DEPRECATION: Python 3.4 support has been deprecated. pip 19.1 will be the last one supporting it. Please upgrade your Python as Python 3.4 won't be maintained after March 2019 (cf PEP 429).
# ...
Installing collected packages: Werkzeug, itsdangerous, MarkupSafe, jinja2, click, flask, pyyaml, urllib3, certifi, idna, chardet, requests, oauthlib, requests-oauthlib, six, PyJWT, mwoauth, mwapi, pymysql, toolforge, pluggy, atomicwrites, more-itertools, py, attrs, scandir, pathlib2, pytest
  Running setup.py install for pyyaml ... done
  Running setup.py install for scandir ... done
Successfully installed MarkupSafe-1.1.0 PyJWT-1.7.1 Werkzeug-0.14.1 atomicwrites-1.2.1 attrs-18.2.0 certifi-2018.11.29 chardet-3.0.4 click-7.0 flask-1.0.2 idna-2.8 itsdangerous-1.1.0 jinja2-2.10 more-itertools-5.0.0 mwapi-0.5.1 mwoauth-0.3.3 oauthlib-3.0.1 pathlib2-2.3.3 pluggy-0.8.1 py-1.7.0 pymysql-0.9.3 pytest-4.1.1 pyyaml-3.13 requests-2.21.0 requests-oauthlib-1.2.0 scandir-1.9.0 six-1.12.0 toolforge-4.2.0 urllib3-1.24.1
VENV:tools.lexeme-forms@interactive:~$ exit
logout
Session ended, resume using 'kubectl attach interactive -c interactive -i -t' command when the pod is running
Pod stopped. Session cannot be resumed.

tools.lexeme-forms@tools-sgebastion-06:~$ webservice --backend=kubernetes python stop
Stopping webservice
tools.lexeme-forms@tools-sgebastion-06:~$ mv ~/www/python/venv ~/www/python/venv-old
tools.lexeme-forms@tools-sgebastion-06:~$ mv ~/www/python/venv-new ~/www/python/venv
tools.lexeme-forms@tools-sgebastion-06:~$ webservice --backend=kubernetes python start
Starting webservice...
tools.lexeme-forms@tools-sgebastion-06:~$

(I’ve replaced (venv-new) with VENV: in the prompt because the shell-session syntax highlighter doesn’t seem to like the original version.)

Let’s see if the slowness improves.

Hm, no requests slower than two seconds today…

tools.lexeme-forms@tools-sgebastion-06:~$ tac uwsgi.log | sed -E -n 's/^.*\[\w+ (\w+ [0-9]+ [0-9]+:[0-9]+:[0-9]+ [0-9]+)\] [A-Z]+ .* generated [0-9]+ bytes in ([0-9]+) msecs.*$/\2\t\1/p' | awk '$3 != 30 {exit} 1 {print}' | sort -nr | head -25
1827    Jan 30 20:48:25 2019
1347    Jan 30 21:04:06 2019
1316    Jan 30 16:48:48 2019
1300    Jan 30 16:37:45 2019
1278    Jan 30 20:41:19 2019
1217    Jan 30 16:41:51 2019
1132    Jan 30 20:57:58 2019
1118    Jan 30 21:14:08 2019
1115    Jan 30 20:41:21 2019
1111    Jan 30 16:47:50 2019
1081    Jan 30 20:31:45 2019
1034    Jan 30 16:39:55 2019
1008    Jan 30 16:41:27 2019
994     Jan 30 20:51:16 2019
981     Jan 30 16:32:21 2019
961     Jan 30 16:35:09 2019
926     Jan 30 16:54:41 2019
893     Jan 30 16:56:39 2019
873     Jan 30 16:52:37 2019
848     Jan 30 11:34:04 2019
844     Jan 30 16:50:05 2019
833     Jan 30 11:14:23 2019
830     Jan 30 21:04:35 2019
816     Jan 30 21:16:10 2019
801     Jan 30 21:00:39 2019

For reference, there were still requests taking up to 120 seconds yesterday, just before I rebuilt the venv. I’ll check the log again in a few days, but it looks like the venv rebuild might have fixed this after all.

LucasWerkmeister claimed this task.

A few days later, five requests slower than two seconds but none slower than five seconds:

tools.lexeme-forms@tools-sgebastion-06:~$ tac ~/uwsgi.log | sed -E -n 's/^.*\[\w+ (\w+ *[0-9]+ [0-9]+:[0-9]+:[0-9]+ [0-9]+)\] [A-Z]+ .* generated [0-9]+ bytes in ([0-9]+) msecs.*$/\2\t\1/p' | awk '$2 == "Jan" && $3 < 30 {exit} 1 {print}' | sort -nr | head -50
4771    Jan 31 20:04:58 2019
4700    Jan 31 20:13:21 2019
3412    Jan 31 20:05:00 2019
2282    Feb  3 09:22:31 2019
2227    Feb  1 08:18:12 2019
1944    Feb  1 16:29:45 2019
1827    Jan 30 20:48:25 2019
1392    Jan 31 16:45:34 2019
1347    Jan 30 21:04:06 2019
1316    Jan 30 16:48:48 2019
1300    Jan 30 16:37:45 2019
1287    Feb  4 09:15:30 2019
1278    Jan 30 20:41:19 2019
1226    Feb  3 09:45:07 2019
1217    Jan 30 16:41:51 2019
1200    Jan 31 16:48:25 2019
1151    Jan 31 16:57:46 2019
1132    Jan 30 20:57:58 2019
1118    Jan 30 21:14:08 2019
1115    Jan 30 20:41:21 2019
1111    Jan 30 16:47:50 2019
1104    Feb  1 16:27:32 2019
1081    Jan 30 20:31:45 2019
1053    Feb  4 09:37:41 2019
1046    Jan 31 20:02:06 2019
1034    Jan 30 16:39:55 2019
1020    Jan 31 20:10:02 2019
1008    Jan 30 16:41:27 2019
1007    Feb  3 11:59:11 2019
994     Jan 30 20:51:16 2019
981     Jan 30 16:32:21 2019
977     Feb  1 16:32:09 2019
968     Jan 31 20:06:59 2019
961     Jan 30 16:35:09 2019
951     Jan 31 20:11:51 2019
926     Jan 30 16:54:41 2019
921     Jan 31 20:20:50 2019
909     Feb  4 10:02:02 2019
895     Jan 31 12:15:23 2019
893     Jan 30 16:56:39 2019
873     Jan 30 16:52:37 2019
848     Jan 30 11:34:04 2019
847     Feb  2 13:04:01 2019
844     Jan 30 16:50:05 2019
833     Jan 30 11:14:23 2019
830     Jan 30 21:04:35 2019
823     Jan 31 20:19:40 2019
816     Jan 30 21:16:10 2019
801     Jan 30 21:00:39 2019
797     Feb  3 20:27:06 2019

And for comparison, when including just one more day:

tools.lexeme-forms@tools-sgebastion-06:~$ tac ~/uwsgi.log | sed -E -n 's/^.*\[\w+ (\w+ *[0-9]+ [0-9]+:[0-9]+:[0-9]+ [0-9]+)\] [A-Z]+ .* generated [0-9]+ bytes in ([0-9]+) msecs.*$/\2\t\1/p' | awk '$2 == "Jan" && $3 < 29 {exit} 1 {print}' | sort -nr | head
120642  Jan 29 20:29:36 2019
101449  Jan 29 16:26:34 2019
100881  Jan 29 18:28:20 2019
81863   Jan 29 16:28:51 2019
62183   Jan 29 18:28:19 2019
50626   Jan 29 20:36:33 2019
45706   Jan 29 18:32:36 2019
45342   Jan 29 16:26:25 2019
44426   Jan 29 18:22:47 2019
40453   Jan 29 18:25:50 2019

So either this really was due to my messing up when building the venv, or it was due to some unrelated problem whose timing happened to coincide with both the broken and the fixed venv rebuild :)

Either way, it seems to be okay now, so let’s close this. Thanks for the tip @bd808!