
Performance and OOM issues with Flask tool on Kubernetes
Open, MediumPublic

Description

I have created a Python Flask application that uses quite a lot of memory, several hundred MB. I have a problem with the application: it sometimes does not respond.

I see "502 Bad Gateway" in the response and various respawnings in the uwsgi.log, e.g.,

[pid: 29|app: 0|req: 15/62] 192.168.146.0 () {38 vars in 621 bytes} [Fri Jul  7 00:17:34 2017] GET /wembedder/ => generated 1412 bytes in 10 msecs (HTTP/1.1 200) 2 headers in 81 bytes (1 switches on core 0)
[pid: 29|app: 0|req: 16/63] 192.168.146.0 () {40 vars in 712 bytes} [Fri Jul  7 00:17:36 2017] GET /wembedder/most-similar/ => generated 2823 bytes in 3 msecs (HTTP/1.1 200) 2 headers in 81 bytes (1 switches on core 0)
[pid: 29|app: 0|req: 17/64] 192.168.146.0 () {40 vars in 746 bytes} [Fri Jul  7 00:18:04 2017] GET /wembedder/most-similar/Q315062 => generated 4091 bytes in 4 msecs (HTTP/1.1 200) 2 headers in 81 bytes (1 switches on core 0)
DAMN ! worker 1 (pid: 29) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 1 (new pid: 30)

I have also noted that unrelated Magnus Manske applications, e.g., sourcemd, have been fairly slow in the last couple of days.

Is there a problem with applications that use this much memory? I was under the impression that the service would be able to handle up to 4 GB. Is it possible for me to see the load on the machine? I tried top, qstat, and kubectl. webservice --backend=kubernetes python start takes several minutes, while on my computer the application starts up in a few seconds.

Event Timeline

Fnielsen created this task.Jul 7 2017, 12:28 AM
Restricted Application added a subscriber: Aklapper. Jul 7 2017, 12:28 AM

signal 9 is SIGKILL. OOM killer?

I see "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py"

'python': {
    'cls': PythonWebService,
    'image': 'toollabs-python-web',
    'resources': {
         'limits': {
            # Pods will be killed if they go over memory limit
            'memory': '2Gi',
            # Pods can still burst to more than cpu limit
            'cpu': '2',
         },
         'requests': {
            # Pods are guaranteed at least this many resources
            'memory': '256Mi',
            'cpu': '0.125'
         }
    }
},

With 2 Gi I might run into problems. Does the number of workers (where the default is 4, as far as I know) mean that there is only about 500 MB for each worker?
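A quick way to check what limit the container actually sees, and what it would imply per worker, is to read the cgroup memory limit from inside the pod. This is a sketch, assuming a cluster using cgroup v1 (where the limit is exposed at the path below) and the 4-worker default discussed above:

```python
import os

# cgroup v1 path; inside a Kubernetes pod this reflects the container's limit
CGROUP_LIMIT = '/sys/fs/cgroup/memory/memory.limit_in_bytes'

def read_memory_limit(path=CGROUP_LIMIT):
    """Return the container memory limit in bytes, or None if unavailable."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return int(f.read().strip())

def per_worker_budget(limit_bytes, workers=4):
    """Memory per worker IF the workers split the limit evenly.

    Note: the cgroup limit applies to the container as a whole, not to
    each process, so one greedy worker can still trigger the OOM killer.
    """
    return limit_bytes // workers

# With the 2Gi limit from kubernetesbackend.py and 4 uwsgi workers:
print(per_worker_budget(2 * 1024**3))  # 536870912 bytes, i.e. 512 MiB
```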

chasemp triaged this task as Medium priority.Jul 7 2017, 3:35 PM
chasemp added a subscriber: chasemp.

Usually cgroup limits are per process, but I'm not entirely sure how this shakes out here. Can you run an experiment with fewer workers to see the outcome? Which tool is this? Where is the tool running at the moment? How long does it normally take from start until it becomes unresponsive?

I do not know how to start it with fewer workers. Isn't the number 4 hardcoded into /usr/lib/python2.7/dist-packages/toollabs/webservice/services/uwsgiwebservice.py?
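For reference, uwsgi itself takes its worker count from configuration. A hypothetical standalone uwsgi.ini for running the app by hand might look like the fragment below; note that the webservice wrapper generates its own configuration and does not read such a file, so this only applies outside the wrapper (module name and port are assumptions for illustration):

```ini
; hypothetical standalone uwsgi.ini -- the webservice wrapper
; generates its own configuration and does not read this file
[uwsgi]
processes = 2        ; fewer workers, more memory headroom per worker
module = app:app     ; assumed Flask entry point
http-socket = :8000  ; assumed port
```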

The tool is this one: http://tools.wmflabs.org/wembedder/
At the moment it performs fine.

I have seen it become slow overnight. I restarted it this morning with webservice --backend=kubernetes python start.

On my Flask development server, Python 3 uses about 1.2 GB fairly constantly (somewhat more than I anticipated). I suppose I am running into memory problems. I wonder how I can get around this.
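One way to see where the memory goes is to log the process's peak resident set size at interesting points (e.g. right after loading the model). A minimal sketch using only the standard library; note that ru_maxrss is reported in kilobytes on Linux but in bytes on macOS:

```python
import resource
import sys

def peak_rss_mib():
    """Peak resident set size of this process, in MiB."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports kilobytes, macOS reports bytes
    if sys.platform == 'darwin':
        rss //= 1024
    return rss // 1024

# e.g. call after loading the embedding model to see its footprint
print('peak RSS: %d MiB' % peak_rss_mib())
```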

I still get "502 Bad Gateway". It seems to be tied to one worker. Currently, I see "Respawned uWSGI worker 1" several times, while the other workers (which I see with different pids) do not seem to run into problems.

bd808 renamed this task from Tool labs slow and kills application to Performance and OOM issues with Flask tool on Kubernetes.Jul 11 2017, 9:02 PM
bd808 added a project: Tools.
bd808 added a subscriber: bd808.Jul 11 2017, 9:51 PM

Is there a problem with applications that use this much memory? I was under the impression that the service would be able to handle up to 4 GB.

By default, most webservices running on our Kubernetes cluster are allowed a hard limit of 2 GB of RAM. Java gets 4 GB by default, because Java. We have a process for requesting more quota that is documented on wikitech. Looking at the backend code, it seems that implementing quota bumps for Kubernetes has not been done yet.

Is it possible for me to see the load on the machine? I tried top, qstat, and kubectl.

It depends on what context you want to see.

  • kubectl describe nodes will show you the status of all the exec nodes in the Kubernetes cluster (slow and verbose).
  • kubectl describe node <hostname> will show you a single exec node. You can find out which node your pod is running on with kubectl get pod <pod> -o json|jq '.spec.nodeName'
  • You can get a shell inside a running pod with kubectl exec -i -t <pod> /bin/bash. The containers used to run things on the Kubernetes grid are pretty bare-bones, so you may find debugging from inside them difficult.

webservice --backend=kubernetes python start takes several minutes, while on my computer the application starts up in a few seconds.

Starting the pod requires finding a free space in the exec nodes, downloading a Docker image, starting that Docker image, attaching various NFS exports, and finally starting your Python process from an NFS share. This actually happens pretty quickly for all the moving parts, but it is certainly not as fast as starting a local process.

bd808 added a comment.Jul 11 2017, 9:55 PM
NOTE: All of the kubectl commands are easier if you first run source <(kubectl completion bash) in your shell. The completion code installed by this command knows how to tab complete almost everything.
$ source <(kubectl completion bash)
$ kubectl describe node<TAB>
$ kubectl describe node tools-worker-10<TAB><TAB>
tools-worker-1001.tools.eqiad.wmflabs  tools-worker-1015.tools.eqiad.wmflabs
tools-worker-1002.tools.eqiad.wmflabs  tools-worker-1016.tools.eqiad.wmflabs
tools-worker-1003.tools.eqiad.wmflabs  tools-worker-1017.tools.eqiad.wmflabs
tools-worker-1004.tools.eqiad.wmflabs  tools-worker-1018.tools.eqiad.wmflabs
tools-worker-1005.tools.eqiad.wmflabs  tools-worker-1019.tools.eqiad.wmflabs
tools-worker-1006.tools.eqiad.wmflabs  tools-worker-1020.tools.eqiad.wmflabs
tools-worker-1007.tools.eqiad.wmflabs  tools-worker-1021.tools.eqiad.wmflabs
tools-worker-1008.tools.eqiad.wmflabs  tools-worker-1022.tools.eqiad.wmflabs
tools-worker-1009.tools.eqiad.wmflabs  tools-worker-1023.tools.eqiad.wmflabs
tools-worker-1010.tools.eqiad.wmflabs  tools-worker-1025.tools.eqiad.wmflabs
tools-worker-1011.tools.eqiad.wmflabs  tools-worker-1026.tools.eqiad.wmflabs
tools-worker-1012.tools.eqiad.wmflabs  tools-worker-1027.tools.eqiad.wmflabs
tools-worker-1013.tools.eqiad.wmflabs  tools-worker-1028.tools.eqiad.wmflabs
tools-worker-1014.tools.eqiad.wmflabs  tools-worker-1029.tools.eqiad.wmflabs
$ kubectl describe node tools-worker-10