
Raise tool memory limit - similarity
Closed, Invalid · Public

Description

Hi, I'm experimenting with a new tool named similarity, which attempts to load a big dataset into memory at startup. It never finishes starting up, and after a few minutes I get the following in uwsgi.log:

detected binary path: /usr/bin/uwsgi-core
your processes number limit is 63707
your process address space limit is 4294967296 bytes (4096 MB)
your memory page size is 4096 bytes
detected max file descriptor number: 1024
lock engine: pthread robust mutexes
(...)
SIGINT/SIGQUIT received...killing workers...
worker 1 buried after 1 seconds
worker 2 buried after 1 seconds
worker 3 buried after 1 seconds
worker 4 buried after 1 seconds
goodbye to uWSGI.

When running it locally on my laptop, I see the tool taking up 3.6GiB of memory, with 1.2GiB RSS. It seems plausible that it's reaching the 4GiB limit and receiving a SIGQUIT because of that - but please correct me if this assumption is incorrect.

I do plan on making the tool more memory-efficient eventually, but if memory is indeed the issue, would it be possible to increase the limit so I can continue prototyping? Judging from the memory consumption locally, a small bump would be enough, but ideally I'd like, say, 10GiB so I won't need to worry about it again soon.
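For reference, here's a minimal sketch (not part of the tool, Linux-only) of how the address space limit and current virtual size could be checked from inside the process at the end of startup, to confirm whether the 4 GiB cap is what's being hit:

# Sketch only: print the address space limit and the process's current
# virtual memory figures, to compare against the 4096 MB uwsgi reports.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_AS)
print("RLIMIT_AS soft/hard: %s / %s bytes" % (soft, hard))

# /proc/self/status is Linux-specific; VmSize is the current virtual size,
# VmPeak the high-water mark, VmRSS the resident set.
with open("/proc/self/status") as f:
    for line in f:
        if line.startswith(("VmPeak", "VmSize", "VmRSS")):
            print(line.strip())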

Thanks!

Event Timeline

I've done some large-scale processing, but needing more than 4GB of RAM is a sign of bad programming. Fix that first; don't plan to come back later to reduce usage, because by then it's too late.

Hmm, I did try a couple more low-hanging optimizations to reduce the size of the data, which does make a difference locally. Still no luck launching the tool on the grid, though.

However, I do see something interesting: if I repeatedly run qstat -j ${JOB} | grep vmem to get information on my job as it starts up, I get output like the following:

hard resource_list:         h_vmem=4G,release=trusty
usage    1:                 cpu=00:00:18, mem=12.32589 GBs, io=0.30038, vmem=1.113G, maxvmem=1.113G
hard resource_list:         h_vmem=4G,release=trusty
usage    1:                 cpu=00:00:18, mem=12.32589 GBs, io=0.30038, vmem=1.113G, maxvmem=1.113G
hard resource_list:         h_vmem=4G,release=trusty
usage    1:                 cpu=00:00:18, mem=12.32589 GBs, io=0.30038, vmem=1.113G, maxvmem=1.113G
hard resource_list:         h_vmem=4G,release=trusty
(...)
usage    1:                 cpu=00:00:18, mem=12.32589 GBs, io=0.30038, vmem=1.113G, maxvmem=1.113G
hard resource_list:         h_vmem=4G,release=trusty
usage    1:                 cpu=00:00:18, mem=12.70640 GBs, io=0.42031, vmem=1.231G, maxvmem=1.231G
hard resource_list:         h_vmem=4G,release=trusty
usage    1:                 cpu=00:00:18, mem=12.70640 GBs, io=0.42031, vmem=1.231G, maxvmem=1.231G
hard resource_list:         h_vmem=4G,release=trusty
usage    1:                 cpu=00:00:18, mem=12.70640 GBs, io=0.42031, vmem=1.231G, maxvmem=1.231G
hard resource_list:         h_vmem=4G,release=trusty
usage    1:                 cpu=00:00:18, mem=12.70640 GBs, io=0.42031, vmem=1.231G, maxvmem=1.231G
hard resource_list:         h_vmem=4G,release=trusty
usage    1:                 cpu=00:00:18, mem=12.70640 GBs, io=0.42031, vmem=1.231G, maxvmem=1.231G
hard resource_list:         h_vmem=4G,release=trusty
usage    1:                 cpu=00:00:18, mem=12.70640 GBs, io=0.42031, vmem=1.231G, maxvmem=1.231G
Following jobs do not exist:
9906175

I'm not really sure I'm reading that correctly, but I suppose that means the memory usage peaks at 1.231G, which is not really near the limit of 4G, right? If that's the case, then memory isn't really the problem here...
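For completeness, here's the kind of small poller that could be used instead of re-running qstat by hand. This is just a sketch, assuming qstat is on the PATH and using the job id from above:

# Poll qstat every few seconds and print the vmem lines until the job goes away.
import subprocess
import time

JOB = "9906175"
while True:
    try:
        out = subprocess.check_output(["qstat", "-j", JOB])
    except subprocess.CalledProcessError:
        # qstat exits non-zero once the job is gone ("Following jobs do not exist")
        break
    for line in out.splitlines():
        if b"vmem" in line:
            print(line.decode().strip())
    time.sleep(5)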

Is the source code available somewhere? Maybe someone could do some memory profiling and see what is going on.

After some fiddling with qacct, this is the end state before the process was killed:

valhallasw@tools-bastion-02:~/accountingtools$ qacct -j 9906175 -f similarity.acct
==============================================================
qname        webgrid-generic
hostname     tools-webgrid-generic-1403.eqiad.wmflabs
group        tools.similarity
owner        tools.similarity
project      NONE
department   defaultdepartment
jobname      uwsgi-python-similarity
jobnumber    9906175
taskid       undefined
account      sge
priority     0
qsub_time    Sat Sep 23 02:22:35 2017
start_time   Sat Sep 23 02:22:36 2017
end_time     Sat Sep 23 02:30:36 2017
granted_pe   NONE
slots        1
failed       0
exit_status  0
ru_wallclock 480
ru_utime     16.639
ru_stime     7.398
ru_maxrss    858036
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    328102
ru_majflt    0
ru_nswap     0
ru_inblock   845504
ru_oublock   56
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     10322
ru_nivcsw    21000
cpu          24.038
mem          18.383
io           0.420
iow          0.000
maxvmem      6.093G
arid         undefined

Note that maxvmem is over 6GB.

There's quite a bit of time between the start and end times, which suggests the failure only happens after several requests come in. Maybe each request loads a bunch of stuff again, and that memory is not being released?

The code is here: https://github.com/eggpi/similarity/blob/master/app.py . Basically I'm loading a ~400MiB pickled file from the tool's home directory at start-up. That file contains a sparse matrix that gets traversed on each request.
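Roughly, the startup path looks like this (a simplified sketch; the names and path are illustrative, not the tool's actual code):

# Unpickle a large sparse matrix once at import time so every request can reuse it.
import pickle

MODEL_PATH = "/data/project/similarity/model.pkl"  # hypothetical path

def load_model(path):
    with open(path, "rb") as f:
        return pickle.load(f)  # ~400 MiB on disk, a few GiB once expanded in memory

model = load_model(MODEL_PATH)  # runs on "import app", before workers are forked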

Interestingly, the tool gets killed without ever receiving any requests that I'm aware of. Here's uwsgi.log with a bit more logging:

*** Starting uWSGI 1.9.17.1-debian (64bit) on [Sun Sep 24 13:03:45 2017] ***
compiled with version: 4.8.2 on 23 March 2014 17:15:32
os: Linux-3.13.0-100-generic #147-Ubuntu SMP Tue Oct 18 16:48:51 UTC 2016
nodename: tools-webgrid-generic-1403
machine: x86_64
clock source: unix
pcre jit disabled
detected number of CPU cores: 4
current working directory: /mnt/nfs/labstore-secondary-tools-project/similarity
detected binary path: /usr/bin/uwsgi-core
your processes number limit is 63707
your process address space limit is 4294967296 bytes (4096 MB)
your memory page size is 4096 bytes
detected max file descriptor number: 1024
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uwsgi socket 0 bound to TCP address :38443 fd 3
Python version: 2.7.6 (default, Oct 26 2016, 20:33:43)  [GCC 4.8.4]
Set PythonHome to /data/project/similarity/www/python/venv
*** Python threads support is disabled. You can enable it with --enable-threads ***
Python main interpreter initialized at 0x22f8b40
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 363960 bytes (355 KB) for 4 cores
*** Operational MODE: preforking ***
mounting /data/project/similarity/www/python/src/app.py on /similarity
datetime.datetime(2017, 9, 24, 13, 3, 54, 220494) starting up
datetime.datetime(2017, 9, 24, 13, 4, 10, 805024) model loaded!
WSGI app 0 (mountpoint='/similarity') ready in 25 seconds on interpreter 0x22f8b40 pid: 18357 (default app)
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 18357)
spawned uWSGI worker 1 (pid: 18386, cores: 1)
spawned uWSGI worker 2 (pid: 18387, cores: 1)
spawned uWSGI worker 3 (pid: 18388, cores: 1)
spawned uWSGI worker 4 (pid: 18389, cores: 1)
SIGINT/SIGQUIT received...killing workers...
worker 1 buried after 1 seconds
worker 2 buried after 1 seconds
worker 3 buried after 2 seconds
worker 4 buried after 2 seconds
goodbye to uWSGI.

It starts up, loads the data in about 20s, tries to spawn worker processes and gets killed. I tried just importing app.py, causing the pickled file to be loaded, then stopping the process and looking at its memory consumption:

$ python
Python 2.7.6 (default, Oct 26 2016, 20:30:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.getpid()
23547
>>> import app
>>>
[1]+  Stopped                 python
$ ps -o comm,pid,rss,vsz
COMMAND           PID   RSS    VSZ
bash            10399  6244  44816
python          23547 847456 1487204
ps              24216   904  35204

No surprises there, 1.5GiB virtual memory.
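The same check can be made from inside the interpreter. A sketch, assuming the tool's src directory is on sys.path; ru_maxrss is the same figure qacct reports above (kilobytes on Linux):

import resource

import app  # the tool's module; assumes we're running from its src directory

peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("peak RSS after loading the model: %.1f MiB" % (peak_kib / 1024.0))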

I then tried setting processes = 1 in my uwsgi.ini, and the tool now launches properly. It gets killed when serving requests, though (example: job 9965471). When running locally, I do see a vmem spike while serving a request (to around 4GiB I think, then back down to 1GiB), so this is not very surprising.

In short, my guess is that the worker processes are not sharing memory for some reason, making the tool's memory usage roughly 4x what it should be. Is this expected or a known issue? Or am I just doing the wrong thing in the tool?
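One way to test that guess (a sketch, Linux-only) would be to sum the Private_*/Shared_* fields from /proc/<pid>/smaps for each worker; if copy-on-write sharing were working, most of the model's pages should show up as Shared_Clean rather than Private_Dirty:

def smaps_summary(pid):
    # Sum the per-mapping counters from /proc/<pid>/smaps (values are in kB).
    totals = {"Shared_Clean": 0, "Shared_Dirty": 0,
              "Private_Clean": 0, "Private_Dirty": 0}
    with open("/proc/%d/smaps" % pid) as f:
        for line in f:
            key = line.split(":")[0]
            if key in totals:
                totals[key] += int(line.split()[1])
    return totals

for pid in (18386, 18387, 18388, 18389):  # the worker pids from the log above
    print("%d: %r" % (pid, smaps_summary(pid)))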

As for the memory usage spikes on each request, I can do a bit more work in the following days to see if I can bring that down, though the spiky nature of it might make it harder to fix.
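If the spike comes from densifying the matrix on each request, iterating over the sparse structure directly would keep the working set small. This is only a sketch assuming the model is a scipy.sparse CSR matrix, not the tool's actual request handler:

import numpy as np
from scipy.sparse import csr_matrix

def top_similar(model, row_index, k=10):
    # Return up to k (column, score) pairs from one row of a sparse
    # similarity matrix without materializing a dense copy.
    row = model.getrow(row_index)           # still sparse, cheap
    if row.nnz == 0:
        return []
    order = np.argsort(row.data)[::-1][:k]  # sort only the nonzero entries
    return list(zip(row.indices[order], row.data[order]))

# Tiny usage example with a toy matrix:
toy = csr_matrix(np.array([[0.0, 0.3, 0.9], [0.1, 0.0, 0.0]]))
print(top_similar(toy, 0, k=2))  # prints the two largest entries of row 0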

I'm most likely going to throw away this prototype, so there's no point pursuing this. Thank you all for your input!

For the curious, https://lists.wikimedia.org/pipermail/cloud/2017-September/000003.html has a bit of discussion.