Clean up file handle and Redis connection management in ORES worker and celery processes
Closed, Resolved (Public)

Description

In T174402, we had to raise the per-process limit to 8192 file handles. This is more than we should need. Investigate and fix.

Event Timeline

OK, so I checked on this, and it looks like there's no effect at all on the file-handle count from dropping the Redis connection in the celery workers. Each celery worker has about 40 open file handles at any given time.

From https://github.com/celery/celery/issues/887 it looks like these file handles are used to auto-restart workers. I'm not quite sure what that means for us, but we might experiment with turning that functionality off (if we have it enabled at all).

Maybe rerun the test with sudo lsof | wc -l. There's an argument to limit the output to a single process ID.

$ lsof -p 19546 
COMMAND   PID   USER   FD      TYPE DEVICE SIZE/OFF    NODE NAME
python3 19546 halfak  cwd       DIR  252,1     4096 1976686 /home/halfak/projects/ores
python3 19546 halfak  rtd       DIR  252,1     4096       2 /
python3 19546 halfak  txt       REG  252,1  4439120 2626657 /home/halfak/venv/3.5/bin/python3
python3 19546 halfak  mem       REG  252,1    15432 4592097 /usr/lib/python3.5/lib-dynload/resource.cpython-35m-x86_64-linux-gnu.so
python3 19546 halfak  mem       REG  252,1   392528 1189091 /home/halfak/venv/3.5/lib/python3.5/site-packages/msgpack/_unpacker.cpython-35m-x86_64-linux-gnu.so
python3 19546 halfak  mem       REG  252,1   374744 1193059 /home/halfak/venv/3.5/lib/python3.5/site-packages/msgpack/_packer.cpython-35m-x86_64-linux-gnu.so
python3 19546 halfak  mem       REG  252,1    29512 3148610 /home/halfak/venv/3.5/lib/python3.5/site-packages/cryptography/hazmat/bindings/_constant_time.cpython-35m-x86_64-linux-gnu.so
... snip ...
python3 19546 halfak   32w     FIFO   0,10      0t0  251034 pipe
python3 19546 halfak   33r     FIFO   0,10      0t0  251035 pipe
python3 19546 halfak   34w     FIFO   0,10      0t0  251035 pipe
python3 19546 halfak   35r     FIFO   0,10      0t0  251036 pipe
python3 19546 halfak   36w     FIFO   0,10      0t0  251036 pipe
python3 19546 halfak   37r     FIFO   0,10      0t0  251037 pipe
python3 19546 halfak   38w     FIFO   0,10      0t0  251037 pipe
python3 19546 halfak   39w     FIFO   0,10      0t0  251038 pipe

Looks like most of the lines are *.so files. Let's skip those.

$ lsof -p 19546 | grep -vP ".so($|.[0-9])" 
COMMAND   PID   USER   FD      TYPE DEVICE SIZE/OFF    NODE NAME
python3 19546 halfak  cwd       DIR  252,1     4096 1976686 /home/halfak/projects/ores
python3 19546 halfak  rtd       DIR  252,1     4096       2 /
python3 19546 halfak  txt       REG  252,1  4439120 2626657 /home/halfak/venv/3.5/bin/python3
python3 19546 halfak  mem       REG  252,1  2981280 4123211 /usr/lib/locale/locale-archive
python3 19546 halfak  mem       REG  252,1    26258 4592340 /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache
python3 19546 halfak    0u      CHR  136,1      0t0       4 /dev/pts/1
python3 19546 halfak    1u      CHR  136,1      0t0       4 /dev/pts/1
python3 19546 halfak    2u      CHR  136,1      0t0       4 /dev/pts/1
python3 19546 halfak    3r      CHR    1,9      0t0      11 /dev/urandom
python3 19546 halfak    4u  a_inode   0,11        0    7041 [eventpoll]
python3 19546 halfak    5u  a_inode   0,11        0    7041 [eventpoll]
python3 19546 halfak    6r      CHR    1,3      0t0       6 /dev/null
python3 19546 halfak    7r     FIFO   0,10      0t0  251022 pipe
python3 19546 halfak    8w     FIFO   0,10      0t0  251022 pipe
python3 19546 halfak    9r     FIFO   0,10      0t0  251023 pipe
python3 19546 halfak   10w     FIFO   0,10      0t0  251023 pipe
python3 19546 halfak   11r     FIFO   0,10      0t0  251024 pipe
python3 19546 halfak   12w     FIFO   0,10      0t0  251024 pipe
python3 19546 halfak   13r     FIFO   0,10      0t0  251025 pipe
python3 19546 halfak   14w     FIFO   0,10      0t0  251025 pipe
python3 19546 halfak   15r     FIFO   0,10      0t0  251026 pipe
python3 19546 halfak   16w     FIFO   0,10      0t0  251026 pipe
python3 19546 halfak   17r     FIFO   0,10      0t0  251027 pipe
python3 19546 halfak   18w     FIFO   0,10      0t0  251027 pipe
python3 19546 halfak   19r     FIFO   0,10      0t0  251028 pipe
python3 19546 halfak   20w     FIFO   0,10      0t0  251028 pipe
python3 19546 halfak   21r     FIFO   0,10      0t0  251029 pipe
python3 19546 halfak   22w     FIFO   0,10      0t0  251029 pipe
python3 19546 halfak   23r     FIFO   0,10      0t0  251030 pipe
python3 19546 halfak   24w     FIFO   0,10      0t0  251030 pipe
python3 19546 halfak   25r     FIFO   0,10      0t0  251031 pipe
python3 19546 halfak   26w     FIFO   0,10      0t0  251031 pipe
python3 19546 halfak   27r     FIFO   0,10      0t0  251032 pipe
python3 19546 halfak   30w     FIFO   0,10      0t0  251033 pipe
python3 19546 halfak   31r     FIFO   0,10      0t0  251034 pipe
python3 19546 halfak   32w     FIFO   0,10      0t0  251034 pipe
python3 19546 halfak   33r     FIFO   0,10      0t0  251035 pipe
python3 19546 halfak   34w     FIFO   0,10      0t0  251035 pipe
python3 19546 halfak   35r     FIFO   0,10      0t0  251036 pipe
python3 19546 halfak   36w     FIFO   0,10      0t0  251036 pipe
python3 19546 halfak   37r     FIFO   0,10      0t0  251037 pipe
python3 19546 halfak   38w     FIFO   0,10      0t0  251037 pipe
python3 19546 halfak   39w     FIFO   0,10      0t0  251038 pipe

Looks like we have a lot of FIFO pipes. The other FDs seem to be normal. Let's focus on those FIFOs.

$ lsof -p 19546 | grep FIFO | wc -l
31

OK now to check all of the processes.

$ ps aux | grep celery
halfak   19536  0.6  1.0 628616 82076 pts/1    S+   14:38   0:03 python3 ./utility applications.celery
halfak   19546  0.0  0.7 626004 60292 pts/1    S+   14:38   0:00 python3 ./utility applications.celery
halfak   19547  0.0  0.7 626004 60292 pts/1    S+   14:38   0:00 python3 ./utility applications.celery
halfak   19548  0.0  0.7 626004 60292 pts/1    S+   14:38   0:00 python3 ./utility applications.celery
halfak   19549  0.0  0.7 626004 60296 pts/1    S+   14:38   0:00 python3 ./utility applications.celery
halfak   19550  0.0  0.7 626004 60296 pts/1    S+   14:38   0:00 python3 ./utility applications.celery
halfak   19551  0.0  0.7 626004 60296 pts/1    S+   14:38   0:00 python3 ./utility applications.celery
halfak   19552  0.0  0.7 626004 60304 pts/1    S+   14:38   0:00 python3 ./utility applications.celery
halfak   19553  0.0  0.7 626004 60304 pts/1    S+   14:38   0:00 python3 ./utility applications.celery
halfak   20135  0.0  0.0  14224   972 pts/2    S+   14:48   0:00 grep --color=auto celery
$ lsof -p 19536 | grep FIFO | wc -l
48
$ lsof -p 19546 | grep FIFO | wc -l
31
$ lsof -p 19547 | grep FIFO | wc -l
32
$ lsof -p 19548 | grep FIFO | wc -l
33
$ lsof -p 19549 | grep FIFO | wc -l
34
$ lsof -p 19550 | grep FIFO | wc -l
35
$ lsof -p 19551 | grep FIFO | wc -l
36
$ lsof -p 19552 | grep FIFO | wc -l
37
$ lsof -p 19553 | grep FIFO | wc -l
38

Looks like the master process has more FIFO handles, and each child process has one more FIFO than the previous one (31, 32, ..., 38).
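
As a side note, these per-process counts can be reproduced without running lsof by hand for every PID by reading /proc/<pid>/fd directly. The helper below is just an illustration (it is not part of ORES); counting links whose target starts with "pipe:" matches what lsof -p <pid> | grep FIFO | wc -l reports on Linux.

import os
import sys

# Illustrative helper (not ORES code): count pipe file descriptors for a
# process by reading /proc/<pid>/fd -- roughly `lsof -p <pid> | grep FIFO`.
def fifo_count(pid):
    fd_dir = "/proc/{}/fd".format(pid)
    count = 0
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # the fd was closed while we were iterating
        if target.startswith("pipe:"):
            count += 1
    return count

if __name__ == "__main__":
    for pid in sys.argv[1:]:
        print(pid, fifo_count(int(pid)))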

This was all running the default ORES server (in celery mode) with no real models. Now to start up prod ORES with all the models.

$ ps aux | grep celery
halfak   20390 16.9 13.6 1800616 1070968 pts/1 S+   14:53   0:14 python ores_celery.py
halfak   20447  0.0 13.5 1798240 1068236 pts/1 S+   14:53   0:00 python ores_celery.py
halfak   20450  0.0 13.5 1798240 1068236 pts/1 S+   14:53   0:00 python ores_celery.py
halfak   20565  0.0  0.0  14224   964 pts/2    S+   14:54   0:00 grep --color=auto celery
$ lsof -p 20390 | grep FIFO | wc -l
12
$ lsof -p 20447 | grep FIFO | wc -l
7
$ lsof -p 20450 | grep FIFO | wc -l
8

Oh... that's surprising. There are fewer FIFOs. Maybe it's because I'm running fewer workers. Let's bump that up to 8.

$ ps aux | grep celery
halfak   20666 42.6 13.9 1801436 1097968 pts/1 S+   14:57   0:13 python ores_celery.py
halfak   20679  0.0 13.6 1799172 1070880 pts/1 S+   14:57   0:00 python ores_celery.py
halfak   20680  0.0 13.6 1799172 1070880 pts/1 S+   14:57   0:00 python ores_celery.py
halfak   20681  0.0 13.6 1799172 1070880 pts/1 S+   14:57   0:00 python ores_celery.py
halfak   20682  0.0 13.6 1799172 1070880 pts/1 S+   14:57   0:00 python ores_celery.py
halfak   20683  0.0 13.6 1799172 1070880 pts/1 S+   14:57   0:00 python ores_celery.py
halfak   20684  0.0 13.6 1799172 1070880 pts/1 S+   14:57   0:00 python ores_celery.py
halfak   20685  0.0 13.6 1799172 1070880 pts/1 S+   14:57   0:00 python ores_celery.py
halfak   20686  0.0 13.6 1799172 1070880 pts/1 S+   14:57   0:00 python ores_celery.py
halfak   20698  0.0  0.0  14224   932 pts/2    S+   14:57   0:00 grep --color=auto celery
$ lsof -p 20666 | grep FIFO | wc -l
48
$ lsof -p 20679 | grep FIFO | wc -l
31
$ lsof -p 20680 | grep FIFO | wc -l
32
$ lsof -p 20681 | grep FIFO | wc -l
33
$ lsof -p 20682 | grep FIFO | wc -l
34
$ lsof -p 20683 | grep FIFO | wc -l
35
$ lsof -p 20684 | grep FIFO | wc -l
36
$ lsof -p 20685 | grep FIFO | wc -l
37
$ lsof -p 20686 | grep FIFO | wc -l
38

OK. That's expected then. But it is curious that the number goes up so much between 2 and 8 workers. Let's try something bigger -- like 32 workers.

halfak   20823  9.4 13.9 1801736 1098556 pts/1 S+   15:00   0:22 python ores_celery.py
halfak   20839  0.0 13.6 1799360 1071072 pts/1 S+   15:00   0:00 python ores_celery.py
... snip ...
$ lsof -p 20839 | wc -l
300
$ lsof -p 20823 | wc -l
376

So it looks like these FIFOs balloon when we add more workers. I just tried starting up 480 workers (same as ORES* cluster nodes) and was not able to get past the OSError for too many open files.

Looks like this is a known thing: the number of connections between processes follows the "handshake problem" if each process holds a full set of connections to all previously launched workers. That total would grow as the triangular numbers. The 8th triangular number is 36, the 32nd is 528, and the 480th is 115,440 -- not a perfect match to the counts above, but it looks like a closely related curve.
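
To make the curve comparison concrete, here is the arithmetic behind that guess (the claim that pipe counts follow this curve is the hypothesis above, not a measurement):

# Triangular-number arithmetic for the handshake-problem guess.
def triangular(n):
    return n * (n + 1) // 2

for n in (8, 32, 480):
    print(n, triangular(n))
# 8 36
# 32 528
# 480 115440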

That's probably because the master process's open file handles are forked into each child. We should be able to close these at the beginning of each child process's execution and eliminate at least one major source of FH leakage!
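
A minimal sketch of that direction, assuming celery's worker_process_init signal (which fires in each forked child of the prefork pool). Which inherited descriptors are actually safe to close is deployment-specific, so this version only reports them; the same hook would also be a natural place to drop and re-open the child's Redis connection so it doesn't share the parent's socket.

import os
from celery.signals import worker_process_init

# Sketch only, not the ORES fix: at child start-up, enumerate descriptors
# inherited from the master. Deciding which of these can be closed (versus
# being needed by the pool itself) is the actual work.
@worker_process_init.connect
def report_inherited_pipes(**kwargs):
    fd_dir = "/proc/self/fd"
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # fd vanished while iterating
        if target.startswith("pipe:"):
            print("inherited pipe fd", fd, target)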

One not-quite-accounted-for detail is that the number of file handles seems to rise during my stress tests. Following the above theory, the master process at that point holds the full set of child pipes, so each new child will inherit all N pipes, which works out to roughly N^2 total file handles (higher than the triangular number).

I just bumped my ulimit to be able to start up a dev instance with 480 workers. I got

$ lsof -p 4516 | grep FIFO | wc -l
1919

So we should expect the 480th worker to have 1919 + 480 = 2399 open FIFOs. To be conservative, let's say that every worker has that many FIFOs.

2399 * 481 = 1,153,919 FIFO pipes.

By the N^2 guess, we would need 480 x 480 = 230,400 file handles. Cool, so let's take the max of our guesses and try a limit of 2M?

Based on previous estimates, it looks like we open about 173 file handles for each worker -- mostly for *.so files.

So I'd like to update my estimate to (2399 + 173) * 481 = 1,237,132.
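
For the record, the arithmetic behind these estimates, using only the numbers quoted above:

workers = 480
fifos_first_worker = 1919                        # measured with 480 workers
per_worker_fifos = fifos_first_worker + workers  # 2399, conservative
so_handles = 173                                 # approx. non-FIFO handles per worker

print(per_worker_fifos * (workers + 1))                  # 1153919
print(workers * workers)                                 # 230400 (the N^2 guess)
print((per_worker_fifos + so_handles) * (workers + 1))   # 1237132 (updated estimate)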

awight claimed this task.