
Review and fix file handle management in worker and celery processes
Closed, ResolvedPublic

Description

We're seeing large numbers of open file handles, exceeding the Python-Linux select() limit of 1024. Ideally, we can fix this problem by cutting down on the number of open files; less ideally, we can work around it by using select.poll instead of select.select.

Use lsof on a cluster worker to determine which type of process is hogging the file handles.

Review the code for correct file handling. Make sure we use context managers to close handles regardless of conditional branching. Forked full-weight processes should close all unnecessary file handles from the parent.
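
For illustration, a minimal sketch of both practices (hypothetical helper names, not actual ORES code):

import os

def read_scores(path):
    # The context manager closes the handle on every exit path,
    # including exceptions and early returns.
    with open(path) as f:
        return f.read()

def spawn_worker(inherited_fds):
    # Fork a full-weight worker and drop the parent's handles.
    pid = os.fork()
    if pid == 0:
        # Child: close every inherited descriptor it doesn't need,
        # keeping only stdin/stdout/stderr.
        for fd in inherited_fds:
            if fd > 2:
                try:
                    os.close(fd)
                except OSError:
                    pass
        # ... worker loop would run here ...
        os._exit(0)
    return pid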

Event Timeline

Change 374593 had a related patch set uploaded (by Awight; owner: Awight):
[operations/puppet@production] Let scoring platform team run "lsof" for diagnostics

https://gerrit.wikimedia.org/r/374593

SRE: I would like lsof permissions on the ORES boxes: https://gerrit.wikimedia.org/r/374593

Let me know!

Change 374593 merged by Ottomata:
[operations/puppet@production] Let scoring platform team run "lsof" for diagnostics

https://gerrit.wikimedia.org/r/374593

Change 374832 had a related patch set uploaded (by Awight; owner: Awight):
[operations/puppet@production] Give ores-admin users lsof, rather than the deployment user

https://gerrit.wikimedia.org/r/374832

Change 374832 merged by Ottomata:
[operations/puppet@production] Give ores-admin users lsof, rather than the deployment user

https://gerrit.wikimedia.org/r/374832

We'll need a manual puppet run on ores1001.

I'm seeing some weirdness, just poking at a running production box scb1002. The uwsgi processes each have exactly 269 file handles pointing to /dev/zero. Celery workers each have exactly 268 open FIFO pipes. This makes it sound like we're sharing file handles across forked processes, possibly creating new handles every time we fork?

These numbers don't seem to fluctuate, though, so I'm not sure why load testing would cause file handle starvation.
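
For anyone reproducing this, the same per-process tally can be pulled straight from /proc instead of lsof. A rough sketch (not the exact commands used here):

import os
from collections import Counter

def fd_targets(pid):
    # Count a process's open descriptors by what they point at.
    fd_dir = "/proc/{}/fd".format(pid)
    counts = Counter()
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # descriptor closed while we were looking
        # e.g. 'pipe', 'socket', 'anon_inode', '/dev/zero'
        counts[target.split(":")[0]] += 1
    return counts

# Example (hypothetical pid): print(fd_targets(1234))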

Change 377334 had a related patch set uploaded (by Awight; owner: Awight):
[research/ores/wheels@1.3.x] Bumps revscoring to 1.3.19

https://gerrit.wikimedia.org/r/377334

Change 377334 merged by Ladsgroup:
[research/ores/wheels@1.3.x] Bumps revscoring to 1.3.19

https://gerrit.wikimedia.org/r/377334

@akosiaris > We'll need a manual puppet run on ores1001.
Thanks!

I'm on clinic duty and was pinged for this, so I've re-enabled puppet and allowed it to run. (No errors.) I then re-disabled puppet per request, with a comment linking to this task.

So I just did the same, upon request, for ores1002. However, while it was running, I saw a number of user updates scroll past. It seems these are production network machines that sit without calling into puppet for a while, which seems like a security concern.

Chatted with @awight about this, and we're going to enable puppet back at the end of the day. This way hosts still call in and get updates (like user changes and removals) but are halted when they need to do active testing.

I'm stuck and unable to run the service (T175654), seemingly due to file handle issues. Restarting the celery-ores-worker service on ores1002 logs:

Unrecoverable error: OSError(24, 'Too many open files')
Traceback (most recent call last):
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/celery/worker/__init__.py", line 206, in start
    self.blueprint.start(self)
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/celery/bootsteps.py", line 123, in start
    step.start(parent)
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/celery/bootsteps.py", line 374, in start
    return self.obj.start()
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/celery/concurrency/base.py", line 131, in start
    self.on_start()
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/celery/concurrency/prefork.py", line 119, in on_start
    **self.options)
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/celery/concurrency/asynpool.py", line 367, in __init__
    for _ in range(processes))
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/celery/concurrency/asynpool.py", line 367, in <genexpr>
    for _ in range(processes))
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/celery/concurrency/asynpool.py", line 1018, in create_process_queues
    inq = _SimpleQueue(wnonblock=True)
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/billiard/queues.py", line 314, in __init__
    duplex=False, rnonblock=rnonblock, wnonblock=wnonblock,
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/billiard/__init__.py", line 93, in Pipe
    return Pipe(duplex, rnonblock, wnonblock)
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/billiard/py3/connection.py", line 540, in Pipe
    fd1, fd2 = os.pipe()
OSError: [Errno 24] Too many open files

The celery queue is full of unprocessed pending jobs, and uwsgi workers seem to be running but unable to do anything. The parent uwsgi process has 768 (suspiciously 256*3) open files, and child workers have 602 each. Parent has 170 open sockets, workers have 8. All uwsgi threads have 430 open connections to a deleted file.

FD_SETSIZE looks to be 1024, and the system FD limits are set to 65536.

So I just did the same, upon request, for ores1002. However, while it was running, I saw a number of user updates scroll past. It seems these are production network machines that sit without calling into puppet for a while, which seems like a security concern.

You are absolutely right of course. I did not expect the puppet disabling to last that long. Sorry about that. I just made sure puppet is enabled and has run on all those hosts.

Chatted with @awight about this, and we're going to enable puppet back at the end of the day. This way hosts still call in and get updates (like user changes and removals) but are halted when they need to do active testing.

Good call.

I think I've been wrong all this time. The system's total open file count is suspiciously close to the configured limit of 65536.

@ores1002:~$ sudo lsof | wc -l
54246
@ores1002:~$ sudo service celery-ores-worker restart;
@ores1002:~$ sudo lsof | wc -l
60007
@ores1002:~$ sudo lsof | wc -l
60127
@ores1002:~$ sudo lsof | wc -l
60278

###
# App has crashed here with logs,
File "/srv/deployment/ores/venv/lib/python3.4/site-packages/billiard/py3/connection.py", line 540, in Pipe
    fd1, fd2 = os.pipe()
OSError: [Errno 24] Too many open files
###
# See that open files have dropped.

@ores1002:~$ sudo lsof | wc -l
54246

ores1002 limits are,

ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 257256
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 257256
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

@akosiaris If you wouldn't mind increasing the -n limit to something more unreachable?

ores1002 limits are,

ulimit -a
...
open files                      (-n) 65536
...

@akosiaris If you wouldn't mind increasing the -n limit to something more unreachable?

That is NOT the system limit. That is the shell's limit (and inherited by processes spawned by the shell). The system limit is well above that, i.e.

$ /sbin/sysctl fs.file-max
fs.file-max = 6570496

The celery workers' per-process open-files limit can be found with

$ systemctl show -p LimitNOFILE celery-ores-worker.service 
LimitNOFILE=4096

Which means EACH AND EVERY celery worker process opens > 4k file descriptors and hence fails. Obviously we can bump it, and in fact we've done so temporarily during the stress tests, only to see the problem re-emerge. The value set was 16k IIRC and still the problem persisted. If you want, we can set it again temporarily so you can debug further.
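
For completeness, the limit that actually applies inside a running worker can also be read from the process itself with Python's resource module (a small sketch, not taken from the boxes):

import resource

# systemd's LimitNOFILE sets RLIMIT_NOFILE for the service;
# getrlimit returns the (soft, hard) pair the process sees.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft={} hard={}".format(soft, hard))  # e.g. soft=4096 hard=4096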

Thanks for correcting my misunderstanding of ulimit!

I don't expect that we're hitting the 4,096 file descriptors per-process limit. That's the soft limit, right?

Fruitless adventure into the bowels of os.pipe

Diving deeper into the code, os.pipe() is a built-in function defined in https://github.com/python/cpython/blob/v3.4.2/Modules/posixmodule.c#L8388
This implementation calls pipe2 with the O_CLOEXEC flag, only falling back to pipe if that syscall isn't available. I demonstrated that pipe2 works by compiling and running P6011, and by checking that Python was compiled with HAVE_PIPE2:

>>> sysconfig.get_config_var('HAVE_PIPE2')
1

pipe2 is setting errno to EMFILE, which means just what you might think,

[EMFILE]           Too many descriptors are active.
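
The EMFILE behaviour is easy to reproduce in isolation by lowering RLIMIT_NOFILE and looping on os.pipe(); a throwaway sketch (for a scratch interpreter, not a production box):

import os
import resource

# Lower the soft limit so exhaustion happens quickly.
resource.setrlimit(resource.RLIMIT_NOFILE, (64, 64))

pipes = []
try:
    while True:
        pipes.append(os.pipe())  # two new fds per call
except OSError as e:
    print(e.errno, e.strerror)  # 24 'Too many open files', i.e. EMFILE
finally:
    for r, w in pipes:
        os.close(r)
        os.close(w)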

Rabbithole contains a reasonably small family of rabbits.

Here's where it gets crazy, though. I was able to clone the deployment directory and run the celery server as my user, with horrifying results. It ran!

sudo lsof | wc -l
54280

python ores_celery.py
...

sudo lsof | wc -l
1166828

The number of open files skyrockets as the service starts. The parent process opens 2880 pipe endpoints (1440 pipes), and worker threads open about 2000 pipe endpoints. These numbers check out as pretty close to the expected, 2 uni-directional pipes x 2 endpoints per pipe x 480 workers = 1,920 endpoints. However, we're still only using 3,000 file descriptors for the parent process, so I can't explain why we would be hitting the per-process limit of 4k. The most suspicious lead I have at this point is that I'm able to run the service as my user, so it seems that service-specific or user-specific resource limits are indeed to blame.

IMO, 3-4k file descriptors per celery worker is extravagant, but not unreasonable. The count isn't running away to infinity; workers are forked fully initialized, and not overly often. I would like the limits raised so we can survive c. 500 workers at 4k file descriptors each. 16k would be a good jump just for debugging. Even once we clean up the workers and close unused pipes, the master will still be too close for comfort, at 75% of the current per-process limit.

We can schedule future development tasks to optimize this number down. I (intuitively) assume that most of the pipes are unused for worker threads, and were wastefully duplicated from the parent due to forking, but I could be wrong if the workers can elect master succession, etc.

Thanks for correcting my misunderstanding of ulimit!

I don't expect that we're hitting the 4,096 file descriptors per-process limit. That's the soft limit, right?

Both hard and soft.

Here's where it gets crazy, though. I was able to clone the deployment directory and run the celery server as my user, with horrifying results. It ran!

sudo lsof | wc -l
54280

python ores_celery.py
...

sudo lsof | wc -l
1166828

The number of open files skyrockets as the service starts. The parent process opens 2880 pipe endpoints (1440 pipes), and worker threads open about 2000 pipe endpoints.

Why would a worker have 2k pipe endpoints, though? Who does it need to communicate with aside from the master?

These numbers check out as pretty close to the expected, 2 uni-directional pipes x 2 endpoints per pipe x 480 workers = 1,920 endpoints. However, we're still only using 3,000 file descriptors for the parent process, so I can't explain why we would be hitting the per-process limit of 4k. The most suspicious lead I have at this point is that I'm able to run the service as my user, so it seems that service-specific or user-specific resource limits are indeed to blame.

Keep in mind that FDs are also used for every open network connection. Connections to redis will also consume FDs. How many redis connections would a process start?

IMO, 3-4k file descriptors per celery worker is extravagant, but not unreasonable.

Indeed it is not. We can bump it. Should I double it? Keep in mind that 4k fds per worker limits us to ~1500 workers per box (it's actually ~1600, but that does not count all the other fds used by the system), given the system-wide limit. Which is possibly high enough for now.

The count isn't running away to infinity; workers are forked fully initialized, and not overly often. I would like the limits raised so we can survive c. 500 workers at 4k file descriptors each. 16k would be a good jump just for debugging. Even once we clean up the workers and close unused pipes, the master will still be too close for comfort, at 75% of the current per-process limit.

500 workers per box, that is? Assuming those workers all reach 16k FDs, per the numbers above, that is a bit too much. I'd rather we tried 8k first. Is that OK with you?

We can schedule future development tasks to optimize this number down. I (intuitively) assume that most of the pipes are unused for worker threads, and were wastefully duplicated from the parent due to forking, but I could be wrong if the workers can elect master succession, etc.

Yes, let's please do, because 2,000 open pipe endpoints per worker don't make sense to me right now. Probably some pipes need to be closed on forking.

Change 378239 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Bump celery ores worker # files limit

https://gerrit.wikimedia.org/r/378239

Change 378239 merged by Alexandros Kosiaris:
[operations/puppet@production] Bump celery ores worker # files limit

https://gerrit.wikimedia.org/r/378239

In the interest of moving this forward, I've increased the limit to 8k fds. Let me know whether or not this is sufficient in tests.

@akosiaris This fixed the problem, thanks!

Reopening, I saw this error kill the celery worker on a few of our nodes during stress testing.

@akosiaris We're now estimating that 480 workers will eventually use 220k–1.2M file handles due to a glitch in Celery that may take some time to fix. How about we set the fileno limit on the service to 2M, and continue testing?

I think that we can eventually raise the worker count much higher than 480, but this is good for now. We'll need some additional fixes and better memory monitoring before attempting a higher number of workers.

Given that up to now we are configuring the PER PROCESS file limit, can I ask how these numbers were calculated and what they refer to? Given the way this is written, I am guessing that the implied per-process limit is in the range [450, 2500] (i.e. [220k/480, 1.2M/480]), and we are already well above that (we currently have an 8k limit). So something's wrong there (and that something may very well be my understanding of this).

I am also perplexed by the large range specified (220k-1.2M). Why is it that wide? And why is 2M the suggested value when the upper bound seems to be 1.2M? Finally, I am also curious what the glitch in celery you are referring to is. Got a link/document/anything handy I can read?

@akosiaris Good questions! I only just now found good documentation for the LimitNOFILE configuration variable, and you're right that I had given the limit as the expected total across all workers in the service. 8,192 should indeed be a good per-process limit. We'll get back to you here after discussing some more.

I used lsof to watch filehandle usage over the lifecycle of the celery service: P6112

What I learned:

  • It takes about 1min 30sec to start up all 480 worker child processes.
  • Once the service has booted, it's stable at 1,114,969 total open file handles across all threads in the process group.
  • Under heavy load that is still below maximum capacity, the number of open handles only climbs to 1,115,449 (exactly 480 more, one per worker).
  • During an unsupportably heavy load, the number of handles climbed by *1*.
  • At some point the service collapses, and all threads die.
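
The sampling behind those numbers was just repeated lsof counts (P6112); roughly this kind of loop, though not the exact script:

import subprocess
import time

# Sample the system-wide open-file count every few seconds while the
# service boots and is put under load.
while True:
    out = subprocess.check_output("sudo lsof | wc -l", shell=True)
    print(time.strftime("%H:%M:%S"), int(out.decode().strip()))
    time.sleep(5)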

Minor detail: at the moment the service dies, there is a flood of messages like:

statsd_send_metric()/sendto(): Resource temporarily unavailable [plugins/stats_pusher_statsd/plugin.c line 40]

The filehandle exhaustion error is logged at 20:26:09.570497591, the same second as the statsd errors. So that may be an effect, not a cause.

Now *this* is interesting. The Celery code involved in the failure, https://github.com/celery/celery/blob/3.1/celery/concurrency/asynpool.py#L141
has been rewritten to use poll() instead of select().
https://github.com/celery/celery/blob/master/celery/concurrency/asynpool.py#L121

This was changed in response to the same bug we're experiencing: https://github.com/celery/celery/issues/2373

I think it would be safe to backport the fix into Celery 3.1, but we'd have to weigh the drawbacks of using a forked library, vs. doing the work to get ourselves compatible with Celery 4.
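
For anyone following along, the relevant difference is that select() is capped at FD_SETSIZE (1024) descriptors no matter how high the rlimit is, while poll() has no such ceiling. A standalone sketch of the distinction, not the Celery code itself:

import os
import select

r, w = os.pipe()

# select.select() rejects any descriptor numbered >= FD_SETSIZE (1024);
# given a high-numbered fd it raises
# "ValueError: filedescriptor out of range in select()".
select.select([r], [], [], 0)

# select.poll() is keyed by fd value and has no FD_SETSIZE ceiling,
# so it keeps working however many files the process has open.
p = select.poll()
p.register(r, select.POLLIN)
p.poll(0)

os.close(r)
os.close(w)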

Before we have that discussion, we could just give it a shot to see if it works as expected -- *then* have the discussion.

Sounds good. It also sounds like the Celery developers would be open to a contributed backport.

I just figured out how to upgrade to celery 4.1.0. See T178441. Once that passes review, we can give it a try to see if we still have the filehandle issue.

@Halfak Potentially good news: using your Celery 4 patch, I was able to start the service under mw-vagrant and get beyond the select() limit of 1024 open files, until I hit the wall on configured memory :D. Next step is to re-run the stress test on the new ORES cluster, with this code. Great work!

This can be resolved for now; it's a facet of the Celery 4 work.