There should be a way to limit the input queue size in Celery.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Ladsgroup | T130210 [Epic] Deploy ORES extension as a Beta feature
Resolved | | Halfak | T140002 [Epic] Deploy ORES review tool
Resolved | | Ladsgroup | T130212 Deploy ORES review tool to wikidatawiki
Invalid | | Halfak | T106398 Revscoring tasks from Wikimania discussions
Resolved | | Halfak | T106867 [Epic] Deploy Revscoring/ORES service in Prod
Resolved | | akosiaris | T117560 New Service Request: ORES
Resolved | | Halfak | T115534 Set up backpressure for ORES (Limit queue sizes in Celery)
Event Timeline
OK. So, I've done a bunch of research. I'm amazed that there's no clear way to do this within the context of celery. AFAICT, the best way we could manage queue size would be to directly query (internally managed) celery keys in redis and ask for the length of the queue before calling task.apply_async().
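For reference, a minimal sketch of that approach, assuming the Redis broker and the default "celery" queue name; the task name and queue limit below are made up for illustration and are not the actual ORES code.

```python
# Sketch only: assumes the Redis broker and the default "celery" queue name.
# `score_revision` and MAX_QUEUE_SIZE are hypothetical, not the ORES patch.
import redis
from celery import Celery

app = Celery("ores", broker="redis://localhost:6379/0")
redis_client = redis.StrictRedis(host="localhost", port=6379, db=0)

MAX_QUEUE_SIZE = 100  # illustrative limit


@app.task
def score_revision(rev_id):
    ...  # placeholder for the actual scoring work


def enqueue_with_backpressure(rev_id):
    # With the Redis broker, each Celery queue is stored as a Redis list
    # keyed by the queue name, so LLEN gives the number of waiting messages.
    if redis_client.llen("celery") >= MAX_QUEUE_SIZE:
        raise RuntimeError("celery queue is full; rejecting request")
    return score_revision.apply_async(args=[rev_id])
```

Note that the check-then-enqueue is racy across multiple uwsgi workers, so the limit is soft, but that is good enough for back-pressure.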
I've filed a bug asking for a maxsize config on task queues and got a quick response. We'll see where that goes. It might make sense in the short term to look for other options for implementing back-pressure. For example, we might limit the number of parallel requests that can be handled by uwsgi. We might also look into reducing the task timeout. That way, celery would not let an item be queued for more than 'timeout' seconds.
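For the task-timeout idea, a hedged sketch of how that could look with Celery's `expires` option; `score_revision` is the same hypothetical task as above and 15 s is just an illustrative value, not the deployed configuration.

```python
# Sketch only: `score_revision` is the hypothetical task from the sketch above.
# `expires` tells workers to discard a message that has waited in the queue
# longer than this many seconds instead of executing it, so stale requests
# are dropped rather than scored long after the client has given up.
result = score_revision.apply_async(args=[rev_id], expires=15)
```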
Well, it's a big mess, but I've hacked it together. See https://github.com/wiki-ai/ores/pull/102
I've been stuck on a good way to test this in staging. I just figured out how to slow down celery without slowing down uwsgi, which makes the process replicable and lets me confirm that our backpressure is enacted by the celery queue.
I've pasted my notes from IRC below.
[17:17:56] <halfak> I just had an idea about what could be causing the queue to not fill up on staging.
[17:18:27] <halfak> Maybe by maxing CPU, I'm blocking the code that precedes addition to the celery queue as well as the celery jobs.
[17:18:42] <halfak> So, the uwsgi queue gets full,
[17:19:12] <halfak> In order to test this, I need to "stress" the celery workers without "stress"ing the uwsgi process that happens after the initial queue
[17:19:39] <halfak> I think I'm going to hack the code to induce a sleep() in the celery processor.
[17:19:54] <halfak> Since we load ORES from a submodule, this should be pretty easy.
[17:20:14] <halfak> Yup... I finished typing that and it still makes sense. ONWARD
[17:38:02] <halfak> It works!
[17:38:06] <halfak> YuviPanda, :DDDDDD
[17:38:07] <halfak> WOOOOOO
[17:38:33] <halfak> Lesson learned. When trying to stress test celery, don't also stress test uwsgi or it will behave weirdly.
[17:38:43] <halfak> Now to figure out how many uwsgi processes we really need.
[17:53:32] <halfak> So... any time I restart uwsgi now, I get a connection dropped with the staging server
[17:53:46] <halfak> But it still seems to restart
[17:57:13] <halfak> OK... So I think I've figured out another problem. uwsgi will hang onto a request for a long time. We should have uwsgi kill a request that takes celery-timeout (15s) + 5 seconds = 20 seconds.
[18:19:45] <halfak> Looks like we can hit our queue maxsize easily with 128 uwsgi processes, so I'll leave it there
[18:20:04] <halfak> We might want to cut that in half for prod since we'll have two uwsgi servers.
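The sleep() trick from the notes above might look roughly like this; `score_revision` and the 10-second delay are illustrative, not the actual change made to the ORES submodule.

```python
# Sketch of the slow-down trick: a sleep injected into the (hypothetical)
# celery-side scoring task so the queue fills up while uwsgi stays fast.
import time

from celery import Celery

app = Celery("ores", broker="redis://localhost:6379/0")


@app.task
def score_revision(rev_id):
    # Artificial delay: each celery worker now takes ~10 s per job, so
    # incoming requests accumulate in the queue and the maxsize /
    # back-pressure path can be exercised from a normal load test.
    time.sleep(10)
    return {"rev_id": rev_id, "score": None}  # placeholder result
```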
So, it looks like this patch works as expected.