
Set up backpressure for ORES (Limit queue sizes in Celery)
Closed, Resolved · Public

Description

There should be a way to limit the input queue size in Celery.

Event Timeline

Halfak claimed this task.
Halfak raised the priority of this task to Needs Triage.
Halfak updated the task description.
Halfak moved this task to Backlog on the Machine-Learning-Team (Active Tasks) board.
Halfak subscribed.

OK. So, I've done a bunch of research. I'm amazed that there's no clear way to do this within the context of celery. AFAICT, the best way we could manage queue size would be to directly query (internally managed) celery keys in redis and ask for the length of the queue before calling task.apply_async().
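
For illustration, a minimal sketch of that workaround. With the Redis broker, each queue's pending messages live in a Redis list named after the queue, so we can peek at its length before enqueueing. The queue name, limit, and the `score`/`enqueue_score` names below are assumptions, not ORES's actual code:

```python
# Sketch only: back-pressure by peeking at Celery's (internally managed)
# Redis key before calling apply_async().
import redis
from celery import Celery

app = Celery('ores', broker='redis://localhost:6379/0')
redis_conn = redis.StrictRedis(host='localhost', port=6379, db=0)

QUEUE_NAME = 'celery'    # default queue name; the Redis list key matches it
QUEUE_MAXSIZE = 100      # hypothetical limit

@app.task
def score(rev_id):
    # real scoring work would happen here
    return rev_id

def enqueue_score(rev_id):
    # llen() reports how many messages are waiting in the queue's list.
    if redis_conn.llen(QUEUE_NAME) >= QUEUE_MAXSIZE:
        raise RuntimeError("celery queue is full; rejecting request")
    return score.apply_async(args=(rev_id,))
```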

I've filed a bug asking for a maxsize config on task queues and got a quick response. We'll see where that goes. In the short term, it might make sense to look at other options for implementing back-pressure. For example, we might limit the number of parallel requests that can be handled by wsgi. We might also look into reducing the task timeout; that way, celery would not let an item be queued for more than 'timeout' seconds.
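
As a rough sketch of that timeout idea, Celery's `expires` argument to `apply_async()` makes a worker revoke a task it only receives after the given number of seconds (continuing the hypothetical `score` task sketched above):

```python
# Continuing the sketch above: drop tasks that sit in the queue too long.
# expires= tells the worker to revoke a task it only picks up after N seconds.
result = score.apply_async(args=(rev_id,), expires=15)  # 15s is illustrative
```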

Well, it's a big mess, but I've hacked it together. See https://github.com/wiki-ai/ores/pull/102

I've been stuck on a good way to test this in staging. I just figured out how to slow down celery without slowing down uwsgi, which makes the behavior reproducible and lets me confirm that our backpressure is actually enacted by the celery queue.

I've pasted my notes from IRC below; a sketch of the slow-down hack follows the log.

[17:17:56] <halfak> I just had an idea about what could be causing the queue to not fill up on staging. 
[17:18:27] <halfak> Maybe by maxing CPU, I'm blocking the code that precedes addition to the celery queue as well as the celery jobs. 
[17:18:42] <halfak> So, the uwsgi queue gets full, 
[17:19:12] <halfak> In order to test this, I need to "stress" the celery workers without "stress"ing the uwsgi process that happens after the initial queue.
[17:19:39] <halfak> I think I'm going to hack the code to induce a sleep() in the celery processor. 
[17:19:54] <halfak> Since we load ORES from a submodule, this should be pretty easy. 
[17:20:14] <halfak> Yup... I finished typing that and it still makes sense.  ONWARD
[17:38:02] <halfak> It works! 
[17:38:06] <halfak> YuviPanda, :DDDDDD
[17:38:07] <halfak> WOOOOOO
[17:38:33] <halfak> Lesson learned.  When trying to stress test celery, don't also stress test uwsgi or it will behave weirdly. 
[17:38:43] <halfak> Now to figure out how many uwsgi processes we really need. 
[17:53:32] <halfak> So... any time I restart uwsgi now, I get a connection dropped with the staging server
[17:53:46] <halfak> But it still seems to restart
[17:57:13] <halfak> OK... So I think I've figured out another problem.  uwsgi will hang onto a request for a long time.  We should have uwsgi kill a request that takes celery-timeout (15s) + 5 seconds = 20 seconds.
[18:19:45] <halfak> Looks like we can hit our queue maxsize easily with 128 uwsgi processes, so I'll leave it there
[18:20:04] <halfak> We might want to cut that in half for prod since we'll have two uwsgi servers.
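
The sleep() hack mentioned at 17:19 above would look roughly like this (a sketch with an assumed task name and delay, not the actual change):

```python
import time
from celery import Celery

app = Celery('ores', broker='redis://localhost:6379/0')

@app.task
def score(rev_id):
    # Artificial delay so the celery workers back up while uwsgi stays
    # responsive -- lets the queue actually fill and trip the back-pressure.
    time.sleep(10)   # arbitrary, illustrative delay
    return rev_id
```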

So, it looks like this patch works as expected.

Halfak renamed this task from "Limit queue sizes in Celery" to "Set up backpressure for ORES (Limit queue sizes in Celery)". Nov 20 2015, 2:38 PM
Halfak closed this task as Resolved.