
[Spike] Implement & test dependent tasks in Celery
Closed, Resolved · Public

Description

This task is done when there is a toy implementation of dependent tasks using Celery.

Event Timeline

See repo here: https://github.com/halfak/dependent_task_testing

I created some test output here that demonstrates the whole system working together: https://gist.github.com/halfak/277975af5de6153719e1985d06210a23

I tested it with 4 requesters running in sequence (like precached) and 3 requesters running in shuffled order (like random score requests) and found no deadlocks.

@schana, any notes on the implementation?

@Halfak, how is this going to share/get the extracted features between requests? It seems that this is just another caching area for results. I think I'm also against the idea of combining functionality into a super-task, instead of decomposing the different parts into their own, independent units (what I was trying to do with the data flow diagram).

The "supertask" is where the sharing takes place. Note that multiple models are applied in the "score_many_models".
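To make the sharing concrete, here's a toy sketch of the idea (hypothetical names; the real cache is ORES' feature-extraction machinery): features are extracted once into a cache, and every model for that revision reads from it.

```python
class FeatureCache(dict):
    """Toy stand-in for the feature-extraction cache."""

def extract_features(rev_id, cache):
    # Stand-in for real extraction; memoizes into the shared cache.
    cache.setdefault("byte_len", len(str(rev_id)) * 10)
    return cache

def score_many_models(rev_id, models):
    """Apply several models to one revision, sharing one feature cache."""
    cache = extract_features(rev_id, FeatureCache())
    return {name: model(cache) for name, model in models.items()}

models = {
    "damaging": lambda f: f["byte_len"] > 20,
    "goodfaith": lambda f: f["byte_len"] <= 100,
}
print(score_many_models(1234, models))  # {'damaging': True, 'goodfaith': True}
```

The point is that the cache lives only for the duration of one supertask; nothing is shared between requests.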

> I'm also against the idea of combining functionality into a super-task

Well, the supertask isn't really combined functionality; rather, its structure reflects how the computation most naturally flows (applying multiple models in sequence against the same cache).

Oh! I just realized that you might be imagining something different from me. I don't want to cache features between requests. Instead, I want to re-use the feature-extraction-cache to generate multiple scores for a single revision.

> I don't want to cache features between requests.

Why not?

> Instead, I want to re-use the feature-extraction-cache to generate multiple scores for a single revision.

I think a low-level object like this cache should be fully encapsulated within a task. I also think breaking up the tasks by units of work would help simplify the code base; the scoring processor seems to have a lot of responsibilities.

> Why not?

Because it would be very complicated given the dependency injection system used to score revisions. We'd need to cache whole dependency trees -- not just feature values -- so a lot of data would need to be stored. Further, it's an entirely separate task from the one that inspired this one.
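A toy sketch of why that's expensive (hypothetical names; ORES' real dependency solver is more involved): a feature is the leaf of a dependency tree, and resolving it memoizes every intermediate node, so caching "features" across requests would really mean persisting the whole tree.

```python
class Dependent:
    """A node in a toy dependency tree: a value computed from other nodes."""
    def __init__(self, name, process, depends_on=()):
        self.name = name
        self.process = process
        self.depends_on = depends_on

def solve(dependent, cache):
    """Recursively resolve a dependent's value, memoizing the whole tree."""
    if dependent.name in cache:
        return cache[dependent.name]
    args = [solve(d, cache) for d in dependent.depends_on]
    value = dependent.process(*args)
    cache[dependent.name] = value
    return value

# text -> tokens -> token_count: the feature depends on intermediates.
text = Dependent("text", lambda: "a b c")
tokens = Dependent("tokens", lambda t: t.split(), [text])
token_count = Dependent("token_count", len, [tokens])

cache = {}
print(solve(token_count, cache))  # 3
# The cache now holds every node of the tree, not just the feature value:
print(sorted(cache))  # ['text', 'token_count', 'tokens']
```

Within a single request that memoization is exactly what we want; persisting it between requests would mean storing (and invalidating) all of those intermediates per revision.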

> breaking up the tasks by units of work would help simplify the code base

You keep saying this. This is the way that ORES currently works. We have a *problem* with the task units splitting data that ought not to be split. What, exactly, is the problem with performing all computations relevant to a chunk of data in a single task?

Which responsibilities do you think are inappropriate for ScoreProcessor?