Decision request - What to use for toolforge components api task execution
Closed, Resolved · Public

Description

Problem

To be able to manage pipelines in the components API (for example: build 1 + build 2 -> deploy component 1 -> deploy component 2), we need some way to handle pipeline execution.

Constraints and risks

  • The pipelines don't need to be huge; a few dozen components would be the biggest case

Decision record

In progress

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Decision_record_T362224_What_to_use_for_toolforge_components_api_task_execution

Options

Option 1

No asynchronous task processing

Pros:

  • Easiest to implement

Cons:

  • If the request breaks, the whole pipeline fails
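
For illustration only, a minimal sketch of what option 1 could look like (assuming FastAPI and hypothetical build_component/deploy_component helpers; nothing here is existing components-api code):

```python
# Minimal sketch of option 1: everything happens inside the request handler,
# so the HTTP call stays open until build + deploy finish, and an exception
# anywhere fails the whole pipeline. All helper names are hypothetical.
from fastapi import FastAPI

app = FastAPI()


def build_component(toolname: str) -> str:
    """Placeholder: call the builds-api and wait for the resulting image."""
    return f"example-registry/{toolname}:latest"


def deploy_component(toolname: str, image: str) -> None:
    """Placeholder: call the jobs-api / k8s API with the new image."""


@app.post("/tool/{toolname}/deploy")
def deploy(toolname: str) -> dict:
    image = build_component(toolname)   # may take minutes
    deploy_component(toolname, image)   # if this raises, the whole request fails
    return {"status": "done"}
```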

Option 2

After-request asynchronous task processing on the same thread (e.g. https://fastapi.tiangolo.com/tutorial/background-tasks/; see the sketch below)

Pros:

  • Easy to implement and set up (no extra components/services/etc. needed)

Cons:

  • If the service gets restarted (e.g. OOM, getting moved to a different worker), the pipeline breaks
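
As a rough illustration of the pattern from the linked docs (the run_pipeline helper is made up, not an existing components-api function):

```python
# Sketch of option 2 with FastAPI's BackgroundTasks: the task is queued in the
# same Python process and runs after the HTTP response has been sent, so a
# restart or OOM kill of that worker loses it.
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()


def run_pipeline(toolname: str) -> None:
    """Placeholder for the real build + deploy steps."""


@app.post("/tool/{toolname}/deploy")
def trigger_deploy(toolname: str, background_tasks: BackgroundTasks) -> dict:
    background_tasks.add_task(run_pipeline, toolname)
    # The response is returned immediately; the pipeline keeps running in-process.
    return {"status": "accepted"}
```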

Option 3

Using Tekton pipelines

Pros:

  • Already used for build service
  • Good pipeline support (made for it)

Cons:

  • Relatively complex to set up
  • Pipelines written in Tekton YAML plus a custom image/shell script
  • Needs interfacing with Tekton (similar to what we do with the builds service; see the sketch below)
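
A hedged sketch of what that interfacing could look like from Python using the official kubernetes client; the Pipeline name, namespace, parameters and API version here are made-up examples, not what the builds service actually does:

```python
# Create a Tekton PipelineRun through the Kubernetes custom objects API,
# assuming a pre-existing Pipeline called "components-deploy" and the
# tekton.dev/v1 API version (both are assumptions for this sketch).
from kubernetes import client, config

config.load_incluster_config()
custom_api = client.CustomObjectsApi()

pipeline_run = {
    "apiVersion": "tekton.dev/v1",
    "kind": "PipelineRun",
    "metadata": {"generateName": "components-deploy-"},
    "spec": {
        "pipelineRef": {"name": "components-deploy"},
        "params": [{"name": "toolname", "value": "some-tool"}],
    },
}

created = custom_api.create_namespaced_custom_object(
    group="tekton.dev",
    version="v1",
    namespace="tool-some-tool",
    plural="pipelineruns",
    body=pipeline_run,
)
# The generated PipelineRun name can be returned to the client, and its status
# conditions polled later with get_namespaced_custom_object().
```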

Option 4

Using Celery with Redis

Pros:

  • Very common pattern
  • Good pipeline support (made for it)
  • Pipelines are written in the same code as the service (see the chain sketch below)

Cons:

  • Needs a Redis instance (we should not reuse the user-facing one, for security reasons I think)
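
For comparison, a sketch of what option 4 could look like with Celery's chain primitive; the broker URL, task bodies and names are placeholders, not existing infrastructure:

```python
# Celery app backed by Redis, with the pipeline expressed as a chain of tasks:
# the result of build() is passed as the first argument of deploy().
from celery import Celery, chain

app = Celery("components_api", broker="redis://components-redis:6379/0")


@app.task
def build(component: str) -> str:
    """Placeholder: call the builds-api and return the built image name."""
    return f"example-registry/{component}:latest"


@app.task
def deploy(image: str) -> None:
    """Placeholder: call the jobs-api / k8s API to deploy the image."""


# Build component-1, then deploy whatever image the build step returned.
result = chain(build.s("component-1"), deploy.s()).apply_async()
# result.id can be handed back to the HTTP client for polling.
```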

Option N

Add your option here!

Related Objects

Event Timeline

I vote for Option 1; I think we should aim for the easiest solution in the first MVP of the component API.

I could be wrong, but I don't imagine having async processing is a requirement for most tools; see also my comment in T362075. Adding async processing could become a feature request that can be prioritized based on how many users are interested.

Would #1 mean having to keep the single HTTP request alive for the duration of the entire build + deploy? I'd prefer to avoid anything requiring a single HTTP request to stay alive for multiple minutes (or longer) :/

I would also avoid that, but I think Option 1 doesn't necessarily entail a long HTTP request; it could be a trigger+poll mechanism, for example. I'm not even sure if in the first MVP we need an endpoint that builds all the components at once, or if we could have a separate trigger for each component.

> I would also avoid that, but I think Option 1 doesn't necessarily entail a long HTTP request; it could be a trigger+poll mechanism, for example. I'm not even sure if in the first MVP we need an endpoint that builds all the components at once, or if we could have a separate trigger for each component.

It would mean one long HTTP request, yes; that's what I meant by synchronous.

By asynchronous (options 2, 3, 4) I mean that with the HTTP request you start a background task (using the same Python process for option 2, or an asynchronous batch processing system for options 3 and 4) and get an ID back, and then by polling you check on the status of the background task. A sketch of that flow is below.
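
Something like the following, where an in-memory dict stands in for wherever the status would actually be stored and all names are illustrative:

```python
# Trigger + poll flow: POST starts a background task and returns an ID,
# GET polls the status of that task until it finishes.
import uuid

from fastapi import BackgroundTasks, FastAPI, HTTPException

app = FastAPI()
task_status: dict[str, str] = {}


def run_pipeline(task_id: str) -> None:
    task_status[task_id] = "running"
    # ... build + deploy steps would go here ...
    task_status[task_id] = "succeeded"


@app.post("/tool/{toolname}/deploy")
def start_deploy(toolname: str, background_tasks: BackgroundTasks) -> dict:
    task_id = str(uuid.uuid4())
    task_status[task_id] = "pending"
    background_tasks.add_task(run_pipeline, task_id)
    # The request returns immediately with an ID...
    return {"task_id": task_id}


@app.get("/tasks/{task_id}")
def get_status(task_id: str) -> dict:
    # ...and the client polls this endpoint until the task is done.
    if task_id not in task_status:
        raise HTTPException(status_code=404, detail="unknown task")
    return {"task_id": task_id, "status": task_status[task_id]}
```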

Option 1 for an MVP, then iterate on it as needed.

> It would mean one long HTTP request, yes; that's what I meant by synchronous.

I see, thanks for clarifying.

If we take the simple case of a tool with a single component that we need to build+deploy, is the idea that the component-api will call the builds-api, wait for the build to finish, then call the K8s API to deploy the new image?

There's indeed some complexity to keep track of the status of a given build. I don't like the idea of an API that takes minutes to respond to an HTTP call (not even in the MVP), so we need to store somewhere that a component-api operation is in progress, and be able to retrieve its status.

I'm starting to think that Option 3 (using Tekton) might make sense.

This is also related to T362069: [components-api] Get a skeleton of API webservice and implement `/tool/<toolname>/deploy` with single continuous job deployment only, where you discussed the ID+polling implementation for the component-api.

>> It would mean one long HTTP request, yes; that's what I meant by synchronous.

> I see, thanks for clarifying.

> If we take the simple case of a tool with a single component that we need to build+deploy, is the idea that the component-api will call the builds-api, wait for the build to finish, then call the K8s API to deploy the new image?

Yep (using the jobs-api probably too).

> There's indeed some complexity to keep track of the status of a given build. I don't like the idea of an API that takes minutes to respond to an HTTP call (not even in the MVP), so we need to store somewhere that a component-api operation is in progress, and be able to retrieve its status.

> I'm starting to think that Option 3 (using Tekton) might make sense.

> This is also related to T362069: [components-api] Get a skeleton of API webservice and implement `/tool/<toolname>/deploy` with single continuous job deployment only, where you discussed the ID+polling implementation for the component-api.

Yep, thanks for pointing it out, it's relevant. In that task, as we only do a build, we don't really need the "pipeline" behavior (if the build passes, then start the job), so we can get by with a one-off kind of request at the beginning. This task might change that "give ID then poll" mechanism, depending on what we choose.

I think either option 2 or 3 is fine for me.

With the assumptions that:

  • option 2 is maybe faster to implement today
  • option 3 is more robust, and more likely to result in a good choice in the long run

Moreover, I could be interested in playing with Tekton myself if we go with option 3.

Per the FastAPI docs, https://fastapi.tiangolo.com/tutorial/background-tasks/#caveat:

> Caveat
>
> If you need to perform heavy background computation and you don't necessarily need it to be run by the same process (for example, you don't need to share memory, variables, etc), you might benefit from using other bigger tools like Celery.
>
> They tend to require more complex configurations, a message/job queue manager, like RabbitMQ or Redis, but they allow you to run background tasks in multiple processes, and especially, in multiple servers.

I think starting builds and deploying jobs/webservices may qualify as heavy background computation, in the sense that it can take minutes to complete.

Therefore I will drop my support for option 2, and go for option 3.

We had the decision meeting today, and the option chosen was to start with option 2 and then evolve to option 4 if that ends up not being enough.

For option 2 there's still the question, which we will have to figure out, of how to keep track of the status of the background tasks (probably by updating the deployment model or similar); a rough sketch is below.
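
A rough idea of what that status tracking could look like, with a made-up Deployment model and persistence helper (none of this exists yet):

```python
# The background task updates a persisted Deployment record at each step, so
# the status can be read by other processes and survives worker restarts
# (even though the in-flight task itself does not).
from dataclasses import dataclass


@dataclass
class Deployment:
    id: str
    status: str = "pending"  # pending -> building -> deploying -> done / failed


def save_deployment(deployment: Deployment) -> None:
    """Placeholder: persist the record in the components-api database."""


def run_pipeline(deployment: Deployment) -> None:
    try:
        deployment.status = "building"
        save_deployment(deployment)
        # ... call the builds-api and wait for the build ...
        deployment.status = "deploying"
        save_deployment(deployment)
        # ... call the jobs-api / k8s API ...
        deployment.status = "done"
    except Exception:
        deployment.status = "failed"
    finally:
        save_deployment(deployment)
```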

I will write the decision request wiki page shortly.

dcaro claimed this task.