Decision request - What to use for toolforge components api task execution
Closed, Resolved (Public)

Assigned To: dcaro

Authored By: dcaro, Apr 10 2024, 12:29 PM

Description

Problem

To be able to manage pipelines in the components api (build 1 + build 2 -> deploy component 1 -> deploy component 2 for example) we need some way to handle execution pipelines.
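As a rough illustration of the example above, the ordering can be treated as a small dependency graph. The sketch below uses Python's standard-library `graphlib`; the step names, and the assumption that both builds must finish before the first deploy starts, are illustrative and not part of the components API.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "build-1": set(),
    "build-2": set(),
    "deploy-component-1": {"build-1", "build-2"},
    "deploy-component-2": {"deploy-component-1"},
}


def execution_batches(graph):
    """Group steps into batches; every step in a batch has all its dependencies met."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps that could run in parallel
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = execution_batches(pipeline)
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

Whatever option is chosen would have to walk the steps in an order like this and stop when one of them fails.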

Constraints and risks

The pipelines don't need to be huge; a few dozen components would be the largest case.

Decision record

In progress

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Decision_record_T362224_What_to_use_for_toolforge_components_api_task_execution

Options

Option 1

No asynchronous task processing

Pros:

Easiest to implement

Cons:

If the request breaks, the whole pipeline fails

Option 2

After-request asynchronous task processing in the same process (e.g. https://fastapi.tiangolo.com/tutorial/background-tasks/)

Pros:

Easy to implement and set up (no extra components, services, etc. needed)

Cons:

If the service gets restarted (e.g. OOM, moving to a different worker), the pipeline breaks

Option 3

Using tekton pipelines

Pros:

Already used for build service
Good pipeline support (made for it)

Cons:

Relatively complex to set up
Pipelines written in Tekton YAML + custom image/shell script
Needs interfacing with Tekton (similar to what we do with the builds service)

Option 4

Using celery with redis

Pros:

Very common pattern
Good pipeline support (made for it)
Pipelines written in the same code as the service

Cons:

Needs a redis instance (we should not reuse the user-facing one for security reasons I think)

Option N

Add your option here!

Related Objects

Status	Assigned	Task
Resolved	LucasWerkmeister	T320140 Migrate wd-shex-infer from Toolforge GridEngine to Toolforge Kubernetes
Resolved	matmarex	T319707 Migrate dtcheck from Toolforge GridEngine to Toolforge Kubernetes
Resolved	Legoktm	T320062 Migrate steve-adder from Toolforge GridEngine to Toolforge Kubernetes
Resolved	Legoktm	T320011 Migrate rfa-voting-history from Toolforge GridEngine to Toolforge Kubernetes
Open	dcaro	T194332 [Epic,builds-api,components-api,webservice,jobs-api] Make Toolforge a proper platform as a service with push-to-deploy and build packs
Open	dcaro	T368600 [KR] WE6.3 Introduce a sustainability scoring system for the Toolforge platform
In Progress	dcaro	T375199 [Hypothesis] WE6.3.4 If we enable the automatic deployment of a minimal tool, we will be able to evaluate the end to end flow and set the groundwork for adding support for more complex tools and deployment flows.
In Progress	dcaro	T362051 [components-api] First iteration of the component API
Open	dcaro	T362072 [components-api] Add support for non-public services
Resolved	dcaro	T362076 [components-api] Add support for pre-built images (ex. python3.11, to refine)
Open	dcaro	T362075 [components-api] add one-off, scheduled and continuous jobs support to the yaml + api
Resolved	dcaro	T362224 Decision request - What to use for toolforge components api task execution

Event Timeline

dcaro created this task. Apr 10 2024, 12:29 PM

Restricted Application added a subscriber: Aklapper. Apr 10 2024, 12:29 PM

dcaro updated the task description. Apr 10 2024, 12:29 PM

dcaro triaged this task as High priority. Apr 10 2024, 12:37 PM

dcaro added a parent task: T362075: [components-api] add one-off, scheduled and continuous jobs support to the yaml + api.

dcaro mentioned this in T362075: [components-api] add one-off, scheduled and continuous jobs support to the yaml + api. Apr 10 2024, 12:52 PM

fnegri subscribed. Apr 11 2024, 3:22 PM

Slst2020 subscribed. Apr 11 2024, 3:23 PM

fnegri moved this task from Inbox to Discussion on the Cloud Services Proposals board. Apr 11 2024, 3:26 PM

I vote for Option 1, I think we should aim for the easiest solution in the first MVP of the component API.

I could be wrong, but I don't imagine having async processing is a requirement for most tools, see also my comment in T362075. Adding async processing could become a feature request that can be prioritized based on how many users are interested.

Would #1 mean having to keep the single HTTP request alive for the duration of the entire build + deploy? I'd prefer to avoid anything requiring a single HTTP request to stay alive for multiple minutes (or longer) :/

I would also avoid that, but I think Option 1 doesn't necessarily entail a long HTTP request, it could be a trigger+poll mechanism for example. I'm not even sure if in the first MVP we need an endpoint that builds all the components at once, or if we could have a separate trigger per each component.

In T362224#9726357, @fnegri wrote:

I would also avoid that, but I think Option 1 doesn't necessarily entail a long HTTP request, it could be a trigger+poll mechanism for example. I'm not even sure if in the first MVP we need an endpoint that builds all the components at once, or if we could have a separate trigger per each component.

It would mean one long HTTP request, yes; that's what I meant by synchronous.

By asynchronous (options 2, 3 and 4) I mean that the HTTP request starts a background task (in the same Python process for option 2, or in an asynchronous batch processing system for options 3 and 4) and returns an ID; you then poll with that ID to check the status of the background task.
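A minimal sketch of that "start a task, get an ID, poll for status" flow, using only the Python standard library. It is not the FastAPI mechanism itself, and every name in it (`trigger`, `poll`, `wait_for`, the status strings) is hypothetical. The status store lives in memory, so it shares the weakness listed for option 2: a restart loses both the running task and its status.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor

# In-memory status store (assumption): lost whenever the process restarts.
_statuses = {}
_executor = ThreadPoolExecutor(max_workers=2)


def _run_pipeline(task_id, steps):
    """Run each step in order and stop at the first failure."""
    _statuses[task_id] = "running"
    try:
        for step in steps:
            step()
        _statuses[task_id] = "successful"
    except Exception as error:
        _statuses[task_id] = f"failed: {error}"


def trigger(steps):
    """What the 'start' request would do: schedule the work, return an ID at once."""
    task_id = uuid.uuid4().hex
    _statuses[task_id] = "pending"
    _executor.submit(_run_pipeline, task_id, steps)
    return task_id


def poll(task_id):
    """What the 'status' request would do."""
    return _statuses.get(task_id, "unknown")


def wait_for(task_id, timeout=5.0):
    """Client-side polling loop: check until the task leaves pending/running."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = poll(task_id)
        if status not in ("pending", "running"):
            return status
        time.sleep(0.01)
    return poll(task_id)
```

A client would call `trigger(...)` once, keep the returned ID, and call `poll(...)` (or a loop like `wait_for`) until the status is final.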

Option 1 for an MVP, then iterate on it as needed.

It would mean one long HTTP request, yes; that's what I meant by synchronous.

I see, thanks for clarifying.

If we take the simple case of a tool with a single component that we need to build+deploy, is the idea that the component-api will call the builds-api, wait for the build to finish, then call the K8s API to deploy the new image?

There's indeed some complexity to keep track of the status of a given build. I don't like the idea of an API that takes minutes to respond to an HTTP call (not even in the MVP), so we need to store somewhere that a component-api operation is in progress, and be able to retrieve its status.

I'm starting to think that Option 3 (using Tekton) might make sense.

This is also related to T362069: [components-api] Get a skeleton of API webservice and implement `/tool/<toolname>/deploy` with single continuous job deployment only where you discussed the ID+polling implementation for component-api.

In T362224#9727358, @fnegri wrote:

It would mean one long HTTP request, yes; that's what I meant by synchronous.

I see, thanks for clarifying.

If we take the simple case of a tool with a single component that we need to build+deploy, is the idea that the component-api will call the builds-api, wait for the build to finish, then call the K8s API to deploy the new image?

Yep (using the jobs-api probably too).

There's indeed some complexity to keep track of the status of a given build. I don't like the idea of an API that takes minutes to respond to an HTTP call (not even in the MVP), so we need to store somewhere that a component-api operation is in progress, and be able to retrieve its status.

I'm starting to think that Option 3 (using Tekton) might make sense.

This is also related to T362069: [components-api] Get a skeleton of API webservice and implement `/tool/<toolname>/deploy` with single continuous job deployment only where you discussed the ID+polling implementation for component-api.

Yep, thanks for pointing it out, it's relevant. In that task we only do a build, so we don't really need the "pipeline" behavior (if the build passes, then start the job), and a one-off kind of request at the beginning is enough. This task might change that "give ID, then poll" mechanism, depending on what we choose.

Either option 2 or option 3 is fine for me.

With the assumptions that:

option 2 is maybe faster to implement today
option 3 is more robust, and more likely to result in a good choice in the long run

Moreover, I could be interested in playing with tekton myself if we go with option 3.

Per the FastAPI docs https://fastapi.tiangolo.com/tutorial/background-tasks/#caveat:

Caveat

If you need to perform heavy background computation and you don't necessarily need it to be run by the same process (for example, you don't need to share memory, variables, etc), you might benefit from using other bigger tools like Celery.

They tend to require more complex configurations, a message/job queue manager, like RabbitMQ or Redis, but they allow you to run background tasks in multiple processes, and especially, in multiple servers.

I think starting builds and deploying jobs/webservices may qualify as heavy background computation, in the sense that they can take minutes to complete.

Therefore I will drop my support for option 2, and go for option 3.

We had the decision meeting today, and the option chosen was to start with option 2 and then evolve to option 4 if that ends up not being enough.

For option 2 there is still the question of how to keep track of the status of the background tasks, which we will have to figure out (probably by updating the deployment model or similar).
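One way this status tracking could look, sketched under loud assumptions: the `Deployment` class, its fields and the status names below are hypothetical and not the actual components API model. The idea is that the background task updates a status field on the deployment record, and that on startup the service marks any deployment left unfinished by a restart as failed, so that polling clients do not wait forever.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class DeploymentStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCESSFUL = "successful"
    FAILED = "failed"


def _now():
    return datetime.now(timezone.utc)


@dataclass
class Deployment:
    """Hypothetical deployment record that the background task keeps up to date."""

    deploy_id: str
    tool_name: str
    status: DeploymentStatus = DeploymentStatus.PENDING
    message: str = ""
    updated_at: datetime = field(default_factory=_now)

    def set_status(self, status, message=""):
        self.status = status
        self.message = message
        self.updated_at = _now()


def fail_interrupted(deployments):
    """On service startup, mark deployments that a restart left unfinished as failed."""
    unfinished = (DeploymentStatus.PENDING, DeploymentStatus.RUNNING)
    interrupted = []
    for deployment in deployments:
        if deployment.status in unfinished:
            deployment.set_status(
                DeploymentStatus.FAILED, "interrupted by service restart"
            )
            interrupted.append(deployment.deploy_id)
    return interrupted
```

In a real service the records would need to live in persistent storage rather than in memory; where exactly is left open here, as it is in the discussion.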

I will write the decision request wiki page shortly.

dcaro closed this task as Resolved. Jul 15 2024, 12:39 PM

dcaro claimed this task.
