
Decision request – Toolforge (re)architecture
Closed, ResolvedPublic

Description

Problem

Over the last year, the Toolforge ecosystem has evolved beyond 'jobs' and 'webservice' to include the new build system, featuring a system for environment variables and secrets, a potential deploy subcommand, and more. Now might be a good time to take a step back and reassess the architectural foundation that supports these functionalities. As we continue to expand and add new features, it's crucial to ensure that our architecture is scalable, maintainable, and aligned with both our long-term vision and immediate operational needs.

References:
T342077: Toolforge beyond build service
Toolforge work group meeting notes

Decision Record

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Decision_record_T346153_Toolforge_(re)architecture

Risks & Constraints

...if the architecture _Doesn't Evolve_:

  • Increased Complexity: The architecture could become increasingly complex, making it harder to manage, maintain, and onboard new contributors.
  • Resource Inefficiency: The current architecture might require more resources for maintenance than a new, more efficient architecture, leading to wasteful allocation of engineering time.
  • Community Disengagement: The existing complexity may deter new contributions.

...if the architecture _Does Evolve_:

  • Implementation Challenges: Transitioning to a new architecture could be resource-intensive in engineering time and could meet resistance.
  • Backward Compatibility: Changes must consider the impact on existing services, posing a risk of breaking functionality or degrading the user experience.
  • Operational Overhead: Unexpected complexities in deployment and monitoring may arise, requiring more operational effort.

Options

Some pros and cons of each option have been listed. For further context and in-depth discussion of the finer points of the different options, see the references linked above.

Option 1

Backend(API gateway + several per-service APIs) + Client(Single codebase/monolith)
Pros:

  • Decoupling between frontend and backend through API gateway.
  • Simpler to improve user client experience with a single package
  • Client easier to do big contributions (all code together, shipped as one)
  • Backend easier to do small contributions (easier to understand/test/deploy just one small part)
  • Increased flexibility and scalability
  • Backend easy to move to a monolithic system
  • Backend easier for others to reuse outside toolforge

Cons:

  • Client hard to move to a decoupled system
  • Client harder to do small contributions (must test all flows for any change, must understand the whole system)
  • Backend harder to do big contributions (split repos, split deployments)
  • Potential for backend operational complexity due to multiple service APIs.
  • API gateway introduces an additional system to maintain
  • Backend split repos could end up in a high degree of code repetition and boilerplate.
  • Client harder for others to reuse outside toolforge
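
To make Option 1's backend shape concrete, here is a minimal sketch of a gateway dispatching requests by path prefix to independent per-service handlers. The service names and paths are purely illustrative, not the real Toolforge endpoints:

```python
# Hypothetical sketch of Option 1's backend: an API gateway routes each
# request to the per-service API owning its path prefix. Service names
# and routes are made up for illustration.

def jobs_api(path: str) -> str:
    # Stand-in for the jobs service API.
    return f"jobs-api handled {path}"

def builds_api(path: str) -> str:
    # Stand-in for the build service API.
    return f"builds-api handled {path}"

# The gateway only knows the prefix -> service mapping; each service
# behind it can be developed, tested and deployed independently.
ROUTES = {
    "/jobs/": jobs_api,
    "/builds/": builds_api,
}

def gateway(path: str) -> str:
    for prefix, handler in ROUTES.items():
        if path.startswith(prefix):
            return handler(path)
    return f"404: no service owns {path}"

print(gateway("/jobs/run"))      # jobs-api handled /jobs/run
print(gateway("/builds/start"))  # builds-api handled /builds/start
```

Adding or removing a backend service is then just an entry in the routing table, which is the "easier to extend" pro discussed below.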

Option 2

Backend(Single API service with all services in it) + Client(Single codebase/monolith)
Pros:

  • Highly integrated and simplified operation/deployment
  • Simpler to improve user client experience with a single package
  • All easier to do big contributions (all code together, shipped as one)
  • Remove API gateway system (functionality must be re-written in the API though)

Cons:

  • All hard to move to a decoupled system
  • All harder to do small contributions (must test all flows for any change, must understand the whole system)
  • Reduced flexibility and scalability
  • Tight coupling could make future changes more challenging
  • All harder for others to reuse outside toolforge

Option 3

Backend(API gateway + per-service APIs) + Client(Per-service codebases) - Status Quo
Pros:

  • Decoupling between frontend and backend through API gateway.
  • Existing familiarity and no immediate changes required
  • High degree of decoupling allows for services to be managed independently from development to deployment
  • All easier to do small contributions (easier to understand/test/deploy just one small service)
  • All easy to move to a monolithic system
  • All easier for others to reuse outside toolforge

Cons:

  • All harder to do big contributions (split repos, split deployments)
  • More complex to improve client experience with a single package
  • Potential for backend operational complexity due to multiple service APIs.
  • Harder to reason about the system as a whole
  • All split repos could end up in a high degree of code repetition and boilerplate.

Option N

  • Add your option here!

Note for Future Decisions

Specifics such as the "slim/smart" nature of the API gateway and inter-service communications, whether to move the CLIs to Go, etc., will be left for a second round of decisions. This allows us to focus on immediate architectural choices first.

Questions for Consideration

  • How do each of these options align with our pre-defined goals and criteria?
  • What impact will the chosen architecture have on team communication and development practices?

Criteria for Evaluation

Non-exhaustive list, in no particular order. Feel free to add your own.

  • Reducing Complexity: How well does the option simplify the overall architecture and make it easier to manage and reason about?
  • Scalability: How well can the architecture handle increasing user demand and feature expansion?
  • Maintainability: What level of effort is required to maintain the system, including bug fixes, updates, and adding new features?
  • Team Alignment: Does the architecture align with the team's structure and practices, allowing for effective collaboration and contribution?
  • Ease of Installation: How straightforward is it to install the system, both for developers and end-users?
  • Ease of Contribution: How accessible is the system for new contributors?
  • User Experience: Does the architecture facilitate a seamless and efficient experience for the end-users?
  • Operational Complexity: What is the impact of the architecture on deployment, monitoring, and logging?
  • Iterative Development: How well does the architecture support iterative development and adjustments over time?

Event Timeline

dcaro changed the task status from Open to In Progress. Sep 12 2023, 3:59 PM
dcaro moved this task from Next Up to In Progress on the Toolforge Build Service (Iteration 19) board.

Great start! Some comments on the pros/cons listed

For option 1:

Decoupling between frontend and backend through API gateway.

This applies to all the solutions in the list

Simplified client experience with a single codebase

Is this for the user or the developer?
For the user, it's unrelated, as building one package with all the clients gives the same experience as many codebases.
For the developer, it's nuanced: development and testing are easier (you only focus on the specific service, which is easier to deploy, test, understand and debug), though releasing is more complex.
For the contributor, it's easier, as they only have to understand a single API of the whole setup, test fewer flows, and don't need to deploy the whole setup locally to test a change.

Easier to manage and reason about

Compared to what? The current state? The other options?
The complexity of the service is still there (jobs, builds, deploy, webservice, admission controllers, webhooks, harbor, tekton, ...), the simplification comes potentially from the build+deployment of the client.

For example, I consider a bunch of smaller focused APIs way easier to contribute to than a big monolith, as I don't have to learn about the whole system, but focus on only one small part (ex. envvars api), where there's a smaller amount of code, little to no coupling with other APIs, easier to test and deploy locally, and easier to deploy and test live (you only have to test the envvars, not everything every time you deploy).

The high degree of repetition is "easily" solvable with libraries and shared code (ex. monorepo), and it's not inherent to having per-service APIs rather than one big monolith. You can have a monolith and still have a high degree of repetition.
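
As a sketch of that shared-code point (all names here are invented for illustration): the boilerplate that per-service clients would otherwise each repeat can be factored into one shared helper, whether the services live in one repo or many:

```python
# Illustrative only: a shared helper that factors out per-service client
# boilerplate (prefix handling, shared error type), so split codebases
# need not duplicate it.

class ToolforgeClientError(Exception):
    """Hypothetical shared error type for all per-service clients."""

def make_service_client(prefix: str):
    """Return a request function bound to one service's URL prefix."""
    def request(endpoint: str) -> str:
        if not endpoint:
            raise ToolforgeClientError("empty endpoint")
        # A real helper would perform the HTTP call and shared error
        # handling here; this sketch just builds the request line.
        return f"GET {prefix}{endpoint}"
    return request

# Each per-service client becomes a thin wrapper over the shared helper.
jobs = make_service_client("/jobs/")
envvars = make_service_client("/envvars/")

print(jobs("list"))     # GET /jobs/list
print(envvars("list"))  # GET /envvars/list
```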

I think it's also missing a couple of points:

  • con: coupled deployment, upgrading one client means upgrading all of them
  • con: coupled testing, changing one client means that we have to test all of them (as they change together)
  • con: harder to reason about each separated component (you need to know the whole)
  • con: harder to do smaller contributions (you need to know the whole)
  • pro: easier to extend, add and remove new services on the backend

For option 2:

Highly integrated and simplified architecture

I'd point out that this is at the deployment level; at the code/service/flow level it still has all the same complexity (jobs, builds, emailer, auth, tekton, deploy, envvars, ...). I think the phrasing used in option 3 might be nicer, like "Low complexity in orchestration and management".

Easier for newcomers and lowers the barrier for contributions

This would only apply to team members who want to help support the whole system; for external contributors or small point-in-time contributions, splitting the code into cohesive units simplifies contributing to it (and of course, with that comes testing, deploying, etc.).
Something like "Understanding the whole system is easier; understanding only a part of it is more complicated".

That's why there's the con "Tight coupling could make future changes more challenging", because changing a part becomes harder without changing the whole.

As in option 1, I'd add the cons:

  • con: coupled deployment, releasing one client means upgrading all of them
  • con: coupled testing, changing one client means that we have to test all of them (as they change together)
  • con: harder to reason about each separated component (you need to know the whole)
  • con: harder to do smaller contributions (you need to know the whole)
  • con: harder to extend the backend and the client

For option 3:

Allows for specialized focus on different services

I would also say "simplified": it's easier to contribute to a smaller part without having to know the whole, easier to deploy a part, and easier to test a part.

Harder to onboard new contributors

As before, this only applies if they want to contribute to the whole system; it's easier to do smaller contributions, as each smaller service is easier to understand, test and deploy than the whole.
I might rephrase it as "harder to do big contributions (harder to know the whole)"

High degree of code repetition and boilerplate across different repositories

I think that this can be solved in many ways, like using a monorepo and/or shared tooling.

I would add to the pros:

  • pro: It's easier to move from this option to the others than from the others to this one (joining the services is easier than splitting the monolith).
  • pro: easy to do small contributions and reason about each component
  • pro: easier to extend, add and remove new services both client and backend

Maybe we should try to use the list of criteria that you added at the bottom, like:

Option 1

Reducing Complexity: 0

  • Hard to reason about the whole system
  • Easy to reason about a single API (though the client is bound to all, so a change in the client requires knowing the whole)

Scalability: +1

  • pro: Easy to scale different parts of the system

Maintainability: 0

  • con: Hard to add new features and services on the client
  • pro: Easy to add new features and services on the backend
  • con: Hard to deploy client fixes (you need to test all the client flows)
  • pro: Easy to deploy backend fixes (you need to test only one service)
  • pro: Easier to debug the client (single codebase)
  • con: Hard to debug the backend (distributed)

Team Alignment: +2

  • pro: Allows for smaller individual groups to work in parallel on the backend
  • con: Does not allow for individual groups to work in parallel on the cli
  • pro: Allows for a single group to work closely together in the backend
  • pro: Allows for a single group to work closely together in the cli

Ease of Installation (note that we don't have end-users yet, so imo the client side should have less weight): 0

  • con: Hard to deploy the backend (many services/deployments)
  • pro: Easy to deploy the client (single package)

Ease of Contribution: 0

  • con: Harder to get started on the client, as you have to know all the client interactions with all the services
  • pro: Easier to get started on the backend, as you only need to understand one of the subservices
  • pro: Easier to do big contributions on the client, as there's only one single codebase (I guess?)
  • con: Harder to do big contributions on the backend, as there's many services

User Experience: +3

  • pro: Single API interface
  • pro: Single client interface
  • pro: Single client package

Operational Complexity: +2

  • pro: Easy client release package-wise (only one package/version to release)
  • con: Complicates the testing of the client side (have to test all the flows)
  • con: Complicated whole backend release (many services)
  • con: Hard to test the whole backend release
  • pro: Easier single-service backend releases (single service)
  • pro: Easier to test a single service (don't need to test the whole system, just the deployed unit)
  • pro: Reliable backend deployment (minimized the impact of failures to a single subservice)
  • pro: Backend easier to scale
  • con: Backend harder to debug (distributed system)
  • pro: Client easier to debug

Iterative Development: 0

  • Easy to add/remove or modify small pieces of the backend, as only a single subservice would be affected
  • Harder to add/remove or modify small pieces of the client, as all the flows will have to be tested

Then we can summarize and compare:

Option 2 (following the same points as above)

  • Reducing Complexity: 0
  • Scalability: -1
  • Maintainability: -2
  • Team Alignment: 0
  • Ease of Installation: +2
  • Ease of Contribution: 0
  • User Experience: +3
  • Operational Complexity: +1
  • Iterative Development: -2

Etc.

@dcaro Your detailed breakdown of the pros and cons for each architectural option adds a lot of depth to this discussion. I particularly appreciate how you've considered the implications from multiple angles—user, developer, and contributor. I agree that the pros and cons are not one-size-fits-all and can vary depending on the perspective one is coming from.

To me, this stage of the decision making process is about gathering as many diverse perspectives as possible. In my mind, the primary function of listing pros and cons and some decision criteria is not so that we can form a definite consensus about what these are, but rather as an aid in thinking about the different options and getting the conversation going. One person’s cons may very well be considered neutral or even positive by someone else. We all have different viewpoints and preferences, so it's unlikely we'll agree on exact point-values for each criterion or option.

This is a collaborative effort, so feel free to make any edits you believe would improve the DR description.

I don't want to step on anyone's toes, and I'd recommend you be the one making those changes to avoid that, but I'll do as you ask :)

I'd go for option 3, with a focus on getting a unified openapi definition at the api-gateway level, at which point we can switch to option 1, generating the client from it.

This makes entry-level contributions simpler (no need to understand the whole), and points towards a direction in which users would be able to install a single binary without increasing the maintenance burden on us (by generating the client).
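
As a hedged illustration of what a unified definition at the gateway level could look like (the service names, paths and titles below are hypothetical, not the actual Toolforge API), per-service OpenAPI fragments could be aggregated into one document, from which a single client is then generated:

```python
# Illustrative sketch: aggregate per-service OpenAPI fragments into one
# unified spec at the gateway. Paths and summaries are invented.

jobs_spec = {
    "paths": {
        "/jobs/": {"get": {"summary": "List jobs"}},
        "/jobs/{name}": {"delete": {"summary": "Delete a job"}},
    }
}
envvars_spec = {
    "paths": {
        "/envvars/": {"get": {"summary": "List environment variables"}},
    }
}

def merge_specs(*specs):
    """Combine per-service specs into one OpenAPI document, refusing
    duplicate paths so services cannot shadow each other."""
    unified = {
        "openapi": "3.0.0",
        "info": {"title": "Toolforge API (illustrative)", "version": "0.1.0"},
        "paths": {},
    }
    for spec in specs:
        for path, ops in spec["paths"].items():
            if path in unified["paths"]:
                raise ValueError(f"duplicate path: {path}")
            unified["paths"][path] = ops
    return unified

unified = merge_specs(jobs_spec, envvars_spec)
print(sorted(unified["paths"]))  # ['/envvars/', '/jobs/', '/jobs/{name}']
```

A standard client generator could then consume the merged document, keeping the backends separate while shipping one client.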

I think I have a slight preference for option 1, as it seems a good intermediate goal between option 3 (status quo) and option 2 (which involves refactoring both the API and the client).

What @dcaro is suggesting is slightly different from the status quo, because it involves creating a unified OpenAPI definition, so we might call it "Option 4"? This would be my second choice.

point at which we can switch to option 2

Perhaps you meant "switch to option 1"? If I understand correctly, you're suggesting that after we implement the OpenAPI definition at the gateway level, we could then generate a unified client, while we keep separate API backends. This sounds more like option 1 than option 2, but maybe I'm misreading your suggestion.

Oh yes, sorry, I meant the single client + API gateway + several backends option; I edited the comment so it does not confuse anyone.

I lean towards Option 1 as my top choice, although transitioning from Option 3 to Option 1 also seems like an okay path. Regarding the ease of entry-level contributions in a microservices versus monolith setup, I think it depends. A well-structured and modular monolith, even if large, can be quite approachable for new contributors while also helping understand how different parts of a larger system fit together, avoid some boilerplate, etc.

For the client, I think one important condition is that we eventually want people to be able to install it locally, so a single binary in a compiled language makes sense to me. For the APIs it's mostly about the complexity tradeoff for me (lots of things to maintain and deploy vs. one really complicated thing to maintain and deploy), and I feel like, considering all of the features Toolforge has (and will get in the next few years), the ability to manage different services in different ways behind the API gateway is worth the extra complexity. Thus I !vote for option 1, although I think having a unified API definition/documentation is something worth exploring.

I would point out that having a monorepo and having a monolith are different things: you can still have all the code in the same repo (shared boilerplate, etc.) but not have a monolith. The discussion about having one repo or many was not included in this request.

Agree that a monorepo does not imply a monolith. My point (or assumption rather) was that if we're considering consolidating the CLIs into a monolithic structure, it would make sense to also keep them in a single repository, which is something I think simplifies management and reduces complexity, benefitting everyone but especially new contributors. My original comment does conflate the two terms a bit though, so thanks for clarifying.

Slst2020 updated the task description.