
Toolforge beyond build service
Closed, Resolved · Public

Description

Back when we first discussed moving the Toolforge CLI to an API architecture, there was a decision request about this, and another discussing what programming languages we should use for different components. However, the overarching architecture of a "Toolforge 2.0" wasn't significantly explored, perhaps due to our focus on the build service beta.

Recently, our efforts have expanded beyond the new build system, including a system for environment variables and secrets, a potential deploy subcommand, and more. A pattern is emerging, with each new subsystem being developed as an independent CLI-API pair. I’d like to argue for a simplified architecture that combines the benefits of a backend powered by microservices with the simplicity of a monolithic frontend. In the foreseeable future, this might extend to multiple frontends, if we aim to create a Toolforge UI.

The (simplified) diagram below shows how I imagine this might look:

Toolforge.png (attached architecture diagram, 2 MB)

CLI

  • One unified, user-facing CLI presenting a monolithic frontend, which would eventually absorb all the other pre-build-service CLIs (jobs, webservice)
  • A single package/binary
  • Should follow good practices like modularity and separation of concerns, with the codebase organized so that different functionalities live in their own modules or packages. All commands related to building might be in one module, all commands related to deploying web applications in another, and so on. This would allow for easier development, testing, and maintenance (see the sketch below)
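
To make the modular structure concrete, here is a minimal sketch of what subcommand registration could look like. This assumes Go and the spf13/cobra library purely for illustration; all package, command, and function names are hypothetical, not actual Toolforge code:

```go
// Hypothetical sketch of a unified "toolforge" CLI where each subsystem
// contributes its own subcommand tree. In a real codebase newBuildCmd and
// newWebserviceCmd would live in separate packages (e.g. internal/build,
// internal/webservice).
package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

func newBuildCmd() *cobra.Command {
	return &cobra.Command{
		Use:   "build",
		Short: "Build a tool image from source",
		RunE: func(cmd *cobra.Command, args []string) error {
			fmt.Println("starting build...") // would call the backend API here
			return nil
		},
	}
}

func newWebserviceCmd() *cobra.Command {
	return &cobra.Command{
		Use:   "webservice",
		Short: "Manage the tool's webservice",
	}
}

func main() {
	root := &cobra.Command{Use: "toolforge"}
	// Each subsystem registers its own command tree on the shared root.
	root.AddCommand(newBuildCmd(), newWebserviceCmd())
	if err := root.Execute(); err != nil {
		os.Exit(1)
	}
}
```

Each subsystem module would only expose a constructor for its command tree, keeping build, webservice, and jobs code independently testable while still shipping one binary.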

Gateway API

  • Acts as a single entrypoint for client-side applications (currently only the CLI, but why not also a UI in the future?)
  • Decouples the client-side applications from the backend microservices, delegating any business logic to them
  • Small and focused codebase, mainly dealing with request routing, applying cross-cutting concerns, and sometimes aggregating responses from downstream services (see the routing sketch below)
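
As a minimal sketch of how thin such a gateway can stay, here is what pure path-based routing could look like in Go; the internal service addresses are invented for illustration, and a real gateway would also attach auth, rate limiting, and logging middleware here:

```go
// Illustrative path-based gateway: each URL prefix is reverse-proxied
// to its backing microservice. Hostnames are hypothetical.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func proxyTo(rawURL string) http.Handler {
	target, err := url.Parse(rawURL)
	if err != nil {
		log.Fatal(err)
	}
	return httputil.NewSingleHostReverseProxy(target)
}

func main() {
	mux := http.NewServeMux()
	// Route each path prefix to its backing microservice.
	mux.Handle("/builds/", proxyTo("http://buildservice.internal:8080"))
	mux.Handle("/jobs/", proxyTo("http://jobservice.internal:8080"))
	mux.Handle("/envvars/", proxyTo("http://envvars.internal:8080"))
	log.Fatal(http.ListenAndServe(":8000", mux))
}
```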

Some benefits of this design:

  • Simplicity for Clients: Clients can treat the API gateway as a single point of interaction, without needing to know the details of the microservices architecture behind it.
  • Cross-Cutting Concerns: The API gateway can handle things like authentication, rate limiting, request logging, etc., which reduces duplication since these things would otherwise need to be handled in each microservice.
  • Isolation of Microservices: The API gateway can help protect the microservices by validating requests before passing them on, ensuring that only valid, authorized requests reach the microservices.
  • Aggregation of Responses: If a client request needs data from multiple microservices, the API gateway can call all the necessary services and aggregate the responses into one. For instance, if we implement toolforge deploy, we’d need to make calls to both the build and the webservice microservices.
  • Routing and Versioning: The API gateway can handle the routing of requests to different versions of microservices, or to different instances for load balancing and fault tolerance.

In this setup, each of our subsystems (build, webservice, jobs framework, etc.) would be independent microservices, each with its own separate responsibility, just as they are now. The API gateway would route requests from the clients to the appropriate service. This allows each system to be developed, deployed, scaled, and updated independently, while the clients only need to interact with the main API gateway. All the CLIs would be consolidated into one for a simplified development and distribution experience.

Tl;dr: I’m advocating for a monolithic (but modular) Toolforge CLI, and the introduction of a Gateway API to deal with the necessary decoupling of the client-side applications from the backend microservices.

Event Timeline

> Gateway API

This already exists: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/APIs#API_gateway. And there is T332476: Toolforge: expose API gateway to the internet for making it available and usable for anything that's not on the bastions.

>> Gateway API
>
> This already exists: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/APIs#API_gateway. And there is T332476: Toolforge: expose API gateway to the internet for making it available and usable for anything that's not on the bastions.

Even better. The proposal is then to use this as the sole entrypoint for all Toolforge client-side applications, present and future.

>>> Gateway API
>>
>> This already exists: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/APIs#API_gateway. And there is T332476: Toolforge: expose API gateway to the internet for making it available and usable for anything that's not on the bastions.
>
> Even better. The proposal is then to use this as the sole entrypoint for all Toolforge client-side applications, present and future.

This is also the case, with the exception of the CLIs that interact directly with the k8s API :)

This is great, thanks!

Can you elaborate on the API gateway backend aggregation?

Would it be something like the CLI calling '/deploy', and the gateway calling 'buildservice/v1/build' and then 'webservice/v1/reload' or similar?

About the version management, would it be transparent for the CLI? As in, it would not know which version of each microservice it is calling?

Would this include creating an abstraction layer at the API gateway itself? What would the versioning of that layer look like? Can you elaborate on what a breaking-change release of a microservice would look like?

Thanks again! +100 for raising the subject!

We can table this discussion until we're all back from vacation, but I'm addressing your questions now while this is still fresh in my head. Or as fresh as it gets when it's 30ºC inside. Seriously considering moving back to Sweden.

> Can you elaborate on the API gateway backend aggregation?

This would refer to how an API gateway can aggregate responses from multiple services into a single response for the client. This can be particularly useful when a client's request involves multiple services.

> Would it be something like the CLI calling '/deploy', and the gateway calling 'buildservice/v1/build' and then 'webservice/v1/reload' or similar?

Yes, that's a good example of how it might work. When a client calls a '/deploy' endpoint, the API gateway could route this to multiple services, such as 'buildservice/v1/build' and 'webservice/v1/reload'. The gateway could then aggregate any necessary responses such as status messages, logs, etc. from those services into a single response back to the client. In this example, we have both response aggregation and service orchestration as an extra bonus :)
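
To make that concrete, a minimal sketch of such a handler at the gateway (endpoints, hostnames, and response shape are all invented for illustration):

```go
// Illustrative only: the gateway handles /deploy by calling the build
// service and then the webservice, returning one aggregated response.
package gateway

import (
	"encoding/json"
	"net/http"
)

func handleDeploy(w http.ResponseWriter, r *http.Request) {
	buildResp, err := http.Post("http://buildservice.internal/v1/build", "application/json", r.Body)
	if err != nil || buildResp.StatusCode != http.StatusOK {
		http.Error(w, "build failed", http.StatusBadGateway)
		return
	}
	buildResp.Body.Close()

	reloadResp, err := http.Post("http://webservice.internal/v1/reload", "application/json", nil)
	if err != nil || reloadResp.StatusCode != http.StatusOK {
		http.Error(w, "webservice reload failed", http.StatusBadGateway)
		return
	}
	reloadResp.Body.Close()

	// Aggregate both outcomes into a single response for the client.
	json.NewEncoder(w).Encode(map[string]string{
		"build":      "ok",
		"webservice": "reloaded",
	})
}
```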

> About the version management, would it be transparent for the CLI? As in, it would not know which version of each microservice it is calling?

Ideally, yes. The CLI and other clients would interact solely with the API gateway, and the gateway would handle routing to the correct version of the microservice. This would mean the clients wouldn't need to know specifics about microservice versions.

> Would this include creating an abstraction layer at the API gateway itself? What would the versioning of that layer look like? Can you elaborate on what a breaking-change release of a microservice would look like?

The API gateway inherently acts as an abstraction layer, decoupling the client-side from the server-side concerns. This hides the complexity of the microservices architecture from the client, allowing client-side development to focus on the user interface and user experience.

In terms of versioning, this decoupling would help isolate changes within the microservices layer from impacting the client-side. When releasing a new version of a microservice with breaking changes, we can preserve backward compatibility by deploying it in parallel with older versions. The API gateway would then route requests to the correct version based on the request details (for example, URI version or header information).
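
For illustration, a sketch of URI-based version routing at the gateway (the path scheme is hypothetical):

```go
// Both versions of the build service stay deployed; the gateway simply
// routes by the version segment in the path.
package gateway

import "net/http"

func newVersionRouter(buildV1, buildV2 http.Handler) *http.ServeMux {
	mux := http.NewServeMux()
	mux.Handle("/builds/v1/", buildV1) // old clients keep working
	mux.Handle("/builds/v2/", buildV2) // migrated clients opt in
	return mux
}
```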

Let’s say the CLI is ready to adopt the new version, but we also have a UI that will take longer to migrate for whatever reason. The CLI could be migrated over, while the UI can continue using the older version until it’s ready to switch. In addition to routing, backwards compatibility can also be maintained by transforming requests/responses to and from microservices at the gateway layer.

Selective routing at the gateway layer would also allow for additional possibilities and flexibility, such as rolling out a beta feature to only a restricted list of users or application types, etc. Or let’s say we route a user’s requests to a new version of a service, and something goes wrong. The gateway can then handle the error by falling back to the old version of the same service. Nothing breaks for the user, and we could collect error logs and analyze what went wrong.
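
A rough sketch of that fallback behavior (illustrative only; requests with bodies would additionally need buffering so they can be replayed):

```go
// Try the new service version first; on a server error, log it and
// transparently retry the request against the old version.
package gateway

import (
	"log"
	"net/http"
	"net/http/httptest"
)

func withFallback(newer, older http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Buffer the first attempt so a failure is never sent to the client.
		rec := httptest.NewRecorder()
		newer.ServeHTTP(rec, r)
		if rec.Code < 500 {
			// The new version answered; copy its response through.
			for k, vals := range rec.Header() {
				for _, v := range vals {
					w.Header().Add(k, v)
				}
			}
			w.WriteHeader(rec.Code)
			w.Write(rec.Body.Bytes())
			return
		}
		// Something went wrong: collect the error and fall back.
		log.Printf("new service version failed with %d, falling back", rec.Code)
		older.ServeHTTP(w, r)
	})
}
```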

I think the key benefit here is that if we update or change the microservices, these changes can be effectively 'hidden' from the client by the API gateway. As long as the client's interaction with the API gateway remains consistent, the backend microservices can evolve independently.

Sorry if this was a huge wall of text, more things kept popping up in my mind as I was writing. xd

> Yes, that's a good example of how it might work. When a client calls a '/deploy' endpoint, the API gateway could route this to multiple services, such as 'buildservice/v1/build' and 'webservice/v1/reload'. The gateway could then aggregate any necessary responses such as status messages, logs, etc. from those services into a single response back to the client. In this example, we have both response aggregation and service orchestration as an extra bonus :)

That sounds to me like something a deploy service would do, not something done at the HTTP ingress/routing layer that the API gateway sits on. While the /deploy endpoint seems quite simple in your example, it still has quite a bit of logic to implement (what if the build fails? what if the tool does not have a webservice? what if it needs to restart a continuous job but only after some database migration finishes? what if you want to send the tool author some notification after a successful deploy?).

>> Yes, that's a good example of how it might work. When a client calls a '/deploy' endpoint, the API gateway could route this to multiple services, such as 'buildservice/v1/build' and 'webservice/v1/reload'. The gateway could then aggregate any necessary responses such as status messages, logs, etc. from those services into a single response back to the client. In this example, we have both response aggregation and service orchestration as an extra bonus :)
>
> That sounds to me like something a deploy service would do, not something done at the HTTP ingress/routing layer that the API gateway sits on. While the /deploy endpoint seems quite simple in your example, it still has quite a bit of logic to implement (what if the build fails? what if the tool does not have a webservice? what if it needs to restart a continuous job but only after some database migration finishes? what if you want to send the tool author some notification after a successful deploy?).

The existing gateway API may indeed not be the place to add this type of functionality, if the desire is to limit its responsibilities to network-level concerns and keep it agnostic to any of the business logic. What I have been calling a gateway API might then more accurately be called an orchestration layer or service. It would still be a common entrypoint for all the backing microservices.

Having a separate deploy service which depends on build and webservice (and maybe others) would introduce excessive coupling between these microservices in my opinion, compared to having the orchestration layer call them sequentially and deal with the responses as necessary.

Some additional ideas. TL;DR: let's move to a monolithic API.

Toolforge is overall a tiny service from the API exposure/usage point of view, and we may not get a lot of benefits from the purely microservices approach after all, i.e., we don't need things like massive scale, and we don't need the availability or release pattern of the jobs/ endpoint to be any different from the builds/ endpoint.

We may consider consolidating into:

  • a monolithic backend API: i.e., serve envvars, builds, jobs, etc. from a single "fat" service, defined using Swagger.
  • a monolithic frontend CLI: purely written in golang, autogenerated from the previous single Swagger definition. The golang bit may give us advantages, like easily distributing the binary for people to use from their laptops. I don't think this bit is controversial. Imagine that users would only need credentials + curl a toolforge binary from their laptops to get started.
  • having a single Swagger definition for everything could help us generate lang-dedicated libraries to interact with the single API. Imagine generating a toolforge-lib-python, toolforge-lib-rust, toolforge-lib-php, toolforge-lib-js, etc. Toolforge tool developers could use them in their code to interact with Toolforge, which addresses a limitation we currently have that is not being handled very well in any of the recent architectural decisions we're making.
  • the auth component, which is today handled by the API gateway, could be folded into the monolithic API (or not, I don't really care). I think the API gateway was born as a way to reuse functionalities, but in a monolithic API model it may have fewer reasons to exist.
  • this may greatly simplify the Toolforge kubernetes setup overall, with fewer things to deploy, etc. It would also potentially simplify the development setup on each engineer's laptop.
  • this monolithic approach also allows us to easily implement whatever logic we need in a single place, like the mentioned /deploy action (see the sketch below).
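
As a sketch of what that single "fat" service could look like (all routes and names invented for illustration):

```go
// Illustrative monolithic API: one binary serves every subsystem, and a
// cross-cutting action like /deploy becomes plain in-process function
// calls instead of gateway-level orchestration of separate services.
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	mux := http.NewServeMux()
	// Each subsystem registers its routes on the shared mux; in a real
	// codebase these handlers would live in separate internal packages.
	mux.HandleFunc("/builds/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "builds endpoint")
	})
	mux.HandleFunc("/jobs/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "jobs endpoint")
	})
	mux.HandleFunc("/deploy", func(w http.ResponseWriter, r *http.Request) {
		// build(...); reloadWebservice(...)  // direct calls, no HTTP hops
		fmt.Fprintln(w, "deployed")
	})
	log.Fatal(http.ListenAndServe(":8000", mux))
}
```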

For a future-future-future, we may even consider extending this monolith to perform additional tasks, like what maintain-kubeusers is doing today. In practice, this would make the Toolforge "brain" a single binary or app.

The more I think about this monolithic API approach, the more I like it. Why didn't we think about it before? Well, we needed to walk a path to get here:

  • at the beginning everything we had was the design of the webservices command line
  • we proved the API+CLI design was valid with the introduction of the jobs-framework
  • this architecture has been reused and extended to other toolforge services (builds, envvars, etc)
  • this is the point at which I'm writing this proposal: I no longer see the point in having multiple small API services.
dcaro changed the task status from Open to In Progress.Sep 1 2023, 12:55 PM
dcaro assigned this task to Slst2020.
dcaro moved this task from Next Up to In Progress on the Toolforge Build Service (Iteration 19) board.

Let me try to sort out my ideas here. Let's first take a look at what we care about:

  1. User-wise:
    1. Users to have a CLI as the main entry point that they run from the bastions
    2. Users to have a CLI as the main entry point that they run from their own systems -> this is a long-term goal
    3. Keep open the possibility of implementing a web UI as an entry point, implementing as little as possible (reusing what the CLI does)
    4. Be able to offer users a single HTTP API as a secondary entry point
    5. Be able to offer as many client libraries as possible (python, php, perl, javascript, ...)
  2. Maintenance/operational/development-wise:
    1. Reuse/minimize as much boilerplate as possible; this might mean not duplicating it (ex. monorepo) or having some automated way to maintain copies (ex. modules)
    2. Make it easy for newcomers to develop and contribute
    3. Make it easy to avoid introducing new and old bugs (easy to test)
    4. Make it easy to deploy
    5. Keep systems as decoupled as possible
    6. Keep systems as cohesive as possible (things that "belong" together are placed together)
  3. There's also the discussion of how we want to organize ourselves: whether we are going to work in small groups (of 1 or more) or as one group (ex. workgroup), as that will have a strong impact on the effectiveness of each of the solutions proposed.

Keep in mind that none of these are black or white, all are ranges of grey, and most (arguably all) of them are subjective.

About the solutions themselves, there are also many aspects to them:

  • Monorepo/multirepo
  • Monolith/per-service API/microservices
  • Common code in libraries/duplicated code/common code in submodules/...
  • Single binary-deployment/many binaries-deployments
  • API gateway or not
  • git-like CLI (many smaller binaries)/fat CLI
  • Single package/many packages

...

Note that most of these are related, but not the same (you can have a monolith using many repositories, deployed as different deployments that have different configurations).

I would like to have a goal first, and then propose 1+ solutions to it (and think a bit on how each affects each of the goals in mind).

> I would like to have a goal first, and then propose 1+ solutions to it (and think a bit on how each affects each of the goals in mind).

For the decision request I mean, that way there's at least similar criteria to compare each proposal with.

In my view, the main goal here is reducing extrinsic complexity in the Toolforge architecture. I think the original message from @Slst2020 and many of the comments aim to reduce some of the existing complexity without introducing major drawbacks.

So to me the decision request could be split into many independent questions (the "aspects" listed by @dcaro), asking for each of them: is it acceptable to simplify this aspect, or would doing so have a negative effect on one of the "areas we care about" (as listed by @dcaro)?

E.g.

  • Monorepo vs multirepo: a single repo is simpler, do you think there is a significant drawback in choosing it?
  • Single binary-deployment/many binaries-deployments: a single binary is simpler, do you think there is a significant drawback in choosing it?
  • etc...

I expect on some of the solutions we will easily find a consensus (for example it looks like most comments so far agree on the idea of a "fat CLI") and on some others we will have divergent opinions.

> There's also the discussion of how we want to organize ourselves: whether we are going to work in small groups (of 1 or more) or as one group (ex. workgroup), as that will have a strong impact on the effectiveness of each of the solutions proposed.

While I agree the effectiveness of a solution varies with the size of the teams/groups, I am not convinced that in our case this will have a "strong impact". The number of people currently involved in the development of Toolforge is pretty small. I personally think that whether we decide to work in very small groups (1-2 people) or slightly larger ones (4-5 people), the size of the groups will be below the tipping point where these decisions (e.g. single repo vs multiple repos) have a significant impact on efficiency. I found a relevant quote from Martin Fowler:

> Conway's Law doesn't impact our thinking for smaller teams. It's when the humans need organizing that Conway's Law should affect decision making. (source)

> While I agree the effectiveness of a solution varies with the size of the teams/groups, I am not convinced that in our case this will have a "strong impact". The number of people currently involved in the development of Toolforge is pretty small. I personally think that whether we decide to work in very small groups (1-2 people) or slightly larger ones (4-5 people), the size of the groups will be below the tipping point where these decisions (e.g. single repo vs multiple repos) have a significant impact on efficiency. I found a relevant quote from Martin Fowler:

The relevant aspect is not the size of the groups, but the "responsibility" and agency of each group, similar to the way we have been working until now, in which Arturo and Taavi were the only ones handling the jobs framework while the Toolforge build service workgroup was handling the build service and related code. If we agree on all working together on all of the code, then different standards and practices are not expected, so having everything in the same repo is good; if we decide to have different groups, the practices each group decides to follow will differ, and thus it makes more sense to split the code too. The same goes for reviews and technical decisions, where each group would have its own view of its project's next steps.

@fnegri Regarding the quote you made, the full quote gives some more info:

> A dozen or two people can have deep and informal communications, so Conway's Law indicates they will create a monolith. That's fine - so Conway's Law doesn't impact our thinking for smaller teams. It's when the humans need organizing that Conway's Law should affect decision making.

The key point there being "can have deep and informal communications": if we split into different teams (as we are right now), we have to start considering the communication patterns again :/

I find it interesting to think about distributed teams in that sense, especially across timezones, as communication becomes quite different (less synchronous, less fluid, more independent, trying to decouple work). You could consider (as a thought experiment) that each member becomes a team of one: you would see split systems where a single person handles each of them, and in order to "share" systems you either pair people in similar timezones, or have to compensate for the lack of communication by setting up communication-heavy hours during the timezone overlap.

I appreciate the diverse viewpoints and thoughtful discussions that have been shared, both in this thread and in the recent Toolforge work group meeting, the notes for which can be found here: https://docs.google.com/document/d/1R67SqoIWghkTfjHd1o-jk0N54FsePEVSojSeV6Hn7ts/edit

The question now would be how to narrow this down into a few options for a decision request, or even into the minimal number of options that would enable moving forward in an iterative way.

One option would of course be to just keep the status quo.

My original proposal argues for a hybrid approach, i.e. a monolithic CLI and a microservice-based backend. In this specific scenario, I see the proposed gateway API as non-optional in order to preserve decoupling (among other things; I don't want to restate arguments made elsewhere in the thread).

Another option is @aborrero's proposal to also consolidate the current backends into a monolithic API. This decision could be made independently from any decisions affecting the frontend, but in this case the need for a gateway API is less obvious.

Are there any additional high-level options?

I see kind of two goals here, to which I'll add some options in no specific order:

Simplify user client installation

  • With a single binary (generated from the API)
  • With several self-contained binaries + wrapper (similar to multiple rpms, but 'curlable')
  • With a single RPM (guessing that's your proposal, please correct me if I got it wrong)
  • With multiple RPMs but one 'meta' rpm to install all (toolforge-cli pulling in all the other clis, so you can still install only one specific cli if you want)
  • Multiple RPMs (current status)

Simplify development/operation/maintenance

System architecture

  • Smart API gateway + several per-service APIs
  • Slim API gateway + several per-service APIs with inter-service calls (this would be having for example a 'deploy' service that calls the others, instead of doing so in the API gateway)
  • Slim API gateway + several per-service APIs without inter-service calls (this is the current status, where the client has to be "smart" and glue the APIs around)
  • Single API service with all the services in it

Code hosting

  • Put all the code in one repository (this means generating different binaries from the same repo for a non-monolith backend)
  • Put the code for each service in its own repository (this means having submodules or similar for the monolith backend)

Not sure about the last one or if it's just a side-effect.

I hope that helps?

> In my view, the main goal here is reducing extrinsic complexity in the Toolforge architecture. I think the original message from @Slst2020 and many of the comments aim to reduce some of the existing complexity without introducing major drawbacks.

Yes – this is how I see it as well.

@dcaro I appreciate the effort to enumerate various options for both client installation and development/operation/maintenance. However, I feel that diving into these specifics at this point might be a bit premature and could potentially detract from the larger architectural questions we're trying to address.

I think our immediate goal should be to narrow down high-level architectural options into a manageable set that can be formalized into a decision request. Once we have a clearer understanding of the overarching structure—be it a monolithic API, a microservices-based backend, or some hybrid approach—we can then delve into the finer details of client installation methods, code hosting, and so on.

Would it be possible to refocus the discussion on these high-level architectural choices? Are there any additional options or modifications to the existing proposals that you think should be considered at this stage?


>> In my view, the main goal here is reducing extrinsic complexity in the Toolforge architecture. I think the original message from @Slst2020 and many of the comments aim to reduce some of the existing complexity without introducing major drawbacks.
>
> Yes – this is how I see it as well.
>
> @dcaro I appreciate the effort to enumerate various options for both client installation and development/operation/maintenance. However, I feel that diving into these specifics at this point might be a bit premature and could potentially detract from the larger architectural questions we're trying to address.
>
> I think our immediate goal should be to narrow down high-level architectural options into a manageable set that can be formalized into a decision request. Once we have a clearer understanding of the overarching structure—be it a monolithic API, a microservices-based backend, or some hybrid approach—we can then delve into the finer details of client installation methods, code hosting, and so on.
>
> Would it be possible to refocus the discussion on these high-level architectural choices? Are there any additional options or modifications to the existing proposals that you think should be considered at this stage?

I guess this section might be what you are looking for?

> System architecture
>
>   • Smart API gateway + several per-service APIs
>   • Slim API gateway + several per-service APIs with inter-service calls (this would be having for example a 'deploy' service that calls the others, instead of doing so in the API gateway)
>   • Slim API gateway + several per-service APIs without inter-service calls (this is the current status, where the client has to be "smart" and glue the APIs around)
>   • Single API service with all the services in it

That's only for the backend side though; there are two decisions tangled here. For the client side, the options would be:

  • Single codebase/monolith (saying nothing about one or many rpms/packages/hosting/etc.)
  • Per-service codebases (again, saying nothing if it will be bundled in one package, or many)

So if you want to bind the decisions together, that'd be something like:

  • backend(Smart API gateway + several per-service APIs without inter-service calls) + client(Single codebase/monolith) - I think this was your proposal
  • backend(Single API service with all the services in it) + client(Single codebase/monolith) - I think this was Arturo's
  • backend(Slim API gateway + several per-service APIs without inter-service calls) + client(Per-service codebases) - this would be status quo

I would add the following two:

  • backend(Slim API gateway + several per-service APIs with inter-service calls) + client(Single codebase/monolith)
  • backend(Slim API gateway + several per-service APIs with inter-service calls) + client(Per-service codebases)

@dcaro Thanks for clarifying the options. I agree with your breakdown and suggest we tackle this in phases to avoid overwhelming the team with too many decisions at once. Here are the refined options for the first decision round:

  • Backend(API gateway + several per-service APIs) + Client(Single codebase/monolith) - My Proposal
  • Backend(Single API service with all services in it) + Client(Single codebase/monolith) - Arturo's Proposal
  • Backend(API gateway + several per-service APIs) + Client(Per-service codebases) - Status Quo

If the chosen option includes an API gateway and a microservice-based backend, a second round will decide between slim/smart API gateway and inter-service communication options. This keeps the focus on our immediate architectural choices. Once we've made a decision here, we can delve into the specifics.

These options would be evaluated against the goals/criteria mentioned in the discussions (throwing in a few additional ones :). In no particular order:

  • Reducing Complexity: How well does the option simplify the overall architecture and make it easier to manage and reason about?
  • Scalability: How well can the architecture handle increasing user demand and feature expansion?
  • Maintainability: What level of effort is required to maintain the system, including bug fixes, updates, and adding new features?
  • Team Alignment: Does the architecture align with the team's structure and practices, allowing for effective collaboration and contribution?
  • Ease of Installation: How straightforward is it to install the system, both for developers and end-users?
  • Ease of Contribution: How accessible is the system for new contributors?
  • User Experience: Does the architecture facilitate a seamless and efficient experience for the end-users?
  • Operational Complexity: What is the impact of the architecture on deployment, monitoring, and logging?
  • Iterative Development: How well does the architecture support iterative development and adjustments over time?

> • Backend(Slim API gateway + several per-service APIs without inter-service calls) + Client(Per-service codebases) - Status Quo

This should be Backend(API gateway + per-service APIs) + Client(Per-service codebases), right? (skipping the slim/fat gateway, and inter-service calls or not)

> These options would be evaluated against the goals/criteria mentioned in the discussions (throwing in a few additional ones :). In no particular order:

This is really helpful, thanks!

> This should be Backend(API gateway + per-service APIs) + Client(Per-service codebases), right? (skipping the slim/fat gateway, and inter-service calls or not)

Good catch – I've edited my previous comment to fix this.

Slst2020 changed the task status from In Progress to Stalled.Sep 28 2023, 8:54 AM