
Decision request - Toolforge build service to move to an API design
Closed, Resolved (Public)

Description

Problem

The current design relying on the cli to interact with k8s makes it hard to implement certain very desirable features like:

  • Users to see build logs
  • Quotas for builds
  • Multiple interfaces (Potential UI + cli)
  • Local cli execution (outside of the bastions)

Note that this decision only covers moving to an API design. Details such as the implementation language will be discussed later in {T325382: Decision request - Choose a lang for the toolforge build service API} if we decide to go for it.

Constraints and risks

  • Developing certain features (starting with the above ones) will become harder and harder
  • Code implementing business processes will have to be duplicated for each interface (e.g. cli, web, ...)

Decision record

Option 2 was chosen: move to an API design.

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Decision_record_T326136_Toolforge_build_service_to_move_to_an_API_design

Options

Option 1

Leave it like it is now.

Pros:

  • No immediate effort needed

Cons:

  • Increased difficulty, and potential blockers, when implementing future features (see the list in the description)
  • Increased maintenance for more than one client platform

Option 2

Create an API living in k8s that will take care of interacting with the k8s service, and expose all the needed functionality to the clients through an http API.

Pros:

  • Unblocks many features
  • Eases the development and maintenance of different clients

Cons:

  • Some initial investment is needed to develop the API

Event Timeline

From the previous discussions:

{{T325382#8484498}}

taavi added a subscriber: taavi. Wed, Dec 21, 13:48
I do think moving to an API based design makes sense, but I don't have a preference whether it should be written in Go or Python.

{{T325382#8488374}}

...
For me, the reasons stated in the proposal aren't enough to justify the adoption of golang at such scale. That's why my vote goes first for option 4, then for any of 3 or 2.

(4, 3 and 2 were in favor of API)

My vote goes to Option 2.

The API architecture allows us to have complete flexibility in terms of how we design and implement not only various clients, but also the backend. At one end of the spectrum, the API could be a facade that simply distributes tasks to other microservices, or it could be a monolith that handles all the work itself. Or anything in between. It's also easily extensible and could become the main entry point for anything Toolforge.

I vote for option 2.

Traditionally we haven't followed this API approach. It was first introduced with the Toolforge jobs framework (TJF) as an experiment, to overcome some of the limitations we've traditionally found with tools-webservice (mainly: high maintenance costs, slow speed for releasing new features, etc.). The API model of the TJF has proven not only that it is a sustainable approach, but also that it can reduce maintenance effort in some cases, increase development speed for some features, and increase the overall reliability of the system.

This, however, comes with some costs, primarily:

  • there are additional considerations regarding all things auth. Described here.
  • the development setup could become a bit more complex (literally, there are more components to set up). An ongoing effort to reduce complexity in the development setup can be found at the lima-kilo project.

In my opinion, the pros outweigh the cons.

However, here are some notes on the problem statement:

The current design relying on the cli to interact with k8s makes it hard to implement certain very desirable features like:

  • Users to see build logs
  • Quotas for builds
  • Multiple interfaces (Potential UI + cli)
  • Local cli execution (outside of the bastions)

Regarding the first 2 points: I don't understand why introducing an intermediate API would be of any special help in developing them.
Regarding point 3: yes, true.
Regarding point 4: that would be difficult due to the current auth model. We would need a more robust auth model to allow execution outside Toolforge bastions, that probably doesn't rely on internal kubernetes certificates as generated from maintain_kubeusers.

Thanks Arturo, let me see if I can answer your questions,

However, here are some notes on the problem statement:

The current design relying on the cli to interact with k8s makes it hard to implement certain very desirable features like:

  • Users to see build logs
  • Quotas for builds
  • Multiple interfaces (Potential UI + cli)
  • Local cli execution (outside of the bastions)

Regarding the first 2 points: I don't understand why introducing an intermediate API would be of any special help in developing them.

For point 1, "Users to see build logs":

Users currently can't see the logs from the build process because it runs as a privileged user, which it needs in order to push images to harbor. Giving users unrestricted access to all the containers' logs (which is what the current k8s access control provides) would expose those credentials.

By adding the API layer, the user authenticates to the API, which then reads the logs as the API user and only shows the ones from containers that don't expose any credentials.
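As a rough illustration of that filtering step, here is a minimal sketch. Everything in it is an assumption for the sake of the example: the container names, the `LogLine` type and the `visible_build_logs` helper are hypothetical, and how the API would actually fetch the logs from k8s is left out.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical allowlist: containers assumed NOT to handle the harbor
# credentials. The real container names would come from the build pipeline.
LOG_SAFE_CONTAINERS = frozenset({"prepare", "build"})


@dataclass(frozen=True)
class LogLine:
    """One log line, tagged with the container that produced it."""

    container: str
    message: str


def visible_build_logs(
    lines: Iterable[LogLine],
    allowed: frozenset = LOG_SAFE_CONTAINERS,
) -> List[LogLine]:
    """Return only the log lines coming from allowlisted containers.

    The API itself would read every container's logs with its own
    privileges; the user only ever receives this filtered view.
    """
    return [line for line in lines if line.container in allowed]
```

An allowlist is used here rather than a denylist so that a newly added container stays hidden until someone explicitly marks it as safe.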

For point 2, "Quotas for builds", the same applies: k8s does not provide a way to set that up, and tekton does not yet support it either. Having the API layer on top again allows denying the request if the user has already used up its quota of parallel builds.
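The quota check could look something like the sketch below. It assumes the API can count the builds currently running per tool; the `check_build_quota` function, the `QuotaExceeded` exception and the default limit of one parallel build are all hypothetical, not something decided in this task.

```python
from typing import Mapping


class QuotaExceeded(Exception):
    """Raised when a tool already uses its whole parallel build quota."""


def check_build_quota(
    tool: str,
    running_builds: Mapping[str, int],
    max_parallel: int = 1,  # assumed default, not an agreed value
) -> None:
    """Deny a new build request if the tool is already at its limit.

    `running_builds` maps tool name -> number of builds in progress;
    in practice the API would derive it from the build runs in k8s.
    """
    current = running_builds.get(tool, 0)
    if current >= max_parallel:
        raise QuotaExceeded(
            f"{tool} already has {current} running build(s), "
            f"the limit is {max_parallel}"
        )
```

The API would call this before creating the build and turn `QuotaExceeded` into an error response for the client.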

Regarding point 4: that would be difficult due to the current auth model. We would need a more robust auth model to allow execution outside Toolforge bastions, that probably doesn't rely on internal kubernetes certificates as generated from maintain_kubeusers.

I'm not sure whether the current auth model is robust enough or not, but it's certainly possible to use it to connect to the API from anywhere you want as long as there's connectivity (just copy the cert over and allow the traffic, even using ssh tunnels if needed).

I also like option 2! I'm slightly concerned by the auth model as @aborrero mentioned, but I'm sure we can figure out how to fix it, if the current one is not robust or flexible enough.

Moving to this API-based design sounds like a big project overall, and I think it's important to break it down into smaller tasks/epics with well-defined goals, but as far as this decision request is concerned, I agree we should choose option 2.

No meeting was needed as there's consensus: Option 2 was chosen.