
Decision request - Toolforge kubernetes container images
Closed, Resolved · Public

Description

Problem

As of this writing, one of the main reasons Toolforge tool developers keep using GridEngine instead of Kubernetes is that our current k8s setup doesn't support mixing runtime environments. A tool that uses both java & python can only run in the grid. In Kubernetes we provide a fixed list of container images, each with a single runtime environment (for example python, nodejs, php, java, etc).

In the past, it was decided that a buildpack-based approach was the right solution to this problem. However, that project is technically challenging, complex, and requires a non-trivial amount of engineering work. The result is that the project is not ready to go yet and is not expected to be available at least until TODO: when?.

There is, however, another potential approach to unblock this situation in the short term: enable Bring Your Own Container (BYOC) while the buildpacks project is being completed. This means allowing Toolforge developers to create Kubernetes workloads using container images they build themselves.

Some clarifications

Let's assume we have 3 categories of users in Toolforge:

  1. non-engineer, basic users: they follow a tutorial to deploy a basic tool. They want easy abstractions and shortcuts to be able to perform complex tasks in a simple fashion. These users know at most one programming language, and they don't know anything about containers, docker or kubernetes.
  2. intermediate users: anyone between the previous category and the next.
  3. engineer-level, advanced users: these users know more than one programming language, know some software engineering practices, and can follow an online tutorial to create a docker container. They know (or could easily understand) what's inside Toolforge, and the basics of how kubernetes works.

The BYOC feature is targeted at users in category 3. These are the users with the most complex tools in Toolforge, potentially still on the grid, that cannot move to Kubernetes because (for example) they mix multiple exec runtimes.

Users in this category have traditionally shown interest in BYOC for Toolforge.

Constraints and risks

The fact that we disallow BYOC is mostly documented in a single place, this wikitech page, which reads:

We restrict only running images from the Tools Docker registry, which is available publicly (and inside tools) at docker-registry.tools.wmflabs.org. This is for the following purposes:

1. Making it easy to enforce our Open Source Code only guideline
2. Make it easy to do security updates when necessary (just rebuild all the containers & redeploy)
3. Faster deploys, since this is in the same network (vs dockerhub, which is retrieved over the internet)
4. Access control is provided totally by us, less dependent on dockerhub
5. Provide required LDAP configuration, so tools running inside the container are properly integrated in the Toolforge environment

This is enforced with a K8S Admission Controller, called RegistryEnforcer. It enforces that all containers come from docker-registry.tools.wmflabs.org, including the Pause container.
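
For reference, a registry-restricting validating webhook is registered with the API server roughly as sketched below. This is only an illustration of the mechanism; the names, namespace and service are hypothetical, not the actual RegistryEnforcer configuration.

```
# Illustrative sketch only: how a registry-enforcing validating webhook is
# typically registered. Names and the service reference are hypothetical,
# not the real RegistryEnforcer deployment.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: registry-enforcer-example
webhooks:
  - name: registry-enforcer.example.org      # hypothetical
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail                       # reject pods if the webhook is unavailable
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    clientConfig:
      service:
        namespace: registry-admission         # hypothetical
        name: registry-admission
        path: /validate
# The webhook backend receives each Pod spec and rejects it unless every
# container image (including the pause container) starts with
# "docker-registry.tools.wmflabs.org/".
```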

Any decision taken in this topic should consider those five points.

In particular, one could argue that:

  1. we don't have any active scanning of software inside containers. Claiming that our users comply with the open-source-code-only policy because we control the base container image is a bit naive.
  2. we should review and discuss our current security maintenance practices for Toolforge. This is pretty much independent of any BYOC/buildpacks debate.
  3. deployment speed is a good point, but mostly relevant for tools that redeploy constantly. If we detect this becomes a problem, we could open our existing Docker registry for tool users to cache their images there.
  4. it is not clear what access control means in this point, or what specific needs we have.
  5. The LDAP configuration is important, so if we enable any form of BYOC then clear instructions should be provided for our users to build their container images on a base layer of our own. Otherwise their tools may not work as expected.

Decision record

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Decision_record_T302863_toolforge_byoc

Options

Option 1

Enable BYOC. This enables a new workflow/use-case in Toolforge.

The simplest implementation of this option consists of:

  • disabling our custom Kubernetes registry admission controller
  • creating some docs for our users on how to effectively benefit from the new feature
  • communicating the change to our users

What to do with BYOC if and when buildpacks are ready to go is left for a future decision process. In particular, enabling BYOC does not prevent the buildpack project from being completed/implemented.

Pros:

  • Less dependency on the grid.
  • Less dependency on NFS (users could ship their code inside the container instead of reading it from NFS).
  • Easy to implement.

Cons:

  • Enabling a new feature may mean supporting this new feature forever. If that is a concern, see option 2.

...

Option 2

Enable BYOC in a *temporary fashion*. This enables a new workflow/use-case in Toolforge, but only for the period during which the buildpacks project is not yet completed, with the sole purpose of helping people migrate their tools away from GridEngine to Kubernetes.

The simplest implementation of this option consists of:

  • disabling our custom Kubernetes registry admission controller
  • creating some docs for our users on how to effectively benefit from the new feature
  • clearly communicating with our users, with a focus on the temporary nature of the new feature

Pros:

  • Less dependency on the grid.
  • Easy to implement.

Cons:

  • Given the temporary nature of the feature, users may choose not to adopt the solution.

...

Option 3

Leave BYOC disabled (discard this request) and hope that the buildpack project completes soon.

Pros:

  • TBD

Cons:

  • TBD

Option 4

Enable BYOC only for a few selected users that request it.
Similar to some special Cloud VPS features that are enabled only for specific projects that can demonstrate the requirement.

Implementation would be as follows:

  • modify the registry admission controller and introduce support for reading a configmap with an allow list (sketched below)
  • if a tool namespace is present in the allow list, then allow arbitrary container registries
  • the WMCS team will review requests and update the configmap accordingly
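
A minimal sketch of what such an allow list could look like (the ConfigMap name, namespace and key are hypothetical, purely for illustration):

```
# Illustrative sketch only: an allow-list ConfigMap read by the modified
# registry admission controller. All names here are hypothetical.
apiVersion: v1
kind: ConfigMap
metadata:
  name: registry-enforcer-allowlist
  namespace: registry-admission
data:
  # Tool namespaces listed here would be allowed to pull from arbitrary registries.
  allowed-namespaces: |
    tool-example-one
    tool-example-two
```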

Pros:

  • The impact of the feature is limited, because it is not enabled for arbitrary users.

Cons:

  • The WMCS team has to gatekeep this feature.

Event Timeline

Off the top of my head, I see a couple of things that can be added to the cons of the overall solution:

  • Size of the containers: build-your-own means that anyone can use any base image, with any content. That also means extra space for each layer that is not shared, as opposed to reusing the same layers (as we do now, and as buildpacks would in most cases).
  • Temporary, as I see it, means at least 5 years (given other temporary things), so it would effectively mean maintaining yet another service.
  • Creating a container that integrates with NFS and such is not trivial, so I see us either creating a base container and a method to extend it, or providing support on how to create such a container. Both cases are extra setup and maintenance work, which challenges to some extent the "easy to implement" pro; maybe rephrase it as "easier to implement than buildpacks".

If the solution is only for a container that does not integrate with NFS/LDAP, then the first implementation of buildpacks will already do that, and I don't think it's so far away, so I'd say it's better to focus efforts there instead.

"Claiming that our users comply with the open-source-code-only policy because we control the base container image is a bit naive."
Agree, though it does not take away that it's way way better to have the base image than not to, even if it does not mean full compliance (and makes it way way easier for any future effort in compliance).

Off the top of my head, I see a couple of things that can be added to the cons of the overall solution:

  • Size of the containers: build-your-own means that anyone can use any base image, with any content. That also means extra space for each layer that is not shared, as opposed to reusing the same layers (as we do now, and as buildpacks would in most cases).

See below for container layers, and the base image.

  • Temporary, as I see it, means at least 5 years (given other temporary things), so it would effectively mean maintaining yet another service.

BYOC doesn't introduce a new service for us to support, only a new use case. We literally have fewer things to maintain if we enable BYOC, not more. Please correct me if I'm wrong.

  • Creating a container that integrates with NFS and such is not trivial, so I see us either creating a base container and a method to extend it, or providing support on how to create such a container. Both cases are extra setup and maintenance work, which challenges to some extent the "easy to implement" pro; maybe rephrase it as "easier to implement than buildpacks".

The only thing required to integrate with our NFS/LDAP infra is to base the container on this one:

https://github.com/wikimedia/operations-docker-images-toollabs-images/blob/master/bullseye-standalone/Dockerfile.template

Which only has 2 things that make it special:

  • a package installed: libnss-sss
  • a file copied: /etc/nsswitch.conf

That's all.
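
For illustration, a BYOC image that keeps that integration could look roughly like the sketch below. The base image name and tag are assumptions, not necessarily the exact ones published in the registry; the point is simply to start FROM a Toolforge base image and add whatever extra runtimes the tool needs.

```
# Illustrative sketch only: a custom image based on the Toolforge image that
# already ships libnss-sss and the right /etc/nsswitch.conf. The image name
# and tag are assumptions; check docker-registry.tools.wmflabs.org for the
# real ones.
FROM docker-registry.tools.wmflabs.org/toolforge-bullseye-standalone:latest

# Example of mixing runtimes (java + python) in a single image, which the
# current fixed per-runtime images do not allow.
RUN apt-get update && \
    apt-get install -y --no-install-recommends default-jre-headless python3 && \
    rm -rf /var/lib/apt/lists/*

# Optionally ship the tool's code inside the image instead of reading it from NFS.
COPY . /app
WORKDIR /app
```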

If BYOC is enabled, we should strongly recommend that our users base their images on that one for ease of use.
If they choose not to, their tool likely won't work as expected. So: a win for us, a win for them.

Additionally, if we wanted to enforce this, we could modify the already present registry admission controller to inspect the docker image of a pod being created and ensure it uses one of our trusted base images. But honestly, I think that's optional.

The only thing required to integrate with our NFS/LDAP infra is to base the container on this one:

https://github.com/wikimedia/operations-docker-images-toollabs-images/blob/master/bullseye-standalone/Dockerfile.template

Which only has 2 things that make it special:

  • a package installed: libnss-sss
  • a file copied: /etc/nsswitch.conf

That's all.

That only works for the Debian base image, and might need changes depending on the Debian version. (I'll use this for the buildpacks :) )

See below for container layers, and the base image.

This is not addressed by the comments; the sprawl of layers and disk usage is not handled (that would include maintenance of the servers to make sure they clean up layers and don't run out of space, etc.).

BYOC doesn't introduce a new service for us to support, only a new use case.

That depends on the definition of "service" you use. In any case, what matters is that we have to maintain docs and some code (e.g. the base images), make sure it runs somewhere (in this case k8s), and support users; so, effectively, extra work.

We literally have fewer things to maintain if we enable BYOC, not more. Please correct me if I'm wrong.

I don't think we have fewer things to maintain: nothing goes away by enabling this service, and no plans change either, so overall it's an addition both now and later.

That only works for the Debian base image, and might need changes depending on the Debian version. (I'll use this for the buildpacks :) )

I don't see what the problem is with every container image being based on a Debian base image under our control (the base image with the NFS/LDAP integration bits enabled).
This is what we have today anyway: every container running on k8s uses this Debian base image.

See below for container layers, and the base image.

This is not addressed by the comments; the sprawl of layers and disk usage is not handled (that would include maintenance of the servers to make sure they clean up layers and don't run out of space, etc.).

Kubernetes does this on its own (see https://kubernetes.io/docs/concepts/architecture/garbage-collection/#containers-images). I don't think we need anything special for this.

If we detect storage problems related to this (unlikely) we will need to figure out other solutions. But this is true as well for buildpacks-generated images. BYOC doesn't introduce any additional pressure here.
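
For reference, the image garbage collection linked above is driven by per-node kubelet thresholds, roughly as sketched below (the values shown are the upstream defaults, not necessarily what our nodes run):

```
# Illustrative sketch only: kubelet settings that control image garbage
# collection. Values are the upstream defaults, not Toolforge's configuration.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Start deleting unused images when the image filesystem usage exceeds the
# high threshold, and stop once it drops below the low one.
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
```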

BYOC doesn't introduce a new service for us to support, only a new use case.

That depends on the definition of "service" you use. In any case, what matters is that we have to maintain docs and some code (e.g. the base images), make sure it runs somewhere (in this case k8s), and support users; so, effectively, extra work.

I don't think I follow. BYOC is a feature that is currently disabled. We only have to enable it. We don't have to create anything specific, or maintain anything.
On the contrary, disabling BYOC required creating a custom registry admission controller, which needs to be maintained, deployed, documented, etc.

So, to your statement:

  • maintain some docs: fair, I guess?
  • some code (base images, and?): We already maintain the container images. Enabling BYOC doesn't imply any new code for them...
  • make sure it runs somewhere: make sure k8s can run a container? That's pretty basic. We don't need anything special for that, no? I don't follow...
  • support users: fair

I don't think the extra work you are referring to is meaningful, honestly. Or at least, I would need a more elaborate explanation to better understand what you mean.

I don't think we have fewer things to maintain: nothing goes away by enabling this service, and no plans change either, so overall it's an addition both now and later.

Our registry admission controller could go away. A dependency on NFS could go away.
While minimal, this could be considered a net loss (in this case, a win) :-P

I don't see what the problem is with every container image being based on a Debian base image under our control (the base image with the NFS/LDAP integration bits enabled).

Then we will have to enforce that every container is based on this image (and find a way to enforce it).

If we detect storage problems related to this (unlikely) we will need to figure out other solutions.

The main storage problems I've had with k8s in the past were exactly this: running different containers on a node and pulling many different layers for each. I guess it depends on the actual usage, for which we have no data; maybe we can make an educated guess from what's currently running on the grid?

But this is true as well for buildpacks-generated images. BYOC doesn't introduce any additional pressure here.

One of the big advantages of buildpacks is the reuse of the core layers, with only one custom layer holding the results of the build process (not the build tools). That also lets buildpacks upgrade the core layers while adding very little extra space and without needing to re-pull the layers themselves. So BYOC would use notably more storage, whether or not everyone uses the same base image.

maintain some docs: fair, I guess?

Specifically, docs about how to build your own containers to run in Toolforge; or, if we force users to use our base image, about how to build on top of that (that should be easy, though). There are things like mounts, users, and such that would still need some docs.

some code (base images, and?): We already maintain the container images. Enabling BYOC doesn't imply any new code for them...

If we have to support the base images (for whatever the user might want), then that goes there; if we want to force users to base only on our base image, then we'll need a solution for that. In any case, that means some code.

make sure it runs somewhere: make sure k8s can run a container? That's pretty basic. We don't need anything special for that, no? I don't follow...

This includes making sure that this flow keeps working, and monitoring/managing all the extra resources this flow requires (storage, network, security) that we don't currently have to care about because we don't allow it. And potentially the admission controller that checks we are only running containers based on our base image.

Our registry admission controller could go away.

Fair enough, though this would be replaced by some other kind of service/controller that disallows containers not using our base image if that's the way we go.

A dependency on NFS could go away.

I don't think this is true, and will not be true without a very substantial effort (that's partially what the toolforge build service as a whole, apart from buildpacks, tries to eventually address).

So given the two sections above, it's either a small, temporary removal of the admission controller (if we allow everything), or replacing it with a different controller to make sure containers use our base image. And of course, adding all the extra work from before and the extra resource usage.

Please, see the latest update to the task description. I added a clarification regarding the target users for the BYOC proposal.

The target for BYOC are users that know how to build a docker image on their own. Creating a docker image on one's laptop is standard practice in software development. They don't need us to explain to them how to do that; there are literally millions of pages with information about this on the internet. These users will easily understand the reasoning behind the recommendation to use a Toolforge base image to build their own containers, and they know how to do so. Nonetheless, creating some specific documents on wikitech won't hurt either, but honestly the documentation side of this proposal is not a big deal.

You mention some functional requirements for containers like mounts and such. That's not how Toolforge Kubernetes works; see https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Deploying#volume_admission

Regarding NFS, we have a list of reasons why NFS exists today; they are roughly:

  1. to store tool source code
  2. to distribute tool source code to exec nodes on the grid, or worker nodes on k8s
  3. to store some user configuration, like email preferences, credentials for toolsdb, kubernetes, etc
  4. to store tool logs
  5. potentially other usages that I'm forgetting at the moment.

Enabling BYOC means dropping or reducing pressure on at least one or two items from that list. That's a net benefit. Possibly the same with buildpacks, and that's OK. The intention of this proposal is not to drop all items, but it is a nice side benefit worth considering.

Example:

  • if a user decides to use BYOC, and puts all source code in the container,
  • and if running the tool doesn't require writing information to log files, or reading config from the home directory (like toolsdb credentials or XML dumps),
  • then the tool doesn't need access to NFS/LDAP at all. They don't even need to base the container image on one of our own.

In my opinion we must evaluate BYOC as a viable short term option for Toolforge. It is cost-effective and can help us build a better case for deprecating grid engine in the short term.
If we had enabled BYOC back when I first proposed it some months ago, we would have been in a much better position today when promoting the latest grid stretch->buster migration.

So far I haven't seen any particular cons (or blocker) in your comments. Do you have concrete things in mind that we should consider as a blocker as part of this proposal?

I think this would be better discussed synchronously in the discussion meeting; I seem to be unable to convey all the concerns with adopting this solution.

Not sure if this was mentioned already, but we currently rely on the images to ensure that users can't get root access inside the container and thereby gain the ability to read and write arbitrary files of any tool (since we mount the full /data/project NFS directory); this is pretty much a hard requirement for mounting NFS inside any custom images.

In T302863#7751158, @Majavah wrote:

Not sure if this was mentioned already, but we currently rely on the images to ensure that users can't get root access inside the container and thereby gain the ability to read and write arbitrary files of any tool (since we mount the full /data/project NFS directory); this is pretty much a hard requirement for mounting NFS inside any custom images.

Kubernetes controls the runtime capabilities of the container and you cannot override them arbitrarily from within the container itself. See here for details: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/RBAC_and_PSP#Toolforge_user_policies
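
To illustrate the kind of restriction applied there (a sketch only, not the exact Toolforge policy; see the linked page for the real one), the policy style looks roughly like this:

```
# Illustrative sketch only: the style of PodSecurityPolicy restrictions that
# keep pods from running as root or escalating privileges. Not the actual
# Toolforge policy.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: example-restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  runAsUser:
    rule: MustRunAs            # force a non-root UID range (illustrative range)
    ranges:
      - min: 1000
        max: 60000
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: MustRunAs
    ranges:
      - min: 1
        max: 65535
  fsGroup:
    rule: MustRunAs
    ranges:
      - min: 1
        max: 65535
  volumes:
    - configMap
    - secret
    - emptyDir
    - hostPath                 # illustrative: the NFS paths are bind-mounted
```

Whatever user the image sets, the admission layer still decides what the pod is actually allowed to run as.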

Kubernetes controls the runtime capabilities of the container and you cannot override them arbitrarily from within the container itself. See here for details: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/RBAC_and_PSP#Toolforge_user_policies

That might soon not apply though; we will have to investigate before we move to 1.25, where PSPs are removed (https://kubernetes.io/docs/tasks/configure-pod-container/migrate-from-psp/#eliminate-non-standard-options). It might require a mutating hook.
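
For context, the built-in replacement upstream points to is the Pod Security admission controller, driven by namespace labels; whether the standard "restricted" profile covers our needs or we still need a mutating hook is exactly what would have to be investigated. A sketch of the label-based mechanism (the namespace name is hypothetical):

```
# Illustrative sketch only: the namespace labels consumed by the built-in
# Pod Security admission controller, the upstream-suggested PSP replacement.
# Whether the "restricted" profile is enough for Toolforge is an open question.
apiVersion: v1
kind: Namespace
metadata:
  name: tool-example            # hypothetical tool namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
```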

Kubernetes controls the runtime capabilities of the container and you cannot override them arbitrarily from within the container itself. See here for details: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/RBAC_and_PSP#Toolforge_user_policies

That might soon not apply though; we will have to investigate before we move to 1.25, where PSPs are removed (https://kubernetes.io/docs/tasks/configure-pod-container/migrate-from-psp/#eliminate-non-standard-options). It might require a mutating hook.

Yes, we're aware of this. Be it PSP or any other mechanism, having strict controls of what pods can/cannot do inside Toolforge Kubernetes is one of the basic pillars of the security and stability of the service. If any flaw is found here it should be fixed.

This has nothing to do with BYOC or buildpacks.

This has nothing to do with BYOC or buildpacks.

Well, buildpacks have control over which user is set at the container level, something that BYOC does not, so IMO that moves that strict control out of the PSP, making things easier in case it's not supported by the newer pod security options. So I think it's related (just longer term).

This has nothing to do with BYOC or buildpacks.

Well, buildpacks have control over which user is set at the container level, something that BYOC does not, so IMO that moves that strict control out of the PSP, making things easier in case it's not supported by the newer pod security options. So I think it's related (just longer term).

Just a side note: the PSP k8s API is being deprecated and we will migrate out of it. But we will keep the semantics.

There is no scenario being considered in which we stop enforcing pod security features from the container orchestrator point of view.
The runtime user is just a tiny part of everything that must be enforced.

But again, I don't think we should compare BYOC vs buildpack as competing options. We can enable BYOC today and buildpacks later when ready. I don't think there is mutual exclusivity here of any kind.

But again, I don't think we should compare BYOC vs buildpack as competing options. We can enable BYOC today and buildpacks later when ready. I don't think there is mutual exclusivity here of any kind.

Is this a short term solution that itself turns into a long term problem? How will we ever get rid of BYOC once it is allowed?

But again, I don't think we should compare BYOC vs buildpack as competing options. We can enable BYOC today and buildpacks later when ready. I don't think there is mutual exclusivity here of any kind.

Is this a short term solution that itself turns into a long term problem? How will we ever get rid of BYOC once it is allowed?

The same way we're getting rid of GridEngine: by offering a better solution.

See option 2 in the proposal. We could communicate clearly that the feature is time-boxed, and the time box depends on buildpacks being ready.
I don't anticipate a lot of adoption for BYOC anyway (famous last words?), for the reasons described in the "Some clarifications" section.

That being said, it would be good to write down why BYOC is a problem (or could be a long term problem).

I'm adding Option 4: Enable BYOC only for a few selected users that request it.

Similar to what we do in Cloud VPS, where we enable certain features for users who request them and can demonstrate they really need them.

A tool that uses both java & python can only run in the grid.

wd-shex-infer is such a tool. I suspect that it would be possible to port this tool to Kubernetes: run the webservice in k8s, and have it schedule multiple jobs, using different container images for the different phases of the work the tool does (as far as I remember, it’s mainly a Java phase followed by a Node phase). I don’t think I ever really need multiple runtimes at the same time; having them available on the Grid just made it easier to build up the current pipeline in the tool. That said, the tool isn’t used very often, I’m no longer particularly interested in it, and I probably wouldn’t spend the development effort to do this k8s port.

In light of this, BYOC could help to keep this tool alive: throwing together my own container image, installing packages until the pipeline no longer crashes, and then exchanging the Grid for k8s using this image, sounds to me like it should be an acceptable amount of effort, less complicated than splitting the pipeline into multiple phases and managing different jobs there. But apart from this one tool, I’m not overly excited by the prospect of BYOC: I don’t recall any recent tool ideas I’ve had that would’ve required it, and in general it seems like a good thing that updating the containers, applying security fixes etc., is left to more capable hands than mine.

That said, it’s possible that I’m so used to a non-BYOC Toolforge that I don’t even think of the kind of tool ideas anymore that it would enable; in a world where BYOC has been established for a while, maybe I would have all kinds of other interesting ideas for things to build that aren’t possible at all right now.

We had a meeting today.

The decision was to go with option 3: don't enable BYOC. We will wait for buildpacks to be ready. Also, we will regard buildpacks as a requirement for deprecating or removing grid engine from Toolforge.