In T385173, we prepared a patch (1146891) to add the wmf-debian-vllm image to the wikimedia production-images repo and, eventually, the Wikimedia docker registry. We addressed all reviews on the patch and successfully tested the image on ml-lab1002, resolving infra compatibility issues (P76252, P76288, P76290, P76308) with help from SREs. The patch is ready to merge, but the image build requires a ton of resources (see the Grafana dashboard), which may not be available on build200X. SREs advise that we build and push this image to the docker registry using ml-lab1002.
Description
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | DPogorzelski-WMF | T394778 Build and push images to the docker registry from ml-lab |
| Resolved | | DPogorzelski-WMF | T412524 New WMF docker registry credentials |
| Resolved | | DPogorzelski-WMF | T416966 Unable to pull vLLM images from Wikimedia docker registry due to authentication error |
Event Timeline
For this to work, appropriate credentials need to be on ml-lab1002 (or 1001). The future-proof way to do this would be to either apply the relevant Puppet role(s) to it, or, if that adds too much functionality/infrastructure, extract the relevant bits from that role or make that role modular as needed.
There are shortcuts we could take, but the whole docker building setup on the lab machines is already way too ad-hoc, so I would prefer to not take shortcuts.
I've written up my thoughts, and some of the things we discussed outside of this ticket regarding making vLLM images available for use with LiftWing workloads:
Current status
ML is working on developing some LLM-based services. The current candidate framework for this is vLLM. We (Kevin) have used one of the ML lab machines to build a somewhat reduced image for these purposes. Its total size is still 20+G, but the individual layers are small enough that, theoretically, it could be uploaded to the WMF registry without triggering the exhaustion of the tmpfs used by nginx.
However, this image cannot be easily built on the build host, due to lack of memory and disk space (the intermediate steps are far larger than the final artifacts). Moreover, when working on the image (change, build, test, repeat), having a GPU available on the host is basically mandatory, which further precludes use of the build host.
We are currently building this vLLM image using docker-pkg as recommended in: https://wikitech.wikimedia.org/wiki/Kubernetes/Images#Local_builds
What ML needs
ML needs a way to build (and iteratively test) Docker images in a useful way that is available to the team members (i.e. not just SRE); the images need to be uploaded to _some_ Docker registry and accessible to LiftWing, to run workloads based on said images.
Ideally, building these images is as automated as possible, though they would not change as often as the actual service images (they would be similar to the rocm-torch image, which changes at most on a several-months cadence, if we disregard the weekly-rebuild mechanism on the build hosts for a moment).
Options
We have several options for achieving the above, some of which may be combined with each other:
1. ML runs their own Docker registry
   a. This is no trivial undertaking, and will eat manpower.
   b. In some ways, this is the most flexible approach.
2. We build the image on ML-Lab
   a. This is basically a given, at a minimum during the development phase, since there are no GPUs of the kind we need (MI210 or MI300) available elsewhere.
   b. There are some questions around availability of Docker access (which is basically root-equivalent from an attacker's POV) on a machine like the lab machines, which allow access to non-ML-Team/non-SRE users.
   c. The latter part of the above point could be addressed by restricting one of the ML lab machines to ML team+SRE.
3. An SRE copies the image from ML-Lab to the build host, and uploads it to the registry from there.
   a. The 20G image compresses really well (around 2.5G with XZ), which makes the transfer feasible without direct access between lab and build machine.
   b. This would still be a manual step, but that can also be seen as a feature in avoiding polluting the registry, or escalating access.
4. We add the relevant machinery and credentials to the lab machine. This would require 2.c to be implemented, as otherwise there is a (relatively) easy escalation path from a compromised user to subverting the integrity of all images on the registry.
   a. This would (potentially) allow non-SRE ML-Team members to build and upload the vLLM image without requiring SRE assistance.
   b. Even if we do not allow non-SRE users to use the upload-to-registry credentials on the machine, it would eliminate the copy-to-buildhost step. It's slightly faster, but experiments have shown that the copy takes minutes, not hours, especially with compression (see 3.a).
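For concreteness, the copy step in option 3 could be sketched as below; the image name, host names, and registry name are assumptions for illustration, not the real values:

```shell
# Illustrative sketch of option 3: export the image on the lab host,
# compress it, move it over, and re-import it on the build host.

# On ml-lab1002:
docker save wmf-debian-vllm:latest | xz -T0 > wmf-debian-vllm.tar.xz

# Transfer (assuming no direct lab<->build path, this may go
# through an intermediate host):
scp wmf-debian-vllm.tar.xz build2001.codfw.wmnet:/tmp/

# On the build host:
xz -dc /tmp/wmf-debian-vllm.tar.xz | docker load
docker tag wmf-debian-vllm:latest docker-registry.wikimedia.org/wmf-debian-vllm:latest
docker push docker-registry.wikimedia.org/wmf-debian-vllm:latest
```

The `xz -T0` flag uses all cores for compression, which matters at the ~20G scale discussed above.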
Summary
We can probably unblock ML team's work on the vLLM services using the approach described in 3. above. In the long term, however, another method would be more maintainable. Which of the above items (and ones that I haven't thought of yet) would be combined largely depends on the available manpower for maintaining it, the desired security level, and required ease of use. Naturally, otherwise unrelated future plans for the WMF Docker registry may also influence the best approach here.
Thanks for the summary! I see two separate problems being listed:
- Have a separate Docker registry to be able to push images with compressed layers bigger than 4G. Not really pressing, since the vLLM image is in theory uploadable to the registry, but something we may consider.
- Have a place where the ML team can build, test and push images. The environment should have GPUs and enough memory/CPUs to support the LLM use cases, something that is currently difficult on the build hosts (most notably, we don't have a GPU there).
I'd personally go for option 2, restricting ml-lab to the ML team and SRE only. Note that granting this particular setting wouldn't remove the need to be really mindful about what runs on the ml-lab nodes (for example, using dockerhub images or similar), since the same prod security requirements would still hold, even if ml-admins would have root-like privileges.
Having a separate registry shouldn't be hard now that Alex refactored the puppet code to allow multiple instances on the same VMs, but I'd personally hold it for the moment.
The next step is probably to submit a proposal to the K8s-SIG for approval :)
I had a thought on this point. Have we looked into running a docker-registry as a Kubernetes service?
I was thinking that we could perhaps run a new registry on either the ml-serve or dse-k8s clusters. For its backing store, we could use the S3 interface of the cephosd cluster, and optimise this for large images.
Hi Ben! I'd personally avoid this particular road, since the Docker Registry is an essential dependency for the K8s clusters and running it on top of them seems to be risky. We do have the possibility to run separate Docker registries via puppet now (listening to different ports basically) on the registryXXXX vms, so in theory if we really need it we could add another one backed by apus/s3 easily (we are already testing something like that as a consequence of T390251). For this particular use case, I don't think that we need another registry yet, since the vLLM image should be uploadable, the most pressing problem (imho) is finding a suitable host for testing and building images with GPUs. Tobias' Option 2 could be viable, but it needs to go through a formal process etc..
Yes, I see. Thanks for the explanation. I'll follow the progress of the migration of the docker-registry to apus/s3 with interest.
I suppose that I would share Matthew's concern here: T394476#10858578 that apus might not be optimized for this use-case, but as you mention later, you're taking this into consideration and not looking for a big-bang migration.
I was thinking that the Data Platform ceph cluster might be better suited as a back-end specifically for these very large vLLM images and similar, given its hardware profile.
No strong feelings about it, but this would seem like a good fit, to me.
...the most pressing problem (imho) is finding a suitable host for testing and building images with GPUs. Tobias' Option 2 could be viable, but it needs to go through a formal process etc..
Understood. Could we do this by restricting access to the docker socket (and therefore who can build using docker), rather than restricting shell access to the machine?
Or alternatively, would podman be an option here, given that it doesn't need root?
Following up on this task as it is quite important for the proper utilization of the new GPU hosts.
My understanding is that the above point is the way that we are going to proceed and that this has been discussed and approved by the k8s SIG. That environment would be one of the ml-lab hosts with restricted access and we will proceed with Tobias's suggestion :
We build the image on ML-Lab
a. This is basically a given, at a minimum during the development phase, since there are no GPUs of the kind we need (MI210 or MI300) available elsewhere. b. There are some questions around availability of Docker access (which is basically root-equivalent from an attacker's POV) on a machine like the lab machines, which allow access to non-ML-Team/non-SRE users. c. The latter part of the above point could be addressed by restricting one of the ML lab machines to ML team+SRE.
Is restricting docker access on this host only to ML team + SRE ok? If yes, are there any other decisions that need to be made so that we can start the implementation?
My understanding is that the above point is the way that we are going to proceed and that this has been discussed and approved by the k8s SIG. That environment would be one of the ml-lab hosts with restricted access and we will proceed with Tobias's suggestion
Well, sort-of. IIRC the SIG didn't oppose to the idea, but we decided to wait for a more formal/complete plan from ML.
We build the image on ML-Lab
a. This is basically a given, at a minimum during the development phase, since there are no GPUs of the kind we need (MI210 or MI300) available elsewhere. b. There are some questions around availability of Docker access (which is basically root-equivalent from an attacker's POV) on a machine like the lab machines, which allow access to non-ML-Team/non-SRE users. c. The latter part of the above point could be addressed by restricting one of the ML lab machines to ML team+SRE.
Is restricting docker access on this host only to ML team + SRE ok? If yes, are there any other decisions that need to be made so that we can start the implementation?
I think that it is fine to limit the access in that way, but see my point above about the plan. I have a few security concerns for example:
- We should be very clear and possibly avoid any attempt to pull Docker images from github. This is a host to test images on a GPU, but at the same time it is in production.
- We shouldn't leave tests running as docker containers for too long (for example, because we forget about them), because they may become security risks. For instance, if we leave some model servers running with potentially critical security holes, we may expose those hosts to attacks. I'd suggest thinking about something that periodically cleans up the running containers.
- The ml-admins will effectively be root on these hosts, so we should be mindful about what tests are running and why (maybe it would be great to have a wikipage with a list of experiments running etc.. or anything else, just throwing an idea).
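The periodic-cleanup idea above could be as simple as a small script run from a systemd timer. This is only an illustrative sketch; the 72-hour threshold and the invocation mechanism are assumptions:

```shell
#!/bin/bash
# Illustrative sketch: stop containers that have been running longer
# than a threshold. Intended to be invoked periodically, e.g. from a
# systemd timer. The 72h threshold is an arbitrary assumption.
MAX_AGE_SECONDS=$((72 * 3600))
now=$(date +%s)
for id in $(docker ps -q); do
    # StartedAt is an RFC3339 timestamp; GNU date can parse it directly.
    started=$(docker inspect -f '{{.State.StartedAt}}' "$id")
    age=$(( now - $(date -d "$started" +%s) ))
    if [ "$age" -gt "$MAX_AGE_SECONDS" ]; then
        echo "Stopping long-running container ${id} (up for ${age}s)"
        docker stop "$id"
    fi
done
```

A variant could merely alert (e.g. email ml-admins) instead of stopping, to avoid killing a legitimate long experiment.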
Until we reach a proper way to build-push & update these images can we have an image in the registry to unblock us from starting to deploy services and iterate on them? I'm talking about the image that has been also built on ml-lab and is described in this patch https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1146891
Pushing from ml-lab should be done with a basic set of configs puppetized and accepted by the K8s SIG; we cannot really allow it while the host relies on manual/ad-hoc configs, in my opinion. The work to do is not a lot: it is just getting the green light from the K8s SIG about this particular step and then puppetizing docker etc. on ml-lab (plus a proper review of who can ssh to the host, etc.).
I would like to resume this discussion and take a practical stab at making the ml-lab1001.eqiad.wmnet an official build machine for the ML team.
Let's start with the basics:
- wipe the machine and manage the basics with puppet
- the machine will have docker installed
- the machine will be enrolled into gitlab as a gitlab runner
- the machine should be able to push images to the current WMF registry ( we can go back to investigate a proper registry solution once the build machine is ready otherwise there are too many topics flying around)
- SSH root access for ML SREs and non-root access for the ML team. This, however, should be an exception: most of the time the builder can be used via plain GitLab pipelines, so SSH shouldn't be needed. We can repurpose the other lab machine down the road as an experiment playground, one that is not allowed to publish any image anywhere, so that the ML team can actually experiment with build steps more freely (WMF needs to learn to trust the people it hires, and security needs to work in function of the teams/projects, not the other way around)
If the above is fine, I'm going to start looking at the first steps.
Feel free to comment or add interested parties to the discussion.
Thank you for picking this up, @DPogorzelski-WMF. If you proceed with the plan to wipe ml-lab1001, could you please move the contents of my (and/or other people's) home directory to ml-lab1002? Thanks in advance.
@DPogorzelski-WMF I think the plan is good, I have only a few further questions:
- IIUC ml-lab1001 will become a Trusted GitLab Runner, able to build anything under the gitlab namespace. There is another use case: the production-images repo (for example, https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1146891 needs docker-pkg and a ton of GPU/RAM resources not available on standard build nodes). I am not 100% sure if big/complex base images can be expressed as gitlab repos, due to the reduced control over what ends up in each layer (to keep its size under the limit imposed by the current registry). We could think about allowing ml-lab1001's docker-pkg to push to the registry, but we'd need some guarantees about what/how it ends up being pushed, etc. Do you have ideas about this one? Also, I am very ignorant about gitlab runners, so if there is an easier way out of this lemme know :)
- Docker image testing - Is the node going to be used to also test images being built on it, before shipping them to the registry? Seems ok as use case if there are strict restrictions on what is being tested (never pull anything from dockerhub for example, and use only WMF-owned config/code etc..) and if there is a cleanup policy for containers being left around running (say somebody creates an inference service container with experimental code and leaves it running indefinitely after testing).
- Let's start with having the machine wiped and configured for ML team access, docker-pkg installed, and the host whitelisted to push to the WMF registry; we can take the GitLab enrollment in a second step. Just so you know, though, a GitLab runner can be tied to specific groups or even specific repos, making it unavailable for anyone/anything outside of that scope. So this won't be a shared runner but rather an ML-only one; in other words, it would only accept jobs from ML-specific repos and push to the WMF registry. Regarding "making sure no weird stuff is pushed to the internal registry", I don't have an immediate solution beyond due diligence and GitLab CI steps blocking merge requests containing images from external sources. On a related note though, AFAIK we still use pip to install Python dependencies from outside, so we are not fully isolated/immune to supply chain issues.
- We could simply set `"ip": "127.0.0.1"` in /etc/docker/daemon.json so that containers can't be bound to 0.0.0.0; this effectively disarms anything left running. Also, the machine is not exposed to the outside, AFAIK.
- We can also block access to major public registries in the http_proxy or via iptables on the host:

```
iptables -A OUTPUT -d your-internal-registry.com -j ACCEPT
iptables -A OUTPUT -d registry-1.docker.io -j REJECT
iptables -A OUTPUT -d index.docker.io -j REJECT
iptables -A OUTPUT -d quay.io -j REJECT
iptables -A OUTPUT -d ghcr.io -j REJECT
iptables -A OUTPUT -d gcr.io -j REJECT
```
While iptables rules can be changed by people, I trust everyone in the team, so this is mostly to prevent shooting ourselves in the foot and pulling from outside by accident.
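For reference, the daemon.json tweak mentioned above would be a one-line config change (assuming the stock config path /etc/docker/daemon.json; the daemon must be restarted for it to take effect):

```json
{
    "ip": "127.0.0.1"
}
```

This sets the default host IP used when publishing container ports, so `-p 8000:8000` binds to loopback instead of 0.0.0.0.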
+1 on all, seems a good plan, not sure if a higher level approach for blocking registries compared to iptables is available, but that is something that can be investigated later on.
About docker-pkg: we run it on build2002 or build2001, the build/push process is pretty much this one:
```
sudo -i
cd /srv/images/production-images
git pull
/usr/local/bin/build-production-images
```
There is a .docker/config.json auth config file in the root's homedir that grants the necessary credentials to push to the registry, so only root can effectively do it. On build2001 we also have the production-images-weekly-rebuild.timer, that is responsible to rebuild the images and push their updated versions to the registry every week. The timer is also responsible to commit these changes to the production-images repo, you can check a random image's changelog to see the result.
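For context, such a credentials file has the standard Docker client shape, roughly as follows (the values here are placeholders, not the real credentials):

```json
{
    "auths": {
        "docker-registry.wikimedia.org": {
            "auth": "<base64 of user:password>"
        }
    }
}
```

Since it lives in root's home directory with restrictive permissions, only root can push.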
Having said that, I think we need to decide what to do with ml-lab1001, since IIUC you just need to push certain images like vllm without necessarily caring about the weekly rebuild, given how big the ML images are. So there are two options:
- You run docker-pkg on ml-lab1001 using a restriction on the images to build, but still using the production-images repo.
- You create a new production-images-like repository in gerrit and commit to it only ML Dockerfiles, with their changelogs etc..
The latter is probably safer and cleaner, but I'd ask to the K8s SIG what is the preference since you are the first one doing it (dropping a note in the IRC channel is enough to trigger a conversation, as I mentioned before).
Another thing to think about: we have been using docker-pkg so far to have more control over the Dockerfiles for big base images like the vLLM or pytorch ones, but it may not be what you want/need in the future. The main problem to face is that the Docker registry only supports a certain maximum size of compressed layers, but that is a problem of the current implementation (docker-distribution using OpenStack Swift's client). It may not be the same if we experiment with other solutions like docker-distribution and S3/Ceph for a dedicated ML registry. This issue can surely be tackled in a later iteration, but keep it in mind when planning the work: I suggested docker-pkg since we have been using it so far, but there are probably other roads to explore. To unblock Kevin's patch and vLLM's testing in general, docker-pkg looks the easiest.
Cool, I'll shoot a message in IRC to the SIG regarding "You create a new production-images-like repository in gerrit and commit to it only ML Dockerfiles, with their changelogs etc.."
inference-services repo will most likely move to gitlab so we can probably store the specific ML dockerfiles there.
also good point about docker-pkg, let's keep that for now but probably not needed down the road.
I will tar/gzip all home folders separately on 01 and copy them into the corresponding home folders on 02. Individual users can then untar and pick what they need.
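That per-user archive-and-copy step could look roughly like the sketch below; it assumes root access on both hosts, and the destination hostname is an assumption:

```shell
# Illustrative: archive each home directory separately on ml-lab1001
# and drop the archive into the matching home on ml-lab1002, so each
# user can untar and pick what they need.
for d in /home/*/; do
    user=$(basename "$d")
    tar -C /home -czf "/tmp/${user}-home.tar.gz" "$user"
    scp "/tmp/${user}-home.tar.gz" "ml-lab1002.eqiad.wmnet:/home/${user}/"
done
```

Using `tar -C /home` keeps the archive paths relative, so extraction lands in the right place regardless of where it is untarred from.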
Thanks for the nice discussion everyone. Overall, I think with the suggestion of building images on a dedicated ML machine and with the precautions discussed, we are OK with moving forward and unblocking this.
The machines will need specific Docker configuration anyway (setting up the proxy for all operations) to be able to reach the outside, so this is probably not needed. And if someone decides to mess with the configuration (which requires root) and fetch outside images, no iptables rule would save us.
It would be nice if it could also push only under a specific hierarchy, e.g. /repos/<insert-start-of-ml-hierarchy>/. (/repos being the start of the Gitlab managed hierarchy of Docker images IIRC). We already have /releng (and dedicated username/password pairs for that) so there is prior art.
regarding "making sure no weird stuff is pushed to the internal registry" i don't have an immediate solution beyond: due diligence, Gitlab CI steps blocking merge requests containing images from external sources. on a related note though, afaik we still use pip to install python dependencies from outside so we are not fully isolated/immune to the supply chain issues
I don't think anyone can get fully immune to supply chain issues. Nor is that the goal. The goal is more like: make sure we retain control of our artifacts, so that we can rebuild them in an emergency without relying on the good will of others. To make it practical: force-rebuilding an image with a Python package that was discovered to be compromised and for which a fix has been released on PyPI is completely different from having to wait, for an unknown and arbitrary amount of time, for someone else to rebuild an image.
There are a number of supply chain security improvements that can happen (some already have happened/are happening):
- SBOMs
- signatures
- provenance
- enforced non-root execution
- frequent periodic updates
and the list goes on.
As always it's defense in depth, there is no silver bullet.
- we could simply set ip: 127.0.0.1 in /etc/docker/daemon.json so that you can't bind containers to 0.0.0.0, this effectively disarms anything left running, also the machine is not exposed to the outside afaik.
That would be nice. It's not critical, but defense in depth is always a good idea.
Can you add some more information as to why? As in does production-images not have sufficient rights? Do you intend to add images that just don't belong in the first one? Why would that be?
I am not against the idea, but it should be clear why and I am not sure I have understood.
inference-services repo will most likely move to gitlab so we can probably store the specific ML dockerfiles there.
Do I understand correctly that this is unrelated to the above point?
also good point about docker-pkg, let's keep that for now but probably not needed down the road.
Generally speaking, try to have Dockerfiles evaluated only by docker-pkg for things that will run in production. Not because docker-pkg is an awesome tool or anything, but because Dockerfiles are infamous for being super easy to get wrong in a variety of ways (this is why Blubber exists), and having them use the exact same tool as everything else increases the chance that someone will look into them and fix them, just because of familiarity.
I'll look into it.
Can you add some more information as to why? As in does production-images not have sufficient rights? Do you intend to add images that just don't belong in the first one? Why would that be?
I am not against the idea, but it should be clear why and I am not sure I have understood.
To be fair I can start with just using production-images to get the ball rolling. In the long terms and from an overall point of view it would be great to converge and centralize ML workflows/tools/code around a single place such as gitlab to streamline the dev experience, but this is for another time maybe.
inference-services repo will most likely move to gitlab so we can probably store the specific ML dockerfiles there.
Do I understand correctly that this is unrelated to the above point?
It is related but let's skip on this for now and just get a functional builder up.
Change #1213972 had a related patch set uploaded (by Dpogorzelski; author: Dpogorzelski):
[operations/puppet@production] ml-build: define new machine name/type
Change #1214530 had a related patch set uploaded (by Klausman; author: Klausman):
[operations/puppet@production] installserver/partman: Add custom recipe for ml-build1001
Change #1213972 merged by Dpogorzelski:
[operations/puppet@production] ml-build: define new machine name/type
Change #1214530 abandoned by Klausman:
[operations/puppet@production] installserver/partman: Add custom recipe for ml-build1001
Reason:
Superseded by change 1213972
@akosiaris I suggested the idea since there is always the risk of pushing the same set of images (even non-ML ones, by accident) from two places (build hosts and ml-build). In theory we should be able to set a filter for docker-pkg to push only ML-related images on the ML build hosts, but it may be safer to just have a separate repo, in my opinion (so we keep things really separate). If it is too big of a concern, we can proceed with production-images, but we'll need to check it out on the ml-build hosts and then always use docker-pkg in a controlled way (maybe a wrapper script could be sufficient).
Change #1219552 had a related patch set uploaded (by Dpogorzelski; author: Dpogorzelski):
[operations/docker-images/production-images@master] ml: add ml specific config
Adding docker-pkg config specific to the ML namespace instead of using a separate repo, since we have dependencies here that we rely on.
Change #1219552 merged by Dpogorzelski:
[operations/docker-images/production-images@master] ml: add ml specific config
Change #1224091 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/puppet@production] profile::docker_registry: add the ML instance
Change #1224091 merged by Elukey:
[operations/puppet@production] profile::docker_registry: add the ML instance