
Move all Machine Learning Docker images under the /ml prefix in the Docker Registry
Open, Medium, Public

Description

After T420978 we should move all the other Docker images deployed on ml-serve clusters to the /ml prefix in the Docker registry. The procedure is the same as T420978, but this time I'll create a script to do the move since it will involve more images.

Overall, I would proceed in this way:

  • Identify which images we want to move to S3. We don't yet have a way to do a deep cleanup on Swift/Docker-Distribution, so the registry still holds ML images from the very beginning. We should move only the most recent/used ones and discard the rest. I'll post a list here so people can validate it.
  • Copy images over to the S3 backend.
  • Switch deployment-charts and inference-services repos to use the new naming scheme (basically adding /ml in the image names).
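
As a rough illustration of the identify/copy steps, here is a minimal sketch of how a script could build the list of source and destination references. The registry host comes from the blubber.yaml references in this task; the helper names (`ml_name`, `copy_plan`) and the exact target layout (a plain `ml/` prefix in front of the existing name) are assumptions for illustration, not the actual script. Tags are left out on purpose.

```python
# Hypothetical sketch: helper names and the "ml/" target layout are assumptions.
REGISTRY = "docker-registry.wikimedia.org"
ML_PREFIX = "ml/"


def ml_name(image: str) -> str:
    """Return the image name under the ml/ prefix (no-op if already there)."""
    if image.startswith(ML_PREFIX):
        return image
    return ML_PREFIX + image


def copy_plan(images, skip=()):
    """Yield (source, destination) references for every image not in skip."""
    skipped = set(skip)
    for image in images:
        if image in skipped:
            continue
        yield f"{REGISTRY}/{image}", f"{REGISTRY}/{ml_name(image)}"


if __name__ == "__main__":
    # Print the plan only; the actual copy tool is not specified in this task.
    for src, dst in copy_plan(["amd-pytorch25", "amd-pytorch21"],
                              skip=["amd-pytorch21"]):
        print(f"{src} -> {dst}")
```

Printing the plan first keeps the list easy to review before anything is copied.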

Event Timeline

elukey triaged this task as Medium priority.

Current ML images:

amd-gpu-tester
amd-pytorch-common
amd-pytorch21
amd-pytorch22
amd-pytorch23
amd-pytorch25
repos/machine-learning/ml-pipelines
wikimedia/machinelearning-liftwing-inference-services-article-country
wikimedia/machinelearning-liftwing-inference-services-article-descriptions
wikimedia/machinelearning-liftwing-inference-services-articlequality
wikimedia/machinelearning-liftwing-inference-services-edit-check
wikimedia/machinelearning-liftwing-inference-services-embeddings
wikimedia/machinelearning-liftwing-inference-services-huggingface
wikimedia/machinelearning-liftwing-inference-services-langid
wikimedia/machinelearning-liftwing-inference-services-llm
wikimedia/machinelearning-liftwing-inference-services-logo-detection
wikimedia/machinelearning-liftwing-inference-services-nsfw
wikimedia/machinelearning-liftwing-inference-services-ores-legacy
wikimedia/machinelearning-liftwing-inference-services-ores-migration
wikimedia/machinelearning-liftwing-inference-services-outlink
wikimedia/machinelearning-liftwing-inference-services-outlink-cache-adapter
wikimedia/machinelearning-liftwing-inference-services-outlink-transformer
wikimedia/machinelearning-liftwing-inference-services-policy-violation
wikimedia/machinelearning-liftwing-inference-services-policy-violation-cope-a-9b
wikimedia/machinelearning-liftwing-inference-services-policy-violation-gpt-oss-safeguard
wikimedia/machinelearning-liftwing-inference-services-qwen36
wikimedia/machinelearning-liftwing-inference-services-readability
wikimedia/machinelearning-liftwing-inference-services-reference-quality
wikimedia/machinelearning-liftwing-inference-services-revertrisk
wikimedia/machinelearning-liftwing-inference-services-revertrisk-multilingual
wikimedia/machinelearning-liftwing-inference-services-revertrisk-wikidata
wikimedia/machinelearning-liftwing-inference-services-revise-tone-task-generator
wikimedia/machinelearning-liftwing-inference-services-revscoring

Note:

  • The vllm images are already under the /ml prefix
  • ORES legacy and recommendation-api are technically services that could also run on Wikikube, so I am not 100% sure whether we want to port them too.

From this quick check it seems that amd-pytorch21 and amd-pytorch22 are not used:

~/Wikimedia/inference-services$ git grep amd-pytorch
.pipeline/edit_check/blubber.yaml:base: docker-registry.wikimedia.org/amd-pytorch23:2.3.0rocm6.0-3-20250511
.pipeline/huggingface/blubber.yaml:base: docker-registry.wikimedia.org/amd-pytorch23:2.3.0rocm6.0-2
.pipeline/llm/blubber.yaml:base: docker-registry.wikimedia.org/amd-pytorch25:2.5.1rocm6.1-1
.pipeline/revertrisk/multilingual.yaml:base: docker-registry.wikimedia.org/amd-pytorch25:2.5.1rocm6.1-1
.pipeline/revertrisk_wikidata/blubber.yaml:base: docker-registry.wikimedia.org/amd-pytorch25:2.5.1rocm6.1-1-20260125
.pipeline/revise_tone_task_generator/blubber.yaml:base: docker-registry.wikimedia.org/amd-pytorch23:2.3.0rocm6.0-3-20250511
Makefile:# install torch (used by edit-check and likely other model-servers that host LLMs or rely on docker-registry.wikimedia.org/amd-pytorch25:2.5.1rocm6.*)
~/Wikimedia/production-images$ git grep amd-pytorch | grep control
images/amd/pytorch-common/control:Package: amd-pytorch-common
images/amd/pytorch21/control:Package: amd-pytorch21
images/amd/pytorch21/control:Build-Depends: amd-pytorch-common
images/amd/pytorch22/control:Package: amd-pytorch22
images/amd/pytorch22/control:Build-Depends: amd-pytorch-common
images/amd/pytorch23/control:Package: amd-pytorch23
images/amd/pytorch23/control:Build-Depends: amd-pytorch-common
images/amd/pytorch25/control:Package: amd-pytorch25
images/amd/pytorch25/control:Build-Depends: amd-pytorch-common
ml/vllm014/control:Build-Depends: amd-pytorch-common
ml/vllm085/control:Build-Depends: amd-pytorch-common
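
The same comparison can be done mechanically. The sketch below assumes the two `git grep` outputs above are available as strings; the function names are hypothetical. It collects the base images referenced in `base:` lines and the `Package:` names from the control files, then reports packaged images that nothing uses as a base.

```python
import re

# Matches "base: docker-registry.wikimedia.org/amd-pytorchNN:<tag>" lines.
BASE_RE = re.compile(r"base:\s*docker-registry\.wikimedia\.org/(amd-pytorch\d+):")
# Matches "Package: amd-pytorchNN" lines (amd-pytorch-common is not versioned
# and is deliberately not matched).
PACKAGE_RE = re.compile(r"Package:\s*(amd-pytorch\d+)\b")


def used_base_images(grep_output: str) -> set[str]:
    """Versioned amd-pytorch images referenced as a Blubber base image."""
    return set(BASE_RE.findall(grep_output))


def packaged_images(control_output: str) -> set[str]:
    """Versioned amd-pytorch images defined in production-images."""
    return set(PACKAGE_RE.findall(control_output))


def unused_images(grep_output: str, control_output: str) -> list[str]:
    """Packaged images that no blubber.yaml uses as a base, sorted by name."""
    return sorted(packaged_images(control_output) - used_base_images(grep_output))
```

This only looks at `base:` lines in inference-services, so an image used elsewhere (for example as a `Build-Depends` of another production image) would still need a manual check.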

The following are probably not needed either:

  • wikimedia/machinelearning-liftwing-inference-services-ores-migration
  • wikimedia/machinelearning-liftwing-inference-services-nsfw
  • wikimedia/machinelearning-liftwing-inference-services-outlink-cache-adapter

@achou opinions? :)

@elukey Thanks for putting this together! The list looks good to me.

  • amd-pytorch21 / 22
  • machinelearning-liftwing-inference-services-ores-migration
  • machinelearning-liftwing-inference-services-nsfw
  • machinelearning-liftwing-inference-services-outlink-cache-adapter

Confirmed these are not used, so +1 to skipping all of them.

Two additions that we recently merged:

  • wikimedia/machinelearning-liftwing-inference-services-policy-violation-cope-b-a4b
  • wikimedia/machinelearning-liftwing-inference-services-editing-suggestions

Should they be included in the move?

I had a chat with Aiko yesterday and she raised a very good point: how easy is it to move Blubber-based image names and prefixes to the new /ml prefix? We quickly checked and it seems a little cumbersome, so the new proposal is to treat any Docker image whose name starts with wikimedia/machinelearning- as something to route to the ML S3 backend in the Docker Registry.
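
To make the proposed rule concrete, here is a small Python sketch of the routing decision. It is not the Nginx configuration: the backend labels (`ml-s3`, `default`) and function names are hypothetical, and the inclusion of the existing `ml/` prefix next to `wikimedia/machinelearning-` is an assumption based on the vllm images already living there. The repository name is extracted from Docker Registry HTTP API v2 paths such as `/v2/<name>/manifests/<reference>`.

```python
import re

# Repository name prefixes routed to the ML backend (assumed set).
ML_REPO_PREFIXES = ("ml/", "wikimedia/machinelearning-")

# Registry v2 paths look like /v2/<name>/manifests/<ref>, /v2/<name>/blobs/...
# or /v2/<name>/tags/list; <name> may itself contain slashes.
REPO_RE = re.compile(r"^/v2/(?P<name>.+?)/(?:manifests|blobs|tags)/")


def repository_from_path(path: str):
    """Return the repository name from a v2 API path, or None if absent."""
    match = REPO_RE.match(path)
    return match.group("name") if match else None


def backend_for(path: str) -> str:
    """Pick a (hypothetical) backend label for a registry request path."""
    repo = repository_from_path(path)
    if repo is not None and repo.startswith(ML_REPO_PREFIXES):
        return "ml-s3"
    return "default"
```

With a prefix match like this, Blubber-built images keep their current names, and any request whose repository name matches neither prefix falls through to the default backend.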

Change #1299531 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] WIP - docker_registry: introduce migration backends in Nginx

https://gerrit.wikimedia.org/r/1299531

Change #1299531 abandoned by Elukey:

[operations/puppet@production] WIP - docker_registry: introduce migration backends in Nginx

https://gerrit.wikimedia.org/r/1299531