
Cloud services enhancement proposal: Single place to configure Toolforge container images
Closed, ResolvedPublic

Description

Proposal Title: Single place to configure Toolforge container images

Brief description: Establish a single location from which all Toolforge tooling (webservice, jobs-framework, ...) pulls the list of available images.

Why: See linked design doc.

Risks:

  • Additional/unnecessary code complexity.
  • Depending on the pull model, changes to the list might be tedious if they require rebuilding several components (the current problem with webservice, but worse)

Design documentation: https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Toolforge_container_image_configuration

More info: See linked design doc.

Event Timeline

What about putting it in a k8s config?

I've read this proposal and I think this could be beneficial overall.

Some comments:

The whole workflow for maintaining container images is tedious and could benefit from some automation, polishing and more best practices (like reducing usage of :latest and such), but that's out of scope for this proposal I guess :-)

Mind that all of this should be temporary in a way, no? Once we have buildpacks, perhaps usage of this whole thing will drop considerably.

Initial POC based on a config map that everyone can read: https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config.

This works like this:

toolsbeta.test@toolsbeta-sgebastion-05:~$ kubectl get configmap -n tf-public image-config 
NAME           DATA   AGE
image-config   1      2m28s
toolsbeta.test@toolsbeta-sgebastion-05:~$ kubectl get configmap -n tf-public test
Error from server (Forbidden): configmaps "test" is forbidden: User "test" cannot get resource "configmaps" in API group "" in the namespace "tf-public"

I think we should move forward with this. We can have another spin when harbor is ready for prime time.

Change 861847 had a related patch set uploaded (by Majavah; author: Majavah):

[cloud/toolforge/jobs-framework-api@main] Use the shared image-config configmap

https://gerrit.wikimedia.org/r/861847

aborrero moved this task from Inbox to Implementation on the Cloud Services Proposals board.

I've been thinking on the data model and I have a few comments.

Take this chunk as example:

jdk8:
  state: deprecated
  aliases:
    - tf-jdk8
    - tf-jdk8-DEPRECATED

  variants:
    jobs-framework:
      image: docker-registry.tools.wmflabs.org/toolforge-jdk8-sssd-base
    webservice:
      image: docker-registry.tools.wmflabs.org/toolforge-jdk8-sssd-web

How will we present this information to the CLI and/or end users? In particular, what about aliases and/or deprecation status?

As of today, for both jobs and webservices we need to end up with a clear shortname <-> URL mapping. Using the example above, would that translate into this?

for jobs:

shortname            URL
jdk8                 docker-registry.tools.wmflabs.org/toolforge-jdk8-sssd-base
tf-jdk8              docker-registry.tools.wmflabs.org/toolforge-jdk8-sssd-base
tf-jdk8-DEPRECATED   docker-registry.tools.wmflabs.org/toolforge-jdk8-sssd-base

for webservices:

shortname            URL
jdk8                 docker-registry.tools.wmflabs.org/toolforge-jdk8-sssd-web
tf-jdk8              docker-registry.tools.wmflabs.org/toolforge-jdk8-sssd-web
tf-jdk8-DEPRECATED   docker-registry.tools.wmflabs.org/toolforge-jdk8-sssd-web

If so, is the intended semantic of the state: deprecated tag to show a warning to the users when using one of the affected images?
Also, this feels like a lot of duplicated information. What is the meaning of the aliases spec? Should all the aliases names be printed as available? Or perhaps just accepted silently without offering them to the users?

I have a proposal:

jdk8:
  state: [deprecated|stable|default]
  variants:
    jobs-framework:
      image: docker-registry.tools.wmflabs.org/toolforge-jdk8-sssd-base
    webservice:
      image: docker-registry.tools.wmflabs.org/toolforge-jdk8-sssd-web

With that information, then we could update the clients (WS and jobs) to:

  • in jobs, print the -DEPRECATED tag next to the shortname based on the state content.
  • in jobs, generate the image name as tf-$whatever to keep it similar to what is present today. We should consider dropping the stupid tf- prefix, that I introduced, later down the road anyway.
  • we could also introduce state: default to indicate what is the default in WS. Or redefine that semantics anyway.

I think action items are:

  • clearly define the semantics of the state: field, or remove it from the data/schema
  • clearly define the semantics of the aliases field, or remove it from the data/schema
  • define what to do with the 'default' container image in WS. If we come up with good semantics, we could incorporate the concept into jobs as well.

Thank you for the feedback!

How will we present this information to the CLI and/or end users? In particular, what about aliases and/or deprecation status?

My vision is that names will work as they did before: each container has one primary name, and the list of available images, the status of an individual job, etc. all use that single name.

The primary reason I am introducing aliases is to avoid breaking existing configurations. Since the jobs framework and webservice tooling have used different names, it's not possible to keep existing configurations working without either causing user-visible breakage or having an alias mechanism like this. It's also not possible to just have multiple entries for the same image, since the image URL -> name mechanism needs to be deterministic.
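To illustrate the determinism point, here is a minimal sketch of an image URL -> name lookup. It assumes the config has already been loaded into a plain dict shaped like the jdk8 example in this task; the names IMAGE_CONFIG and build_reverse_map are hypothetical and not taken from the actual patches.

```python
# Hypothetical in-memory copy of the image config, following the jdk8
# example from this task. How it gets loaded is out of scope here.
IMAGE_CONFIG = {
    "jdk8": {
        "state": "deprecated",
        "aliases": ["tf-jdk8", "tf-jdk8-DEPRECATED"],
        "variants": {
            "jobs-framework": {
                "image": "docker-registry.tools.wmflabs.org/toolforge-jdk8-sssd-base",
            },
            "webservice": {
                "image": "docker-registry.tools.wmflabs.org/toolforge-jdk8-sssd-web",
            },
        },
    },
}


def build_reverse_map(config, variant):
    """Map image URL -> primary name for one variant.

    Raises ValueError if two entries claim the same image, because the
    reverse lookup would then have no single correct answer.
    """
    reverse = {}
    for name, entry in config.items():
        image = entry.get("variants", {}).get(variant, {}).get("image")
        if image is None:
            continue  # entry is not offered for this variant
        if image in reverse:
            raise ValueError(
                f"{image} is claimed by both {reverse[image]!r} and {name!r}"
            )
        reverse[image] = name
    return reverse
```

With aliases kept inside a single entry, each URL maps back to exactly one primary name. Adding tf-jdk8 as a second top-level entry pointing at the same image would make this sketch raise.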

As of today, for both jobs and webservices we need to end up with a clear shortname <-> URL mapping. Using the example above, would that translate into this?

The current webservice implementation doesn't need a backwards mapping. But yes, when parsing user input I'd expect the tables to look like that. For the listing of available containers I only expect to see the entry for jdk8.
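As a rough sketch of how those input-parsing tables could be derived from the config, assuming it is available as a plain dict shaped like the jdk8 example (CONFIG and build_lookup are hypothetical names used only for illustration):

```python
# Hypothetical dict mirroring the jdk8 example from this task.
CONFIG = {
    "jdk8": {
        "state": "deprecated",
        "aliases": ["tf-jdk8", "tf-jdk8-DEPRECATED"],
        "variants": {
            "jobs-framework": {
                "image": "docker-registry.tools.wmflabs.org/toolforge-jdk8-sssd-base",
            },
            "webservice": {
                "image": "docker-registry.tools.wmflabs.org/toolforge-jdk8-sssd-web",
            },
        },
    },
}


def build_lookup(config, variant):
    """Map every accepted shortname (primary name or alias) to an image URL."""
    lookup = {}
    for name, entry in config.items():
        image = entry.get("variants", {}).get(variant, {}).get("image")
        if image is None:
            continue  # entry is not offered for this variant
        for shortname in [name, *entry.get("aliases", [])]:
            if shortname in lookup:
                raise ValueError(f"shortname {shortname!r} is defined twice")
            lookup[shortname] = image
    return lookup


jobs_lookup = build_lookup(CONFIG, "jobs-framework")
web_lookup = build_lookup(CONFIG, "webservice")
```

Both lookups accept jdk8, tf-jdk8 and tf-jdk8-DEPRECATED, while a listing of available containers would iterate over the top-level keys only.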

If so, is the intended semantic of the state: deprecated tag to show a warning to the users when using one of the affected images?

Yes, like the deprecation flag in webservice does today.

Also, this feels like a lot of duplicated information. What is the meaning of the aliases spec? Should all the aliases names be printed as available? Or perhaps just accepted silently without offering them to the users?

See above, it's there to avoid breakage for existing configurations.

With that information, then we could update the clients (WS and jobs) to:

  • in jobs, print the -DEPRECATED tag next to the shortname based on the state content.

yes.

in jobs, generate the image name as tf-$whatever to keep it similar to what is present today. We should consider dropping the stupid tf- prefix, that I introduced, later down the road anyway.

This is not that simple, since there are other minor differences (think python39 or python3.9).

we could also introduce state: default to indicate what is the default in WS. Or redefine that semantics anyway.

I'd prefer to drop the concept of a 'default image' instead.

I think action items are:

These action items sound good. Do my answers above clarify what's currently meant or do you have further concerns?

These action items sound good. Do my answers above clarify what's currently meant or do you have further concerns?

Ok, thanks I think I'm all set.

On the deprecation logic, what about introducing these rules:

  • image listing only shows up-to-date images (the canonical names; no deprecated images, no aliases or the like)
  • we silently accept alias shortnames when creating jobs/WS, until we eventually decide to drop them from the list (enforcing some policy we don't have today but that may exist in the future).
  • we silently print back shortnames + a deprecation notice when listing {jobs|WS status} if we find a deprecated docker image
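The three rules above could be sketched roughly as follows. This is only an illustration under assumptions: the dict layout follows the jdk8 example from this task, the jdk17 entry and all function names are hypothetical, and the exact wording of the deprecation notice is made up.

```python
# Hypothetical config: jdk8 follows the example in this task, jdk17 is
# invented purely to have a non-deprecated entry to list.
CONFIG = {
    "jdk8": {
        "state": "deprecated",
        "aliases": ["tf-jdk8", "tf-jdk8-DEPRECATED"],
    },
    "jdk17": {
        "state": "stable",
        "aliases": ["tf-jdk17"],
    },
}


def list_images(config):
    """Rule 1: list canonical names of non-deprecated images only."""
    return sorted(
        name for name, entry in config.items()
        if entry.get("state") != "deprecated"
    )


def resolve(config, shortname):
    """Rule 2: silently accept a canonical name or any of its aliases."""
    for name, entry in config.items():
        if shortname == name or shortname in entry.get("aliases", []):
            return name
    raise KeyError(f"unknown image: {shortname}")


def describe(config, shortname):
    """Rule 3: print the canonical name, flagging deprecated images."""
    name = resolve(config, shortname)
    if config[name].get("state") == "deprecated":
        return f"{name} (deprecated)"
    return name
```

In this sketch an alias never shows up in any output: it is resolved to the canonical name on input, and the listing and status views only ever print canonical names.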

Change 881875 had a related patch set uploaded (by Majavah; author: Majavah):

[cloud/toolforge/jobs-framework-cli@master] Adjust for image name changes

https://gerrit.wikimedia.org/r/881875

Change 861847 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-api@main] Use the shared image-config configmap

https://gerrit.wikimedia.org/r/861847

Change 881875 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-cli@master] Adjust for image name changes

https://gerrit.wikimedia.org/r/881875

Change 883261 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/software/tools-webservice@master] kubernetes: Use the shared image-config configmap

https://gerrit.wikimedia.org/r/883261

Change 883261 merged by jenkins-bot:

[operations/software/tools-webservice@master] kubernetes: Use the shared image-config configmap

https://gerrit.wikimedia.org/r/883261

taavi claimed this task.