Dev images registry
Open, Stalled, Needs TriagePublic
Actions

Assigned To

None

Authored By

	jeena
	Sep 16 2020, 4:06 PM

Description

This task is to discuss creating a registry for dev images.

Why:

The production image registry is reportedly hard to garbage collect and if dev images are published they could take up too much space
Recent gerrit activity shows that teams intend to publish dev images
- https://gerrit.wikimedia.org/r/c/releng/dev-images/+/626739
- https://gerrit.wikimedia.org/r/c/wikibase/termbox/+/627414
It is more convenient to have CI publish these images than everyone build them, and docker images are not 100% reproducible. Downloaded packages could change.

Options:

Figure out how to garbage collect images without tags from production registry more easily
Create a new registry for dev images
Do not allow CI to build dev images

Related Objects

Mentioned Here: T242604: Remove obsoleted docker images

Event Timeline

jeena created this task.Sep 16 2020, 4:06 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 16 2020, 4:06 PM

Aklapper added a project: dev-images.Sep 16 2020, 4:11 PM

jeena edited projects, added Release-Engineering-Team (Pipeline); removed dev-images.Sep 16 2020, 4:16 PM

I'm pretty sure I have seen other services having dev image published by the pipeline or at least with a tag similar to "latest". Here is kask with "stable" tag https://gerrit.wikimedia.org/r/c/mediawiki/services/kask/+/617564/1#message-a665dccdd07970b3e826c6d7e2ed12182d7e0ec6

I 'd suggest that we first define what we mean by "dev" images. Put differently, what would make an image "dev"?

The fact a developer built it and pushed it to the registry?
The fact that the dockerfile that builds it contains development related files (headers, libraries, etc)?
The use cases?
A docker tag? Added perhaps by the pipeline?

It is more convenient to have CI publish these images than everyone build them, and docker images are not 100% reproducible. Downloaded packages could change.

Exactly the above is true, IMHO we should not be encouraging people to rely on not well defined and non reproducible behavior and treat images as snowflakes/pets to be cherished. It will only lead to pain down the road.

In T263038#6468850, @Ladsgroup wrote:

I'm pretty sure I have seen other services having dev image published by the pipeline or at least with a tag similar to "latest". Here is kask with "stable" tag https://gerrit.wikimedia.org/r/c/mediawiki/services/kask/+/617564/1#message-a665dccdd07970b3e826c6d7e2ed12182d7e0ec6

That's because of https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/kask/+/refs/heads/master/.pipeline/config.yaml#26. The problems with that tag are of a similar nature as with "latest"[1]. Aka, it says "stable", but this is automatically added, as long as tests pass, which isn't necessarily the same as what a human might consider "stable". It might also be subject to race conditions, I am not sure if the pipeline would always serialize jobs resulting from merges.

[1] Here's a primer on latest, most of the arguments apply to this case as well with some modifications. https://vsupalov.com/docker-latest-tag/

Thanks for the detailed response, it clarified lots of things. As a developer I would care to have a tag that points to HEAD of master for two usecases:

Developer productivity, For example we have a detailed section on how to set up termbox in docker (out of the box) in README of termbox: https://github.com/wikimedia/wikibase-termbox#using-official-automatically-created-images the problem is that the docker image pointed there doesn't exists anymore (and it's one-year old: docker-registry.wikimedia.org/wikimedia/wikibase-termbox:2019-08-24-040743-production. Having something for devs or users to use would be amazing
PM testability, We need to have a public test setup (I think it's https://termbox-beta.wmflabs.org) so our PM (or in some teams QA) can confirm stories are done, but without such tag it gets really complicated to have that and as the result of this situation, we had to write a really complicated bash to just pull the latest commit https://github.com/wikimedia/wikibase-termbox/blob/master/infrastructure/templates/updater.sh.j2 This seems wrong to me

Thanks for the replies all.

I've also been keeping in mind a plan to create automated, ephemeral UAT environments for patchsets, which would definitely require an image to be published before merge. I think this is a valid use case and has been wanted by developers for some time.

In T263038#6470530, @Ladsgroup wrote:

Developer productivity, For example we have a detailed section on how to set up termbox in docker (out of the box) in README of termbox: https://github.com/wikimedia/wikibase-termbox#using-official-automatically-created-images the problem is that the docker image pointed there doesn't exists anymore (and it's one-year old: docker-registry.wikimedia.org/wikimedia/wikibase-termbox:2019-08-24-040743-production. Having something for devs or users to use would be amazing

If what you're looking for is a tag to the latest built production image, not considering race conditions, you can add any tag you like to the .pipeline/config.yaml in your repo via the tags: [] attribute in the publish step. Here's some docs on that: https://wikitech.wikimedia.org/wiki/PipelineLib/Reference

In T263038#6470530, @Ladsgroup wrote:

Thanks for the detailed response, it clarified lots of things. As a developer I would care to have a tag that points to HEAD of master for two usecases:

Developer productivity, For example we have a detailed section on how to set up termbox in docker (out of the box) in README of termbox: https://github.com/wikimedia/wikibase-termbox#using-official-automatically-created-images the problem is that the docker image pointed there doesn't exists anymore (and it's one-year old: docker-registry.wikimedia.org/wikimedia/wikibase-termbox:2019-08-24-040743-production. Having something for devs or users to use would be amazing

Yes, that's fully understandable. We have a similar issues when crafting a helm chart. Up to now we 've been passing the issue to the developer of the chart and asking that they put a value there that makes sense to them. In some cases it's just a tag of the timestamp format that at that point in time is considered "stable". In other cases, no decision is taken and we got a "latest" in there requiring that the user of the helm chart overrides it.

The use case I am describing btw, is about being able to launch the service the image powers while doing work on a *dependent* service (e.g. working on mediawiki Cite extension while using the citoid image), not the the one the image powers. That latter case will need to be differently addressed.

On the other hand, we do know fully well that the mutability of docker tags has a potential of becoming a huge mess and we definitely don't want to end up using mutable tags like "stable", "latest", "alpha" in production, which is why we discourage it.

PM testability, We need to have a public test setup (I think it's https://termbox-beta.wmflabs.org) so our PM (or in some teams QA) can confirm stories are done, but without such tag it gets really complicated to have that and as the result of this situation, we had to write a really complicated bash to just pull the latest commit https://github.com/wikimedia/wikibase-termbox/blob/master/infrastructure/templates/updater.sh.j2 This seems wrong to me

While that use case makes again full sense to me, I am thinking that it's bound to be a bit more complicated than HEAD of master as teams grow and >1 user stories are being worked on. We 've been thinking on how to create infrastructure to facilitate testing/showcasing the result of patches. We are still in the discussing phase though.

In T263038#6472207, @jeena wrote:

Thanks for the replies all.

I've also been keeping in mind a plan to create automated, ephemeral UAT environments for patchsets, which would definitely require an image to be published before merge. I think this is a valid use case and has been wanted by developers for some time.

I 'd love to have that too, but in this case, those images are bound to the patchsets, ephemeral and not mutable. Garbage collecting them could become a problem though. We 'd have to look into it more. For posterity's sake, this is an orthogonal use case to the other use cases described in this task. Not even sure if we should be naming those images "dev" images, my gut feeling says no.

Considering the two types of images we've discussed publishing:

Images with dev dependencies built into them so that developers can use them to test changes or debug something. These images would be built by CI after every merge. I don’t think we need to keep around old images when a new one is published.

If we stuck to publishing an image with just the dev dependencies and no other code, utilizing cached layers we may only push a new image when a dev dependency changes instead of on every merge. Would that be more acceptable?

The question of why we don't have the dev environment build these came up. I don't think it's a good idea for a couple of reasons:
- Some of the requirements of the dev environment are fast set-up and run time and less resource usage. This is based on comments from the Developer Satisfaction Survey and comments from people during 1-1 interviews. Requiring developers to build each image (whether manually or automatically) would increase the time required to set up and run the environment. As a result of building the images, they would also be downloading all the base images from which to build upon, as well as installing blubber to generate the Dockerfiles.
- We already have a CI system set up to build images, and I think we should make use of it instead of designing another system to do the same thing somewhere else.
- It seems more efficient to build and publish these images once, instead of requiring every single developer to build their own version.

Ephemeral images published for each patchset, deleteable after a new patchset is pushed to the change, and/or the change is merged. I expect these images to be built without dev dependencies although they would be used during the dev's workflow. I'm not attached to any naming scheme.

Both 1 & 2 involve publishing images that are not intended to be permanent, so I’m not sure why they should be treated that much differently (I understood that the issue of garbage collection is what sparked this task).

In T263038#6473810, @akosiaris wrote:

I 'd love to have that too, but in this case, those images are bound to the patchsets, ephemeral and not mutable. Garbage collecting them could become a problem though. We 'd have to look into it more. For posterity's sake, this is an orthogonal use case to the other use cases described in this task. Not even sure if we should be naming those images "dev" images, my gut feeling says no.

I'm not sure what you mean by "bound to the patchsets". Were you meaning to say they represented a patchset, or just giving an explanation for their ephemeral quality? It’s not clear to me why doing garbage collection is okay for one but not the other. Is it something about possibly reusing a tag that makes this an issue?

In T263038#6490068, @jeena wrote:

Considering the two types of images we've discussed publishing:

Images with dev dependencies built into them so that developers can use them to test changes or debug something. These images would be built by CI after every merge. I don’t think we need to keep around old images when a new one is published.

I don't think we 've discussed this yet, have I missed something? In this task I think we 've discussed:

the issue of developing something (e.g. a mediawiki extension) that's dependent on an image (e.g. like termbox pointed out in T263038#6470530). This does not require dev dependencies and we already support this scenario. We could improve the UX however.
the issue of needing images showcasing/manual QA testing. We 'll need work on this but most of it is tangential to the images themselves and it highly depends on what kind of UX (my vision is something akin to https://patchdemo.wmflabs.org/) we want to provide and use cases we want to cover. So a bigger discussion overall. But I get the idea that 1 simple tag to refer to something showcasable and somewhat (I use this very liberally, I am trying to rule out race conditions and assumptions about delays between merging and showcasing) recent makes sense in the meanwhile. But this does not require dev dependencies.
the issue of automatic testing for patchsets. I believe the work being done on the kask integration tests already goes that way, still some work to be done and hurdles to be lifted but we are on the path already. But still, this does not require dev dependencies

I 've tried making some sense out of the 2 gerrit changes posted in the taks, but the first one was abandoned by the owner with the decision to use the pipeline images, and the second one references a blubber variant ("dev") that does not seem to exist, so I am not sure what it refers to. I get the feeling however that these 2 lines from the first patchset is what you are referring to?

RUN apt-get update && apt-get install -y "git" "build-essential" "python-dev" "librdkafka-dev" "librdkafka1" "librdkafka++1" "kafkacat" "telnet" "iputils-ping" "procps" "curl" "vim" && rm -rf /var/lib/apt/lists/*
RUN mkdir -p /srv/service/schemas/event && git clone --single-branch -- https://gerrit.wikimedia.org/r/schemas/event/primary /srv/service/schemas/event/primary && cd /srv/service/schemas/event/primary && git reset --hard 486912f && git clone --single-branch -- https://gerrit.wikimedia.org/r/schemas/event/secondary /srv/service/schemas/event/secondary && cd /srv/service/schemas/event/secondary && git reset --hard d300355

Am I correct in this assumption?

If we stuck to publishing an image with just the dev dependencies and no other code, utilizing cached layers we may only push a new image when a dev dependency changes instead of on every merge. Would that be more acceptable?

Not sure I follow. What do you mean by no other code?

The question of why we don't have the dev environment build these came up. I don't think it's a good idea for a couple of reasons:
Some of the requirements of the dev environment are fast set-up and run time and less resource usage. This is based on comments from the Developer Satisfaction Survey and comments from people during 1-1 interviews. Requiring developers to build each image (whether manually or automatically) would increase the time required to set up and run the environment. As a result of building the images, they would also be downloading all the base images from which to build upon, as well as installing blubber to generate the Dockerfiles.

each image? How many images are we talking about here? I am assuming that a dev env isn't going to setting up all the services btw, but rather gate their enabling via some toggles. But even in the case that a user is going to setup all services in their dev env, they should be downloading the production images for all the other services and just build the 1 image of the software they want to work on (2 images perhaps if they have 2 interacting services and want to work on both), sparring them the time to build all the images.

We already have a CI system set up to build images, and I think we should make use of it instead of designing another system to do the same thing somewhere else.

Absolutely agreed. Which is why the idea was always to have the dev env reuse the production images (and even the helm charts in fact) for all services except the one the user wanted to develop live on their env. We do however, as pointed out above, have the problem of not having currently a policy for how to address those images (i.e. a tag that means something) but rather avoided this up to now and thought that the default version field in the helm chart would suffice. We still haven't fully tested this approach however, maybe it makes sense to revisit it.

Even more, we also must assume that users will want to make changes to images while experimenting. And since hosting every experiment every dev has in mind is unfeasible, we will have to support their need to (re)build images in the dev env anyway

It seems more efficient to build and publish these images once, instead of requiring every single developer to build their own version.

Ephemeral images published for each patchset, deleteable after a new patchset is pushed to the change, and/or the change is merged. I expect these images to be built without dev dependencies although they would be used during the dev's workflow. I'm not attached to any naming scheme.

Both 1 & 2 involve publishing images that are not intended to be permanent, so I’m not sure why they should be treated that much differently (I understood that the issue of garbage collection is what sparked this task).

It's the churn rate that is the issue. Garbage collecting isn't currently fully fleshed out, has had a couple of rough edges already (e.g. T242604) and it's not automatic. Even if the images are ephemeral and deleted by the pipeline they leave behinds blobs that need to be GCed (see https://docs.docker.com/registry/garbage-collection/). Until we are at a point where we can support automatic and reliable GCing of those, we need to keep the churn rate low to avoid killing swift.

In T263038#6473810, @akosiaris wrote:

I 'd love to have that too, but in this case, those images are bound to the patchsets, ephemeral and not mutable. Garbage collecting them could become a problem though. We 'd have to look into it more. For posterity's sake, this is an orthogonal use case to the other use cases described in this task. Not even sure if we should be naming those images "dev" images, my gut feeling says no.

I'm not sure what you mean by "bound to the patchsets". Were you meaning to say they represented a patchset, or just giving an explanation for their ephemeral quality?

The former.

It’s not clear to me why doing garbage collection is okay for one but not the other.

As pointed out above, it's the churn rate that is the issue currently.

In T263038#6490646, @akosiaris wrote:
In T263038#6490068, @jeena wrote:

Considering the two types of images we've discussed publishing:

Images with dev dependencies built into them so that developers can use them to test changes or debug something. These images would be built by CI after every merge. I don’t think we need to keep around old images when a new one is published.

I don't think we 've discussed this yet, have I missed something? In this task I think we 've discussed:

the issue of developing something (e.g. a mediawiki extension) that's dependent on an image (e.g. like termbox pointed out in T263038#6470530). This does not require dev dependencies and we already support this scenario. We could improve the UX however.

the issue of needing images showcasing/manual QA testing. We 'll need work on this but most of it is tangential to the images themselves and it highly depends on what kind of UX (my vision is something akin to https://patchdemo.wmflabs.org/) we want to provide and use cases we want to cover. So a bigger discussion overall. But I get the idea that 1 simple tag to refer to something showcasable and somewhat (I use this very liberally, I am trying to rule out race conditions and assumptions about delays between merging and showcasing) recent makes sense in the meanwhile. But this does not require dev dependencies.

the issue of automatic testing for patchsets. I believe the work being done on the kask integration tests already goes that way, still some work to be done and hurdles to be lifted but we are on the path already. But still, this does not require dev dependencies

I 've tried making some sense out of the 2 gerrit changes posted in the taks, but the first one was abandoned by the owner with the decision to use the pipeline images, and the second one references a blubber variant ("dev") that does not seem to exist, so I am not sure what it refers to. I get the feeling however that these 2 lines from the first patchset is what you are referring to?
RUN apt-get update && apt-get install -y "git" "build-essential" "python-dev" "librdkafka-dev" "librdkafka1" "librdkafka++1" "kafkacat" "telnet" "iputils-ping" "procps" "curl" "vim" && rm -rf /var/lib/apt/lists/*
RUN mkdir -p /srv/service/schemas/event && git clone --single-branch -- https://gerrit.wikimedia.org/r/schemas/event/primary /srv/service/schemas/event/primary && cd /srv/service/schemas/event/primary && git reset --hard 486912f && git clone --single-branch -- https://gerrit.wikimedia.org/r/schemas/event/secondary /srv/service/schemas/event/secondary && cd /srv/service/schemas/event/secondary && git reset --hard d300355
Am I correct in this assumption?

I apologize - the comment about abandoning the patch in favor of the existing pipeline images slipped by me. But to answer your question, the Dockerfile from the patchset lines you referenced above is (with some modification due to not having already cloned the eventgate repo) what would be outputted from running blubber against the development variant for that repo (https://gerrit.wikimedia.org/r/plugins/gitiles/eventgate-wikimedia/+/refs/heads/master/.pipeline/blubber.yaml)

Those dev dependencies are what I thought was needed based on seeing those two patchsets, and the expectation that service developers would be mounting their files into the container to develop. Since it turns out that those assumptions were incorrect for both of those changes, I am trying to get more information on this from developers. I do not think the production images are suitable for someone developing in docker as they 1) don't have the dev packages installed, and 2) the runuser doesn't have permissions to alter anything in the workdir

If we stuck to publishing an image with just the dev dependencies and no other code, utilizing cached layers we may only push a new image when a dev dependency changes instead of on every merge. Would that be more acceptable?

Not sure I follow. What do you mean by no other code?

I meant that we could only install the dev dependencies such as apt and npm packages instead of copying the whole contents of the repo in question, and developers of the service would mount their files when developing. I thought this would cause less churn.

The question of why we don't have the dev environment build these came up. I don't think it's a good idea for a couple of reasons:
Some of the requirements of the dev environment are fast set-up and run time and less resource usage. This is based on comments from the Developer Satisfaction Survey and comments from people during 1-1 interviews. Requiring developers to build each image (whether manually or automatically) would increase the time required to set up and run the environment. As a result of building the images, they would also be downloading all the base images from which to build upon, as well as installing blubber to generate the Dockerfiles.
each image? How many images are we talking about here? I am assuming that a dev env isn't going to setting up all the services btw, but rather gate their enabling via some toggles. But even in the case that a user is going to setup all services in their dev env, they should be downloading the production images for all the other services and just build the 1 image of the software they want to work on (2 images perhaps if they have 2 interacting services and want to work on both), sparring them the time to build all the images.

I agree regarding toggling what they want and downloading production images for services not being worked on. It was my opinion that developers would also download the dev versions of the images that they wanted to work on, but if building them locally turns out to be convenient enough, then I don't have an issue with that. I misunderstood the needs represented by the patchsets listed in this task, and I will do more investigation on this.

It's the churn rate that is the issue. Garbage collecting isn't currently fully fleshed out, has had a couple of rough edges already (e.g. T242604) and it's not automatic. Even if the images are ephemeral and deleted by the pipeline they leave behinds blobs that need to be GCed (see https://docs.docker.com/registry/garbage-collection/). Until we are at a point where we can support automatic and reliable GCing of those, we need to keep the churn rate low to avoid killing swift.

Thanks for this info. Regarding the ephemeral images, in CI perhaps we could send a request to the api to delete the manifest for the image once a new patchset image is deployed, and then have GC run at a defined interval? Upon reading the documentation it seems like a maintenance window is necessary for the GC since it needs to be in read-only mode to avoid deleting newly updated layers. I'm sure you've thought about this already of course!

I'm not sure what you mean by "bound to the patchsets". Were you meaning to say they represented a patchset, or just giving an explanation for their ephemeral quality?

The former.

Thanks for the clarification. On a related note, I consider all images published after merge as being bound to a git commit.

So to conclude, I'll report back after doing some more investigation. Although, I assume there is still a question of whether we want/need a separate registry for ephemeral images in the future? I'm not sure of your opinion on that.

Setting as stalled for now, pending the investigation mentioned in the last comment.

brennen subscribed.Oct 13 2020, 8:33 PM

thcipriani edited projects, added Release-Engineering-Team (thcipriani-workboard-fiddling); removed Release-Engineering-Team (Pipeline).Apr 20 2021, 1:17 AM

thcipriani moved this task from thcipriani-workboard-fiddling to Seen (ARCHIVE) on the Release-Engineering-Team board.Apr 20 2021, 1:18 AM

thcipriani edited projects, added Release-Engineering-Team; removed Release-Engineering-Team (thcipriani-workboard-fiddling).

thcipriani edited projects, added Release-Engineering-Team (Seen); removed Release-Engineering-Team.Apr 20 2021, 3:23 PM

Dev images registryOpen, Stalled, Needs TriagePublicActions

Description

Related Objects

Event Timeline

Dev images registry
Open, Stalled, Needs TriagePublic
Actions