While a full image build (which should usually only happen for train deployments and/or l10n updates) takes a long time, and I think that is generally acceptable, most of the time even a simple backport deployment can take up to 20 minutes on k8s at the moment.
Specifically:
- the image build/publish is fast (< 1 minute)
- the deployment varies randomly between very fast (< 1 minute) and slow (~ 8 minutes)
The reason for the occasional slowness is that sometimes a pod gets scheduled onto a k8s node that hasn't downloaded the "fat" base image before, so it needs to download and extract a ~7 GB set of image layers. That takes a long time, and this slowness is clearly unacceptable for a deployer.
I see two paths to try to ease the problem:
Pre-pulling images
With every scap deployment, we distribute a command to all k8s nodes to pull the newest mediawiki multiversion image (and possibly prune old ones, keeping up to N previous versions), then force a redeployment of the pods using a varying annotation, which should then happen as fast as possible.
The pre-pulling should only take a relatively long time when running against an image rebuilt from scratch, so presumably when running in the automated train presync process.
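As a rough sketch of what scap could drive, assuming a hypothetical image reference and deployment name (both would need to be replaced with the real ones), the node-side pull plus forced redeploy could look like this; note that `kubectl rollout restart` implements the "varying annotation" trick by bumping the `kubectl.kubernetes.io/restartedAt` annotation on the pod template:

```shell
# Hypothetical names -- not the real registry path or deployment.
IMAGE="docker-registry.example.org/mediawiki-multiversion:latest"

# Run on every k8s node, e.g. from scap's existing per-host loop:
crictl pull "$IMAGE"        # warm the node's local image cache
crictl rmi --prune          # optionally drop images no longer referenced

# Then, once from the deploy host: force a redeploy of the pods.
kubectl rollout restart deployment/mediawiki
kubectl rollout status deployment/mediawiki
```

Since every node already has the new image by the time the rollout starts, the pod restarts should no longer hit the 7 GB download path.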
Use shared volumes on the k8s nodes, distribute code via scap
This is self-explanatory, I think: basically the idea would be to give up on building the code into the docker images, and instead mount it into the pods as a local read-only volume in k8s.
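For illustration, a minimal pod-template fragment for this approach could look like the following; the `/srv/mediawiki` path (where scap would sync the code on each node) and the volume name are assumptions, not existing configuration:

```
# Sketch only: mount scap-synced code from the node's filesystem.
spec:
  containers:
    - name: mediawiki
      volumeMounts:
        - name: mediawiki-code
          mountPath: /srv/mediawiki
          readOnly: true
  volumes:
    - name: mediawiki-code
      hostPath:
        path: /srv/mediawiki
        type: Directory
```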
Given that this second option wasn't preferred by most people, I'll explore the first option as a starter.