Toolforge web containers have webservice-runner installed, which pulls in Python 3 as a dependency. This adds unnecessary weight to the containers and occasionally causes problems like T287421. It would be nice to get the Python runtime removed from non-Python containers before the next generation of containers is created (likely either when Debian 12/bookworm is released, or if production wikis move to PHP 8.0 or newer and we get a suitable repo to build PHP 8.0+ images from).
Description
Details
- Other Assignee
- Legoktm
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | taavi | T335352 Provide a PHP 8.2 image for Kubernetes toolforge-jobs and webservice
Resolved | | taavi | T335507 Build Bookworm based Toolforge Kubernetes images
Resolved | | taavi | T358320 [toolforge-webservice] Remove old webservice-runner code
Resolved | | taavi | T293552 Remove Python/webservice-runner from toolforge web containers
Event Timeline
There are essentially two ways to get this done:
- Re-implement webservice-runner in a language like Go or Rust that does not require pulling in a runtime as a dependency
- Update webservice to specify the command and any flags by itself, and fully remove the dependency on a runner command inside the container
If possible, getting rid of the runner entirely seems like the best long-term option.
What burden does the "unnecessary weight" of python3 in each container create? Is this about bytes stored somewhere, conceptual complexity, runtime performance, or something else entirely?
Inside a Kubernetes container, webservice-runner becomes the PID 1 process of the container and then launches whatever particular runtime is needed as a subprocess. What exactly that entails varies a bit from runtime to runtime, but it is usually a matter of setting up some configuration for a service by compositing built-in and mounted config files (lighttpd config, uwsgi config, etc.) and then starting the service itself.
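The pattern described above can be sketched in a few lines of shell. This is a rough illustration with invented paths (a demo config under /tmp), not the actual webservice-runner code: composite a built-in config with an optional user override, then exec the service so it replaces the entrypoint as PID 1.

```shell
#!/bin/bash
# Sketch of the runner pattern: build a config file from a built-in
# template plus an optional user override, then hand off to the service.
set -eu

CONF=/tmp/lighttpd-demo.conf   # stand-in for /var/run/lighttpd/$TOOL
printf 'server.port = 8000\n' > "$CONF"

# Append the user's override if present (mirrors ~/.lighttpd.conf).
if [ -f "${HOME:-}/.lighttpd.conf" ]; then
    cat "${HOME}/.lighttpd.conf" >> "$CONF"
fi

echo "would exec: lighttpd -f $CONF -D"
# In the real container this would be: exec /usr/sbin/lighttpd -f "$CONF" -D
```

Because of the final `exec`, the service inherits PID 1 and receives signals from the container runtime directly.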
Instead of one python based entrypoint script that knows how to start everything we could put a runtime language specific entrypoint into each container, but honestly I'm not sure if that will make a whole lot of difference in any measure other than bytes on disk. I think the long term solution is moving to custom containers built via build packs or something similar where the configuration and entrypoint end up baked into the container image based on the stack for that container. I would much rather see energy put into a new future that gives us an exit from grid engine than into shaving a few megabytes off of the container sizes (assuming that's the weight to be removed).
Mostly I am concerned with accidental dependencies being included in the container that end up being part of the containers' "stable interface" that users expect, whether it's Python 3 itself or the system libraries it pulls in (IIRC we had some issues during the switch from Python 2 -> 3). The size reduction is just a side benefit for me.
In the proposed case of a standalone image (see T194953#7491536, no separate ticket yet) it would somewhat defeat the point if said standalone image had many system libraries plus Python 3 installed. It wouldn't defeat the point entirely, but there would then be no benefit to using a standalone image over a plain Python 3 image.
> Instead of one python based entrypoint script that knows how to start everything we could put a runtime language specific entrypoint into each container, but honestly I'm not sure if that will make a whole lot of difference in any measure other than bytes on disk. I think the long term solution is moving to custom containers built via build packs or something similar where the configuration and entrypoint end up baked into the container image based on the stack for that container. I would much rather see energy put into a new future that gives us an exit from grid engine than into shaving a few megabytes off of the container sizes (assuming that's the weight to be removed).
Agreed on the long-term direction. I do think there is some value in splitting out the runner component or some refactoring to make it better isolated, because that is what will need to end up in the various buildpacks AIUI (ex: https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/toolforge/buildpacks/+/refs/heads/master/uwsgi/). Mostly it depends on how much work this ends up being, which I don't have a good idea of since it's been a year since I last closely looked into this. I suspect that some of these could easily be taken care of by a shell script, but again, need to look a bit more closely first.
By wstype current implementation:
- generic - runs whatever extra_args describes
- js - {/usr/local/bin/npm, /usr/bin/npm} start with cwd set to $HOME/www/js
- lighttpd-plain - builds /var/run/lighttpd/$TOOL config file possibly with $HOME/.lighttpd.conf concatenated; runs /usr/sbin/lighttpd -f /var/run/lighttpd/$TOOL -D
- lighttpd - same as lighttpd-plain but with php fastcgi config added before $HOME/.lighttpd.conf
- python - runs a long /usr/bin/uwsgi ... command with optional --venv and --ini args based on files seen on disk
- tomcat - /usr/bin/deprecated-tomcat-starter; only used on the grid engine?
- uwsgi - /usr/bin/uwsgi --http-socket :$PORT --logto $HOME/uwsgi.log --ini $HOME/uwsgi.ini ...; only used on grid engine?
So ... yeah, I think this could be replaced with a bit of bash scripting really. The lighttpd and python variants do the most setup work, but really not much.
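The per-wstype dispatch above could be expressed as a single shell function along these lines. This is a hedged sketch, not the deployed implementation: `exec` is replaced by `echo` so it runs anywhere, and the commands are taken verbatim from the list above.

```shell
#!/bin/bash
# Sketch of a shell dispatcher over wstypes. Each branch echoes the
# command the real runner would exec for that type.
set -eu

run_wstype() {
    local wstype="$1" port="$2"
    shift 2
    case "$wstype" in
        generic)       echo "exec $*" ;;
        js)            echo 'cd $HOME/www/js && exec npm start' ;;
        python|uwsgi)  echo "exec /usr/bin/uwsgi --http-socket :${port} --ini \$HOME/uwsgi.ini" ;;
        lighttpd*)     echo 'exec /usr/sbin/lighttpd -f /var/run/lighttpd/$TOOL -D' ;;
        *)             echo "unknown wstype: $wstype" >&2; return 1 ;;
    esac
}

run_wstype generic 8000 www/rust/run.sh   # prints: exec www/rust/run.sh
```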
Change 738503 had a related patch set uploaded (by Legoktm; author: Legoktm):
[operations/docker-images/toollabs-images@master] python39: Use shell reimplementation of webservice-runner
I did Python 3.9 as a proof-of-concept, the same code should be reusable for the other Python versions.
generic seems simple, but I'm not sure whether we need to do anything special with quoting arguments. That's mostly why I didn't submit a WIP patch adding a standalone image using a generic shell webservice-runner.
Change 738503 merged by jenkins-bot:
[operations/docker-images/toollabs-images@master] python39: Use shell reimplementation of webservice-runner
Mentioned in SAL (#wikimedia-cloud) [2022-08-25T10:40:11Z] <taavi> tagged new version of the python39-web container with a shell implementation of webservice-runner T293552
Change 827007 had a related patch set uploaded (by Legoktm; author: Legoktm):
[operations/docker-images/toollabs-images@master] Use shell webservice-runner for python35/python37 images
Change 827009 had a related patch set uploaded (by Legoktm; author: Legoktm):
[operations/docker-images/toollabs-images@master] Use shell webservice-runner for node16 image
Thanks @taavi for deploying it; it seems to have worked fine! I put up a patch for the rest of the Python images and an initial patch for the newest node version if you want to review those :).
For the generic version, given an extra_args of [www/rust/run.sh], the runner is invoked as [/usr/bin/webservice-runner, --type, generic, --port, 8000, www/rust/run.sh]
We already hardcode/assume port 8000, so we can safely ignore the first 4 arguments. How defensive do we want to be in this processing? Should we properly implement arg parsing of --type and --port in bash? Or can we assume they'll always be in that order and just skip the first 4 arguments? My current thinking is that if we do proper arg parsing for those two options, it makes it easy to drop them in the future without needing to change the bash runner at the same time.
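The "proper arg parsing" option could look roughly like this. It is an illustration, not the deployed runner: consume --type and --port in whatever order they appear, and treat everything after them as the command to run.

```shell
#!/bin/bash
# Sketch: parse --type/--port leniently, leaving the remaining arguments
# as the command, so either flag can be dropped later without breaking
# the runner.
set -eu

parse_runner_args() {
    local wstype=generic port=8000
    while [ $# -gt 0 ]; do
        case "$1" in
            --type) wstype="$2"; shift 2 ;;
            --port) port="$2"; shift 2 ;;
            *) break ;;   # first non-flag argument starts the command
        esac
    done
    echo "type=$wstype port=$port cmd=$*"
}

parse_runner_args --type generic --port 8000 www/rust/run.sh
```

With the invocation from the comment above, this yields `type=generic port=8000 cmd=www/rust/run.sh`, and it still works if one of the flags is omitted.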
Change 827007 merged by jenkins-bot:
[operations/docker-images/toollabs-images@master] Use shell webservice-runner for python35/python37 images
Change 829107 had a related patch set uploaded (by Legoktm; author: Legoktm):
[operations/docker-images/toollabs-images@master] Use shell webservice-runner for golang111 image
Change 827009 merged by jenkins-bot:
[operations/docker-images/toollabs-images@master] Use shell webservice-runner for node16 image
Mentioned in SAL (#wikimedia-cloud) [2022-10-12T23:25:34Z] <bd808> Rebuilding all Toolforge docker images (T278436, T311466, T293552)
Change 872499 had a related patch set uploaded (by Legoktm; author: Legoktm):
[operations/docker-images/toollabs-images@master] Use shell webservice-runner for remaining nodejs images
Change 829107 merged by jenkins-bot:
[operations/docker-images/toollabs-images@master] Use shell webservice-runner for golang111 image
Change 872499 merged by jenkins-bot:
[operations/docker-images/toollabs-images@master] Use shell webservice-runner for remaining nodejs images
Marked T335507: Build Bookworm based Toolforge Kubernetes images as a parent task since I'm aiming to get rid of the Python dependency in the Bookworm based images.
Change 912868 had a related patch set uploaded (by Majavah; author: Majavah):
[operations/docker-images/toollabs-images@master] Use shell webservice-runner for jdk17, ruby27 images
Change 912868 merged by jenkins-bot:
[operations/docker-images/toollabs-images@master] Use shell webservice-runner for jdk17, ruby27 images
We basically only have the lighttpd based images remaining. For those the start script and needed config are a bit longer, so I'm wondering if we want to take a different approach than copy-pasting the same script to all the images. A separate binary package in the toolforge-webservice package might be one option, and a multi-stage Docker build would be another. Thoughts?
This seems like a lot of overhead for small gain to me, but I'm willing to hear why I'm missing something important in that.
> a multi-stage Docker build would be another
I'm assuming that by this you mean using a published container image as the means of centralization? Something like:
COPY --from=docker-registry.tools.wmflabs.org/toolforge-webservice-runner:latest /srv/app/webservice-runner /usr/bin/
> Thoughts?
The copy from container idea sounds neat, but could also introduce some trickiness of making sure a new build of the source container is completed before trying to build containers that would be copying from it.
I initially wondered if symlinks in the git repo could fix this issue, but then I rediscovered upstream discussion of why that's fragile and unimplemented. We do have the build.py script to manage image building here, so we could in theory add our own "copy files from $DIR before building" functionality to go along with the existing Dockerfile generation step.
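A hypothetical "copy files from $DIR before building" step for build.py's workflow might look like this. All paths and directory names here are made up for illustration; the idea is simply to stage one shared runner script into each image directory that needs it, since symlinks would not survive the docker build context.

```shell
#!/bin/bash
# Sketch: copy a shared runner script into every image directory that
# declares a Dockerfile template, before invoking docker build.
set -eu

stage_shared_runner() {
    local shared="$1" tree="$2" dir
    for dir in "$tree"/*/; do
        [ -f "${dir}Dockerfile.template" ] || continue
        cp "$shared" "${dir}webservice-runner.sh"
        echo "staged runner into $dir"
    done
}

# Demo against a throwaway tree under a temp directory.
demo=$(mktemp -d)
mkdir -p "$demo/images/python39" "$demo/images/node16"
touch "$demo/images/python39/Dockerfile.template"
printf '#!/bin/sh\n' > "$demo/runner.sh"
stage_shared_runner "$demo/runner.sh" "$demo/images"
```

In the demo, only python39 gets the runner staged, because node16 has no Dockerfile.template. The real integration point would be a hook in build.py before its existing Dockerfile generation step.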
Change 983520 had a related patch set uploaded (by Majavah; author: Majavah):
[operations/docker-images/toollabs-images@master] shared: lighttpd: fix override file path
Change 983520 merged by jenkins-bot:
[operations/docker-images/toollabs-images@master] shared: lighttpd: fix override file path
Mentioned in SAL (#wikimedia-cloud) [2023-12-16T20:54:30Z] <bd808> Rebuilding all containers to pick up lighttpd config fix and normal package updates (T293552)
Change 1005952 had a related patch set uploaded (by Majavah; author: Majavah):
[operations/docker-images/toollabs-images@master] Convert remaining images to shell webservice-runner
Change 1005952 merged by jenkins-bot:
[operations/docker-images/toollabs-images@master] Convert remaining images to shell webservice-runner
Mentioned in SAL (#wikimedia-cloud) [2024-03-06T10:34:20Z] <taavi> rebuilding all docker images for https://gerrit.wikimedia.org/r/c/operations/docker-images/toollabs-images/+/1005952 (T293552) + normal package updates