Fix Toolhub developer environment tooling to work on Linux hosts
Closed, Resolved | Public | BUG REPORT

Description

The current Toolhub developer environment is a set of docker-compose managed containers that use overlay mounts so the containers can change files in the git clone on the same machine that hosts the Docker stack. This setup relies on specific Docker features that work differently across versions and platforms. Some of these differences may end up being hard requirements (for example, you need Docker version X to get feature Y), but other incompatibilities are really just an artifact of the primary maintainer of these bits (@bd808) working from a MacBook running OSX and never having tried to make the developer experience work from a Linux host of any kind.

Known issues:

  • The "profiles" feature used in docker-compose.yaml to mark some containers as optional is not supported by docker-compose v1.25.0, which is the version readily available via Debian apt repos.
  • Mounted file system overlays on Linux Docker hosts need ownership mapping to allow the container to write to the mount.

Event Timeline

@Lucas_Werkmeister_WMDE started https://gerrit.wikimedia.org/r/c/wikimedia/toolhub/+/732384 for the profiles issue. Per my comments on that patch, I think that we should move these optional definitions completely out of the main docker-compose and into files that can be loaded optionally as needed. Both of these optional testing helpers already have directories under contrib/ where it would be reasonable to place the files and add documentation.
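As a rough sketch of the "load optional files as needed" idea: docker-compose accepts multiple -f options and merges the files in order. The helper below only builds and prints such a command line. The function name compose_cmd and the contrib/prometheus/docker-compose.yaml path are hypothetical, chosen for illustration; the actual file layout is whatever the patches end up using.

```shell
# Build (but do not run) a docker-compose command line that layers
# optional config files on top of the main docker-compose.yaml.
# The function name and the contrib/ path below are hypothetical.
compose_cmd() {
    cmd="docker-compose -f docker-compose.yaml"
    for extra in "$@"; do
        cmd="$cmd -f $extra"
    done
    printf '%s\n' "$cmd"
}

# Print the command for the base stack plus one optional helper.
compose_cmd contrib/prometheus/docker-compose.yaml
```

Appending "up" to the printed command would start the base stack together with the optional containers, while plain "docker-compose up" keeps working on older docker-compose versions because the main file no longer needs the "profiles" key.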

Mounts likely can be fixed by some variation on the user: "${MW_DOCKER_UID}:${MW_DOCKER_GID}" config recommended at https://www.mediawiki.org/wiki/MediaWiki-Docker/Configuration_recipes/Customize_base_image
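For reference, a minimal sketch of how the two variables in that recipe could be populated, assuming they are supplied through the .env file that docker-compose reads for variable substitution. The function name write_uid_env is hypothetical, and Toolhub may end up using different variable names.

```shell
# Append the invoking user's numeric UID and GID to an env file so that
# docker-compose can substitute them into
#   user: "${MW_DOCKER_UID}:${MW_DOCKER_GID}"
# The function name is hypothetical; the default target is ./.env.
write_uid_env() {
    env_file="${1:-.env}"
    {
        printf 'MW_DOCKER_UID=%s\n' "$(id -u)"
        printf 'MW_DOCKER_GID=%s\n' "$(id -g)"
    } >> "$env_file"
}

# Usage (run from the directory that holds docker-compose.yaml):
#   write_uid_env .env
```

The function appends rather than overwrites, so running it twice leaves duplicate entries that would need to be cleaned up by hand.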

bd808 changed the task status from Open to In Progress. (Nov 8 2021, 8:27 PM)
bd808 claimed this task.
bd808 triaged this task as High priority.
bd808 moved this task from Backlog to In Progress on the Toolhub board.
bd808 added a subscriber: Raymond_Ndibe.

Marking as high priority and starting work on this as it is a blocker for @Raymond_Ndibe contributing and reviewing patches.

Change 737531 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[wikimedia/toolhub@main] dev: Move prometheus profile to separate docker-compose config

https://gerrit.wikimedia.org/r/737531

Change 737532 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[wikimedia/toolhub@main] dev: Move oauth profile to separate docker-compose config

https://gerrit.wikimedia.org/r/737532

> Mounts likely can be fixed by some variation on the user: "${MW_DOCKER_UID}:${MW_DOCKER_GID}" config recommended at https://www.mediawiki.org/wiki/MediaWiki-Docker/Configuration_recipes/Customize_base_image

This bit is even uglier than I thought for our "nodejs" container. The docker-compose trick of setting the container's runtime uid & gid to match the local user causes npm to break inside the container because of a permissions mismatch on the /home/somebody/.npm directory, which lives inside the container itself. Mounting a volume over the top of this directory does not fix the problem as hoped: docker-compose mounts the volume into the container with root:root ownership, and this triggers an internal npm check for root-owned files in its cache location.
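To make the failure condition concrete, here is a small diagnostic sketch that could be run inside the container. It is not npm's actual check, only an approximation of the conditions described above: the cache directory has to exist, be owned by the effective UID, and be writable. The function name cache_dir_usable is hypothetical.

```shell
# Succeed only if the directory exists, is owned by the effective UID of
# the current process, and is writable. This approximates the conditions
# npm needs for its cache; it is not npm's own check.
cache_dir_usable() {
    dir="$1"
    [ -d "$dir" ] && [ -O "$dir" ] && [ -w "$dir" ]
}

# Usage:
#   cache_dir_usable "$HOME/.npm" || echo "npm cache dir is not usable"
```

A directory created at build time by UID 65533, or a volume mounted as root:root, would fail this check when the container runs as the host user's UID.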

There is support for changing the uid & gid of the mount in newer versions of the docker-compose spec (https://docs.docker.com/compose/compose-file/compose-file-v3/#volumes), but I haven't been able to work out yet how to get a version of docker-compose that supports that newer spec installed on my Debian Bullseye test system.

> There is support for changing the uid & gid of the mount in newer versions of the docker-compose spec (https://docs.docker.com/compose/compose-file/compose-file-v3/#volumes), but I haven't been able to work out yet how to get a version of docker-compose that supports that newer spec installed on my Debian Bullseye test system.

On Debian Bullseye, apt-get install docker-compose results in an installation of version 1.25.0 (released 2019-11-18). 1.29.2 is the latest 1.x version. But... this may all be moot because I can no longer find the documentation I thought I saw about support for setting gid and uid keys when mounting a volume.

A hack that @Raymond_Ndibe is using until we find a better fix is changing the UID for the runtime user in the .pipeline/dev-nodejs.Dockerfile (the USER 65533 line) to match the UID of the account that is running docker-compose up (UID 1000 in his case). This change makes the runtime effective UID inside the Docker container match the ownership of the mounted directory at /srv/app. This is also what the user: "${MW_DOCKER_UID}:${MW_DOCKER_GID}" config does, but with one important difference: Raymond's change happens as the container is built, so the npm install that runs as part of the build process is done as UID 1000. Setting user: ... in the docker-compose.yaml only takes effect at runtime for the container, so the npm cache and other files written by the npm install during the image build are still owned by UID 65533.
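A sketch of how that manual edit could be scripted, assuming the Dockerfile contains a line of the exact form "USER <number>" as described above. The function name set_dockerfile_uid is hypothetical, and this is only an illustration of the hack, not a proposed fix.

```shell
# Rewrite a purely numeric "USER <n>" line in a Dockerfile to the given
# UID (default: the invoking user's UID). Output goes to a temp file
# first so a failed sed does not truncate the original.
set_dockerfile_uid() {
    dockerfile="$1"
    uid="${2:-$(id -u)}"
    sed "s/^USER [0-9][0-9]*\$/USER ${uid}/" "$dockerfile" > "${dockerfile}.tmp" &&
        mv "${dockerfile}.tmp" "$dockerfile"
}

# Usage (from the root of the git clone):
#   set_dockerfile_uid .pipeline/dev-nodejs.Dockerfile
```

The image has to be rebuilt afterwards for the change to apply to the build-time npm install, and the modified Dockerfile should not be committed.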

> But... this may all be moot because I can no longer find the documentation I thought I saw about support for setting gid and uid keys when mounting a volume.

I think I found the doc that I confused with volumes at https://docs.docker.com/compose/compose-file/compose-file-v3/#long-syntax-2. You can set the effective UID and GID of a mounted secret, but apparently not a volume.

[23:19]  <    bd808> I am repeatedly smashing my head against the desk trying to work out a fix for a problem running the Toolhub dev environment on a Linux host. Maybe talking out loud here will either help me figure out what to try next or invite help from someone.
[23:20]  <    bd808> The basic problem is getting a docker-compose stack running containers built with blubber to let Docker on Linux mount a git clone read-write into the container at /srv/app.
[23:21]  <    bd808> On an OSX host, Docker somehow magically manages the uid/gid difference and things "just work"
[23:21]  <    dancy> hehe
[23:21]  <    bd808> on a Linux host, the uid/gid mismatch between the container and the host makes it all go boom
[23:22]  <    bd808> so then you do the trick of setting the effective UID/GID in the docker-compose file for the container
[23:22]  <    bd808> that gets us a step closer, and r/w should work
[23:23]  <    bd808> but... now there is a new problem with uid/gid and perms. The container runs `npm install` things both at container build and at container start.
[23:24]  <    bd808> The run that happens during build creates $HOME/.npm cache bits owned by the UID that blubber sets
[23:24]  <    bd808> when the container is later started as a different UID that cache is not readable and npm goes boom
[23:25]  <    bd808> so the next trick to try is mounting a volume over the $HOME/.npm location
[23:25]  <    bd808> should be a nice hack right? get an empty (or even persistent) $HOME/.npm cache of things for the runtime
[23:26]  <    bd808> but... Docker stabs me in the heart again by mounting the volume with root:root ownership
[23:26]  <    bd808> so the effective UID can't write there and also npm has some "oops I see root owned things in the directory, bye" logic
[23:27]  <    bd808> This is where I'm now out of ideas.
[23:28]  <    bd808> "buy Raymond a macbook" is looking like an elegant, but difficult to scale, solution :/
[23:30]  <    bd808> The runtime npm error is https://phabricator.wikimedia.org/P17714
[23:30]  <    bd808> which comes with a helpful sudo fix, but of course the EUID user has no sudo rights so that's right out
[23:40]  <    dancy> Sounds like what you need most is to run code inside the container using the same uid as outside the container.
[23:42]  <    bd808> I'm trying a new idea for the $HOME/.npm issue and I think it may work. The idea is to create a /tmp/runtime-home as the EUID, export HOME=/tmp/runtime-home, and then ?? profit ??
[23:42]  <    bd808> I need to undo some other things I have tried here before I can see this work or fail.
[23:42]  <    dancy> Using a path under /tmp is a great workaround for the npm stuff..
[23:49]  <    bd808> omg... I think this may be working...
[23:55]  <    bd808> w00t w00t w00t. I haz a running container. It wrote new files to the mounted external directory. and the ownership looks ok inside and out.
[23:55]  <    dancy> Nice work
[23:55]  <    bd808> now to package up these changes as a patch and see if they work for Raymond
[23:56]  <    bd808> thank you dancy for being a rubber duck
[23:56]  <    dancy> np
[23:56]  <    bd808> https://en.wikipedia.org/wiki/Rubber_duck_debugging for anyone confused by that statement
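The runtime HOME idea from the conversation above can be sketched roughly as follows. The /tmp/runtime-home path comes from the chat log; the function name use_runtime_home is hypothetical, and the merged patches may wire this up differently (for example in the container's entrypoint or the docker-compose config).

```shell
# Point HOME at a directory created by the current (effective) user, so
# that tools such as npm, which keep their cache under $HOME/.npm, get a
# location the runtime UID owns and can write to.
use_runtime_home() {
    runtime_home="${1:-/tmp/runtime-home}"
    mkdir -p "$runtime_home" || return 1
    HOME="$runtime_home"
    export HOME
}

# Usage inside the container, before running npm:
#   use_runtime_home && npm install
```

Because the directory is created at container start by whichever UID the container runs as, it sidesteps both the build-time cache owned by UID 65533 and the root:root ownership of a mounted volume.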

Change 737809 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[wikimedia/toolhub@main] dev: Run web and nodejs containers as host UID:GID

https://gerrit.wikimedia.org/r/737809

Change 737531 merged by jenkins-bot:

[wikimedia/toolhub@main] dev: Move prometheus profile to separate docker-compose config

https://gerrit.wikimedia.org/r/737531

Change 737532 merged by jenkins-bot:

[wikimedia/toolhub@main] dev: Move oauth profile to separate docker-compose config

https://gerrit.wikimedia.org/r/737532

bd808 moved this task from In Progress to Review on the Toolhub board.

Change 737809 merged by jenkins-bot:

[wikimedia/toolhub@main] dev: Run web and nodejs containers as host UID:GID

https://gerrit.wikimedia.org/r/737809

bd808 updated the task description.

@Raymond_Ndibe reports being able to initialize and use the dev environment following these changes.

Change 738588 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[wikimedia/toolhub@main] dev: Use runtime HOME override in web container

https://gerrit.wikimedia.org/r/738588

Change 738588 merged by jenkins-bot:

[wikimedia/toolhub@main] dev: Use runtime HOME override in web container

https://gerrit.wikimedia.org/r/738588

Change 749220 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Bump container version to 2021-12-20-122341-production

https://gerrit.wikimedia.org/r/749220

Change 749220 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Bump container version to 2021-12-23-121200-production

https://gerrit.wikimedia.org/r/749220