
Error when using multi arch build on gitlab with blubber and kokkuri
Open, Needs Triage, Public

Description

While trying to build a multi-arch elasticsearch image (linux/amd64, linux/arm64) for developers working on CirrusSearch, I encountered the following error:

#12 [linux/arm64 2/5] RUN (getent group "1000" || groupadd -o -g "1000" -r "elasticsearch") && (getent passwd "1000" || useradd -l -o -m -d "/home/elasticsearch" -r -g "1000" -u "1000" "elasticsearch") && mkdir -p "/usr/share/elasticsearch" && chown "1000":"1000" "/usr/share/elasticsearch" && mkdir -p "/opt/lib" && chown "1000":"1000" "/opt/lib"
#0 1.207 standard_init_linux.go:219: exec user process caused: exec format error
#12 ERROR: process "/bin/sh -c (getent group \"$LIVES_GID\" || groupadd -o -g \"$LIVES_GID\" -r \"$LIVES_AS\") && (getent passwd \"$LIVES_UID\" || useradd -l -o -m -d \"/home/$LIVES_AS\" -r -g \"$LIVES_GID\" -u \"$LIVES_UID\" \"$LIVES_AS\") && mkdir -p \"/usr/share/elasticsearch\" && chown \"$LIVES_UID\":\"$LIVES_GID\" \"/usr/share/elasticsearch\" && mkdir -p \"/opt/lib\" && chown \"$LIVES_UID\":\"$LIVES_GID\" \"/opt/lib\"" did not complete successfully: exit code: 1
#11 [linux/amd64 2/5] RUN (getent group "1000" || groupadd -o -g "1000" -r "elasticsearch") && (getent passwd "1000" || useradd -l -o -m -d "/home/elasticsearch" -r -g "1000" -u "1000" "elasticsearch") && mkdir -p "/usr/share/elasticsearch" && chown "1000":"1000" "/usr/share/elasticsearch" && mkdir -p "/opt/lib" && chown "1000":"1000" "/opt/lib"
#11 1.161 elasticsearch:x:1000:
#11 1.164 elasticsearch:x:1000:1000::/usr/share/elasticsearch:/bin/bash
#11 DONE 1.2s

This seems to suggest that the GitLab runner is not able to emulate the arm64 architecture.
Full build: https://gitlab.wikimedia.org/repos/search-platform/cirrussearch-elasticsearch-image/-/jobs/89862

Locally, the arm64 image does seem to build properly with Blubber using buildx, after installing tonistiigi/binfmt:

docker run --privileged --rm tonistiigi/binfmt --install all
docker buildx build --platform=linux/arm64 --target devel --tag docker-registry.wikimedia.org/repos/search-platform/cirrussearch-elasticsearch-image:mylocalbuild  -f .pipeline/blubber.yaml .

Looking at other builds in GitLab, in particular the blubber build-frontend, I believe it is possible to build an image for arm64, but it's not clear to me how this has to be set up.

I know that there might be blockers further down the road, with issues like T322453, but I believe that I'm hitting an issue earlier, at the build stage.

Event Timeline

@dcausse thanks for filing this.

Yes, I've been trying to get Blubber's multiplatform image published for a while now, but the problem there was related to the config of our registry's nginx frontend, a separate issue from the one you're seeing.

In BuildKit frontend gateways (the component responsible for ingesting the user-provided image config and turning it into low-level build (LLB) instructions) there is a build platform and a target platform. The former is inferred from the first available worker in buildkitd's worker pool. The latter is provided by the client. In our case, the build platform is always going to be linux/amd64 unless we roll out CI nodes of a different arch, which we're not planning to do.

The error you're seeing occurs when a binary executed during the build is of a different arch than the build platform and the binfmt_misc module is not loaded and configured. I'm not certain, but I believe binfmt_misc must be set up either on the build container's host system or from a container with sufficient privileges to /proc/sys/fs, which is what your use of docker run --privileged --rm tonistiigi/binfmt --install all and the BuildKit docs suggest. It would take some research and experimentation to see whether we could get that working on the DigitalOcean cloud and/or WMF trusted runners.
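
As a rough illustration of the check involved, the sketch below looks for a registered and enabled arm64 handler under /proc/sys/fs/binfmt_misc. It assumes the handler is named qemu-aarch64, which is the name tonistiigi/binfmt registers; the function name has_arm64_emulation and its optional directory argument are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Sketch: report whether an arm64 binfmt_misc handler is registered and enabled.
# Assumption: the handler entry is named "qemu-aarch64" (as registered by
# tonistiigi/binfmt). The directory argument is a hypothetical convenience
# so the check can be pointed at a different location.
has_arm64_emulation() {
  dir="${1:-/proc/sys/fs/binfmt_misc}"
  entry="$dir/qemu-aarch64"
  # No readable entry: no handler registered, or binfmt_misc is not mounted.
  [ -r "$entry" ] || return 1
  # The first line of a handler entry is "enabled" or "disabled".
  state=$(head -n 1 "$entry" 2>/dev/null)
  [ "$state" = "enabled" ]
}

if has_arm64_emulation; then
  echo "arm64 emulation available"
else
  echo "arm64 emulation NOT available"
fi
```

Run on a CI runner (or inside the build container, if /proc/sys/fs/binfmt_misc is visible there), this would indicate whether RUN steps for linux/arm64 can be expected to hit the exec format error.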

In the meantime, the reason our Blubber multi-platform build works is that it relies solely on cross-compiling and doesn't need to _execute_ any non-amd64 binaries. It makes use of the environment variables that BuildKit exposes to build processes, which specify details of both the build platform and the target platform (OS, arch, ARM variant, etc.).

See https://gitlab.wikimedia.org/repos/releng/blubber/-/blob/0734952947504560483c65edbc7f0da4634dfbdb/Makefile#L22 and https://docs.docker.com/engine/reference/builder/#automatic-platform-args-in-the-global-scope
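
To illustrate the cross-compiling idea: a build script running on the amd64 build platform can translate BuildKit's automatic platform args (TARGETOS, TARGETARCH, TARGETVARIANT) into toolchain settings instead of executing any target-arch binary. The sketch below maps them to Go's GOOS/GOARCH/GOARM variables. It is not taken from Blubber's Makefile; the function name cross_env is hypothetical, and it assumes the platform args have already been exported into the environment of the build step.

```shell
#!/bin/sh
# Sketch: derive Go cross-compilation settings from BuildKit platform args.
# Assumptions: TARGETOS/TARGETARCH/TARGETVARIANT are present in the
# environment; when they are not, the build-platform values used in this
# thread (linux/amd64) serve as defaults. cross_env is a hypothetical name.
cross_env() {
  os="${TARGETOS:-linux}"
  arch="${TARGETARCH:-amd64}"
  variant="${TARGETVARIANT:-}"
  out="GOOS=$os GOARCH=$arch"
  # 32-bit ARM targets carry a variant such as "v7", which Go expects as GOARM=7.
  if [ "$arch" = "arm" ] && [ -n "$variant" ]; then
    out="$out GOARM=${variant#v}"
  fi
  printf '%s\n' "$out"
}
```

A build step could then run something like `env $(cross_env) go build ./...`, so the compiler itself runs as an amd64 binary while producing output for the target platform.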

I hope that helps in the case you're able to cross compile ES without emulation, but in either case I'd be down to pair on this with you and see if we can't get some solution working.

Aklapper renamed this task from Error when using mutli arch build on gitlab with blubber and kokkuri to Error when using multi arch build on gitlab with blubber and kokkuri. (Apr 8 2023, 9:29 AM)

@dduvall thanks for the clarifications, this does help a lot!

My build is very minimal, as everything is already built via another repo (Java jar files, and thus already multi-arch) and we just need to unzip a deb package into a particular folder.
I tried to move all the custom build steps into a build variant and then use the "copies" keyword for the production variant (the one I want to be multi-arch), but I still have trouble with:

  • Blubber's lives and runs config blocks seem to generate a command like:
    • RUN (getent group "65533" || groupadd -o -g "65533" -r "somebody") && (getent passwd "65533" || useradd -l -o -m -d "/home/somebody" -r -g "65533" -u "65533" "somebody") && mkdir -p "/usr/share/elasticsearch/plugins" && chown "65533":"65533" "/usr/share/elasticsearch/plugins" && mkdir -p "/opt/lib" && chown "65533":"65533" "/opt/lib"
    • which I think is executed within the arm64 image and then causes the error: standard_init_linux.go:219: exec user process caused: exec format error. My understanding is that if I want to use copies I must set lives.in. In my case all the folders and users already exist, so would there be a way to skip this detection and tell Blubber to trust my image?
  • I'm not quite clear on how to define the arch in a multi-stage build scenario. As I understand it, what I need is to:
    • run the build variant on amd64
    • reuse that variant and build the production variant for arm64 and amd64

I tried this with kokkuri:

test-arm64:
  extends: .kokkuri:build-image
  stage: test
  variables:
    BUILD_VARIANT: production
    BUILD_TARGET_PLATFORMS: linux/arm64

For reference, https://gitlab.wikimedia.org/repos/search-platform/cirrussearch-elasticsearch-image/-/merge_requests/2 is the MR where I tried to switch from the include logic to the copies logic, to minimize the commands run when building the production image.

I'd be happy to pair on this if you have time :)

Thanks!

  • which I think is executed within the arm64 image and then causes the error: standard_init_linux.go:219: exec user process caused: exec format error. My understanding is that if I want to use copies I must set lives.in. In my case all the folders and users already exist, so would there be a way to skip this detection and tell Blubber to trust my image?

This is making a lot more sense to me now. Another reason the Blubber multi-platform build probably works and this one doesn't is that Blubber's variant is based on a scratch image (no base), and user/group creation is skipped in that case. So yeah, if we had an option to skip the user/group creation, that would be another viable workaround.

The better long-term solution is still likely going to be getting binfmt_misc working, but we can tackle this from a couple of different angles.

I'd be happy to pair on this if you have time :)

Thanks!

No problem! I don't have much time this week, but I'll throw something on the calendar for next week. Feel free to suggest other days/times. It doesn't seem like we have much overlap in working hours, but I'm sure we'll figure something out. :)