
Get coverage artifacts from Kokkuri
Closed, Resolved, Public

Description

We use Kokkuri for CI in mediawiki event enrichment. We need GitLab to be able to retrieve coverage reports from the container built by Kokkuri.

In this patch, coverage reports are generated into /srv/app.

Event Timeline


Ok so just recounting my experiments:

I used a build flag to copy the insides of the kokkuri container into the gitlab container so we can get the artifacts:

BUILDCTL_BUILD_FLAGS: --output type=local,dest=/tmp

However, this hangs and then the CI job times out. It might just be that the files are too large, so I tried a multi-stage build with a scratch image.

test: # Named this test so I don't have to change gitlab CI for these experiments, should really be called something like test-with-coverage
  base: scratch
  copies:
   - from: before-test # Original renamed test pipeline
     source: /srv/app
     destination: /tmp

However, this doesn't run the tests, since the test pipeline uses an entrypoint, which isn't executed in the middle of a multi-stage build. I replaced that with the only other way I found to run commands in Blubber, which is through builder:

builder:
  command: [make, tests]

This then blew up at an unexpected point:

#7 [before-test  1/11] FROM docker-registry.wikimedia.org/flink:1.17.0-wmf0@sha256:fd57e8da5453205b9d3155277063f26f2e2aa981009893d68a2d43348801c79c
#7 resolve docker-registry.wikimedia.org/flink:1.17.0-wmf0@sha256:fd57e8da5453205b9d3155277063f26f2e2aa981009893d68a2d43348801c79c 0.1s done
#7 DONE 0.1s
#8 [test 1/3] RUN (getent group "186" || groupadd -o -g "186" -r "flink") && (getent passwd "186" || useradd -l -o -m -d "/home/flink" -r -g "186" -u "186" "flink") && mkdir -p "/srv/app" && chown "186":"186" "/srv/app" && mkdir -p "/opt/lib" && chown "186":"186" "/opt/lib"
#8 ERROR: process "/bin/sh -c (getent group \"$LIVES_GID\" || groupadd -o -g \"$LIVES_GID\" -r \"$LIVES_AS\") && (getent passwd \"$LIVES_UID\" || useradd -l -o -m -d \"/home/$LIVES_AS\" -r -g \"$LIVES_GID\" -u \"$LIVES_UID\" \"$LIVES_AS\") && mkdir -p \"/srv/app\" && chown \"$LIVES_UID\":\"$LIVES_GID\" \"/srv/app\" && mkdir -p \"/opt/lib\" && chown \"$LIVES_UID\":\"$LIVES_GID\" \"/opt/lib\"" did not complete successfully: unable to find user 0: invalid argument
#9 [internal] helper image for file operations
#9 resolve docker.io/docker/dockerfile-copy:v0.1.9@sha256:e8f159d3f00786604b93c675ee2783f8dc194bb565e61ca5788f6a6e9d304061 0.1s done
#9 CANCELED
------
 > [test 1/3] RUN (getent group "186" || groupadd -o -g "186" -r "flink") && (getent passwd "186" || useradd -l -o -m -d "/home/flink" -r -g "186" -u "186" "flink") && mkdir -p "/srv/app" && chown "186":"186" "/srv/app" && mkdir -p "/opt/lib" && chown "186":"186" "/opt/lib":
------
error: failed to solve: process "/bin/sh -c (getent group \"$LIVES_GID\" || groupadd -o -g \"$LIVES_GID\" -r \"$LIVES_AS\") && (getent passwd \"$LIVES_UID\" || useradd -l -o -m -d \"/home/$LIVES_AS\" -r -g \"$LIVES_GID\" -u \"$LIVES_UID\" \"$LIVES_AS\") && mkdir -p \"/srv/app\" && chown \"$LIVES_UID\":\"$LIVES_GID\" \"/srv/app\" && mkdir -p \"/opt/lib\" && chown \"$LIVES_UID\":\"$LIVES_GID\" \"/opt/lib\"" did not complete successfully: unable to find user 0: invalid argument
2023-05-30 18:51:10,670 Command '['buildctl', '--timeout', '3600', '--wait-for-ready', '3600', 'build', '--progress=plain', '--frontend=gateway.v0', '--opt', 'source=docker-registry.wikimedia.org/repos/releng/blubber/buildkit:v0.16.0', '--local', 'context=.', '--local', 'dockerfile=.', '--opt', 'filename=.pipeline/blubber.yaml', '--opt', 'target=test', '--output', 'type=local,dest=/tmp', '--opt', 'run-variant=true', '--opt', 'entrypoint-args=[]']' returned non-zero exit status 1.

When I convert my experimental Blubber file into a Dockerfile, I get:

FROM docker-registry.wikimedia.org/flink:1.17.0-wmf0 AS before-test
USER 0
ENV HOME="/root"
ENV DEBIAN_FRONTEND="noninteractive"
RUN apt-get update && apt-get install -y "python3-pip" "curl" "make" "git" && rm -rf /var/lib/apt/lists/*
RUN python3 "-m" "pip" "install" "-U" "setuptools!=60.9.0" && python3 "-m" "pip" "install" "-U" "wheel" "tox" "pip"
ARG LIVES_AS="flink"
ARG LIVES_UID=186
ARG LIVES_GID=186
RUN (getent group "$LIVES_GID" || groupadd -o -g "$LIVES_GID" -r "$LIVES_AS") && (getent passwd "$LIVES_UID" || useradd -l -o -m -d "/home/$LIVES_AS" -r -g "$LIVES_GID" -u "$LIVES_UID" "$LIVES_AS") && mkdir -p "/srv/app" && chown "$LIVES_UID":"$LIVES_GID" "/srv/app" && mkdir -p "/opt/lib" && chown "$LIVES_UID":"$LIVES_GID" "/opt/lib"
ARG RUNS_AS="flink"
ARG RUNS_UID=186
ARG RUNS_GID=186
RUN (getent group "$RUNS_GID" || groupadd -o -g "$RUNS_GID" -r "$RUNS_AS") && (getent passwd "$RUNS_UID" || useradd -l -o -m -d "/home/$RUNS_AS" -r -g "$RUNS_GID" -u "$RUNS_UID" "$RUNS_AS")
USER $LIVES_UID
ENV HOME="/home/flink"
WORKDIR "/srv/app"
COPY --chown=$LIVES_UID:$LIVES_GID ["requirements.txt", "requirements-test.txt", "./"]
ENV PIP_FIND_LINKS="file:///opt/lib/python" PIP_WHEEL_DIR="/opt/lib/python"
RUN mkdir -p "/opt/lib/python"
RUN python3 "-m" "pip" "wheel" "-r" "requirements.txt" "-r" "requirements-test.txt" && python3 "-m" "pip" "install" "--target" "/opt/lib/python/site-packages" "-r" "requirements.txt" "-r" "requirements-test.txt"
ENV PATH="/opt/lib/python/site-packages/bin:${PATH}" PYTHONPATH="/opt/lib/python/site-packages"
COPY --chown=$LIVES_UID:$LIVES_GID ["./", "."]
RUN make "tests"
COPY --chown=$LIVES_UID:$LIVES_GID --from=docker-registry.wikimedia.org/flink:1.17.0-wmf0 ["/opt/flink/opt/flink-s3-fs-presto-1.17.0.jar", "/usr/local/lib/python3.9/dist-packages/pyflink/plugins/s3-fs-presto/flink-s3-fs-presto-1.17.0.jar"]
ENV PIP_NO_INDEX="1"

FROM scratch AS test
USER 0
ENV HOME="/root"
ARG LIVES_AS="somebody"
ARG LIVES_UID=65533
ARG LIVES_GID=65533
RUN (getent group "$LIVES_GID" || groupadd -o -g "$LIVES_GID" -r "$LIVES_AS") && (getent passwd "$LIVES_UID" || useradd -l -o -m -d "/home/$LIVES_AS" -r -g "$LIVES_GID" -u "$LIVES_UID" "$LIVES_AS") && mkdir -p "/srv/app" && chown "$LIVES_UID":"$LIVES_GID" "/srv/app" && mkdir -p "/opt/lib" && chown "$LIVES_UID":"$LIVES_GID" "/opt/lib"
ARG RUNS_AS="runuser"
ARG RUNS_UID=900
ARG RUNS_GID=900
RUN (getent group "$RUNS_GID" || groupadd -o -g "$RUNS_GID" -r "$RUNS_AS") && (getent passwd "$RUNS_UID" || useradd -l -o -m -d "/home/$RUNS_AS" -r -g "$RUNS_GID" -u "$RUNS_UID" "$RUNS_AS")
USER $LIVES_UID
ENV HOME="/home/somebody"
WORKDIR "/srv/app"
COPY --chown=$LIVES_UID:$LIVES_GID --from=before-test ["/srv/app", "/artifacts"]
USER $RUNS_UID
ENV HOME="/home/$RUNS_AS"

LABEL blubber.variant="test" blubber.version="0.18.0+0734952"

Two things stand out to me:

  • One of the RUN commands is missing a closing parenthesis, which I think is what's causing the error (did I make a typo somewhere?)
  • make tests now runs before all the files are copied over, so builder isn't the right option to use. I don't know if there's a way to run a command at the very end of a stage.

@dancy am I missing something?

JArguello-WMF triaged this task as Medium priority.
JArguello-WMF moved this task from Backlog to Sprint 14 B on the Event-Platform board.

@tchin Just wanted to acknowledge receipt of your message though I wasn't able to focus on it today.

@tchin an alternative path for coverage reporting could be integrating with https://gitlab.wikimedia.org/repos/releng/docpub/-/blob/main/README.md and linking the published coverage report from GitLab (we'd lose metric reporting in the badge, but so be it).

Hm, when will we be publishing docs? I had assumed just on tag releases? The coverage is probably more useful on main?

Ok so just recounting my experiments:

I used a build flag to copy the insides of the kokkuri container into the gitlab container so we can get the artifacts:

BUILDCTL_BUILD_FLAGS: --output type=local,dest=/tmp

However, this hangs and then the CI job times out.

Can you point me to the job output showing the hang?

I'm wondering if the hang is due to some interaction between the --wait-for-ready implementation we added to our buildkit fork and the --output type=local option. I'll run some tests locally to confirm or disprove this theory.

In the meantime, there are some issues with the current approach that can be ironed out. Who knows, they may have something to do with the hang as well.

First, using --output type=local,dest=/tmp causes the client to request and output the entire container filesystem to /tmp. That's going to be gigabytes in size and definitely not what you want to store in the GitLab artifact store, which is super slow at storing and retrieving in general. The workaround for this in Blubber is usually to use a copies.from in combination with a scratch variant, to isolate only the files that you really want:

version: v4
variants:
  build-my-thing:
    base: some-big-build-environment
    builders:
      # ...
  just-my-build-artifact:
    base: ~ # explicitly denotes a scratch image (~ is yaml for null)
    copies:
      - from: build-my-thing
        source: /srv/app/my-build-artifact
        destination: /my-build-artifact

If you were to build just-my-build-artifact with --output type=local,dest=/tmp you would only see the my-build-artifact file in that directory, not a massive image filesystem.

However, this isn't the right approach here because we're relying on the --opt run-variant=true feature that we hacked into Blubber to run the entrypoint. That entrypoint process runs last and must be that of the target variant, not of a variant that is referenced by a copies.from directive.

I think the correct approach here is to move to a pattern of building your test variant in one GitLab CI stage and executing it in another (a pattern which we need to document ASAP). That would look something like this in your CI config:

build-test-runner:
  extends: .kokkuri:build-and-publish-image
  stage: test
  variables:
    BUILD_VARIANT: test
  tags:
    - kubernetes
run-tests:
  image: ${BUILD_TEST_RUNNER_IMAGE_REF}
  stage: test
  needs: [build-test-runner]
  script:
    - make tests
  # Match coverage total from job log output.
  # See: https://docs.gitlab.com/ee/ci/yaml/index.html#coverage
  coverage: '/^TOTAL.+?(\d+\%)$/'
  # Publish coverage.xml as an artifact so it can be used in the GitLab CI UI.
  artifacts:
    when: always
    reports:
      coverage_report:
        coverage_format: cobertura
        path: coverage.xml
      junit: junit_pytest_report.xml
  tags:
    - kubernetes
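The coverage pattern in the job above can be sanity-checked locally before pushing. The following is only a sketch: it assumes Python's re module treats this simple pattern the same way GitLab's regex engine does, and the sample log is a made-up excerpt in the style of a coverage.py terminal report, not real output from this project.

```python
import re
from typing import Optional

# Same pattern as the `coverage:` keyword above, without the surrounding slashes.
COVERAGE_PATTERN = re.compile(r"^TOTAL.+?(\d+\%)$", re.MULTILINE)

# Hypothetical job log excerpt (file names and numbers are invented).
SAMPLE_LOG = """\
Name                 Stmts   Miss  Cover
----------------------------------------
app/enrich.py          120     12    90%
app/util.py             30      0   100%
----------------------------------------
TOTAL                  150     12    92%
"""


def extract_total(log: str) -> Optional[str]:
    """Return the percentage captured from the last TOTAL line, or None."""
    matches = COVERAGE_PATTERN.findall(log)
    return matches[-1] if matches else None


print(extract_total(SAMPLE_LOG))  # prints 92%
```

If the test runner prints its summary in a different shape (for example without a TOTAL line), the pattern would need to be adjusted accordingly.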

A .kokkuri:build-and-publish-image job builds the given variant, pushes it to the image repository that is available in the runner's environment, and exports a new variable containing the published image ref; the variable is named after the job and suffixed with _IMAGE_REF. Jobs in subsequent stages (or those that are downstream according to needs) can reference the *_IMAGE_REF variable in the job's image field, thus running the Blubber-built variant in a normal GitLab CI context. This allows you to make use of all the normal GitLab-y features like reports and artifacts.
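To make the variable naming concrete, here is a small sketch of how the name could be derived from a job name. The exact normalization is an assumption inferred from the single build-test-runner / BUILD_TEST_RUNNER_IMAGE_REF example above (upper-casing, and replacing characters that are not valid in a variable name with underscores); image_ref_variable is a hypothetical helper, not part of Kokkuri.

```python
import re


def image_ref_variable(job_name: str) -> str:
    """Guess the variable exported by a build-and-publish job.

    Assumption: the job name is upper-cased and every character outside
    [A-Za-z0-9_] becomes an underscore, then _IMAGE_REF is appended.
    """
    normalized = re.sub(r"[^A-Za-z0-9_]", "_", job_name).upper()
    return f"{normalized}_IMAGE_REF"


print(image_ref_variable("build-test-runner"))  # prints BUILD_TEST_RUNNER_IMAGE_REF
```

When in doubt, printing the environment in a downstream job is a more reliable way to confirm the actual variable name than relying on this guess.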

For the moment, we only have image caching repos available on the DigitalOcean runner infrastructure (and technically in prod, but you don't want to populate our repo with a bunch of test runner images), so this pattern is only going to work there, meaning you need to add a kubernetes tag to your jobs (but not the one with the trusted tag, of course).

I hope this all makes sense. Feel free to reach out on IRC or Slack if you want more info about it. We're working on documenting this pattern as well.