
Summarize experiments with buildkit based MediaWiki image builds
Closed, Resolved (Public)

Description

As part of the larger set of experiments around containerizing MediaWiki, I've developed a proof of concept within the Blubber project (a utility called "flense") that would provide for the composition and building of a single MediaWiki version with all of its components (extensions, skins, vendor) included in a way that's similar to what mediawiki/tools/release provides but for OCI/Docker images.

Abstract

Implement a process for building single-version MediaWiki images that:

  1. Integrates all MW components (extensions, skins, vendor) from a local context of source files (single-version mediawiki directory hierarchy).
  2. Allows build step injection from any component.
  3. Delegates to component Blubber configuration where possible, avoiding duplication.
  4. Can reuse caches from upstream execution of component build steps.
  5. Is relatively fast (for a CI job, < 5 min).

Implementation of the above process uses the existing Blubber codebase but introduces a new subcommand and module called flense and flenser (respectively). The new feature relies on previously implemented experimental support for BuildKit, Moby's "next generation" image build platform.

Results were mixed. Delegation and caching worked quite well, with component subgraphs being based entirely on their own Blubber configurations and atomic with regard to cache key computation. Dispatch of the overall build graph, however, suffered from poor performance due to the transfer of a large local context for all of the components (~ 2G). A couple of issues with BuildKit itself (bugs and limitations) were also discovered.

Another similar approach could be attempted that pulls component sources from Git instead of the local filesystem (avoiding the slowness of context transfer). However, the experimental nature of BuildKit is a serious concern.

Method

Flense implementation

Implementation of the process was based on the existing Blubber codebase, with an experimental patch for BuildKit support applied. A new subcommand and module called flense and flenser were implemented within the project to perform the following:

  1. Read in a local static YAML manifest file that describes a number of component sources relative to the current working directory.
  2. Build each component according to its builder type.
    1. copy type components simply have their source directory copied as is to the final image.
    2. blubber type components have their Blubber configs read and the given variant compiled as an LLB subgraph, allowing for complete delegation of any defined build step according to configuration provided in the subcomponent source.
  3. Copy the results of all components to a destination directory within a final scratch image.
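The three steps above can be sketched roughly as follows. This is an illustrative Python model and not the Go code in the flenser module: the Component type, the plan function, and the operation tuples are hypothetical names, and the real tool emits LLB operations rather than tuples.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Component:
    source: str                        # path relative to the build context
    destination: str                   # path relative to the destination root
    builder: str = "copy"              # "copy" or "blubber"
    config_path: Optional[str] = None  # only used by the blubber builder
    variant: Optional[str] = None      # only used by the blubber builder


def plan(components: List[Component], root: str) -> List[Tuple]:
    """Return an ordered list of abstract build operations (hypothetical)."""
    ops: List[Tuple] = []
    for c in components:
        target = root if c.destination == "." else f"{root}/{c.destination}"
        if c.builder == "copy":
            # copy type: the source directory goes to the final image as is
            ops.append(("copy", c.source, target))
        elif c.builder == "blubber":
            # blubber type: compile the component's own variant first,
            # then copy its result into the final image
            ops.append(("blubber", c.source, c.config_path, c.variant))
            ops.append(("copy-result", c.source, target))
        else:
            raise ValueError(f"unknown builder type: {c.builder!r}")
    return ops


components = [
    Component("mediawiki", "."),
    Component("extensions/Popups", "extensions/Popups", "blubber",
              ".pipeline/blubber.yaml", "distribution"),
]
for op in plan(components, "/srv/mediawiki"):
    print(op)
```

Keeping the blubber step separate from the final copy mirrors the idea that a component's build is delegated entirely to its own configuration.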

The flense CLI can be compiled and installed from the experimental Blubber patchset after first installing Go tools:

git init src/blubber
cd src/blubber
git fetch "https://gerrit.wikimedia.org/r/blubber" refs/changes/62/640562/6 && git checkout FETCH_HEAD
make clean install

Build context and manifest

The existing wikimedia/production repo was re-used in this experiment as the local build context as it already contains all subcomponent sources as submodules and could be modified to include a flense manifest. (This repo was set up in 2019 as a Gerrit superproject to comprise an up-to-date source for all production-deployed MediaWiki core, extensions, skins, and vendor. It relies on an accurately maintained .gitmodules to track its subprojects.)

A manifest was added to wikimedia/production in another WIP patchset (https://gerrit.wikimedia.org/r/c/wikimedia/production/+/642565) and includes components for MediaWiki core, all extensions, skins, and vendor. All components except extensions/Popups are of the copy type.

destination:
  root: /srv/mediawiki
  exclude:
    - .git
    - docs/
    - node_modules/
defaults:
  builder: copy
components:
  - source: mediawiki
    destination: .
  - extensions/3D
  - extensions/AbuseFilter
# [...]
  - source: extensions/Popups
    builder: blubber
    configPath: .pipeline/blubber.yaml
    variant: distribution
# [...]
  - skins/CologneBlue
# [...]
  - vendor
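
One plausible reading of the manifest above, sketched in Python on an already-parsed structure: bare string entries are treated as shorthand for a mapping with only source set, and the defaults block fills in any missing fields. The normalize function and the rule that destination falls back to the source path are assumptions made for illustration, not confirmed behaviour of flense.

```python
from typing import Any, Dict, List


def normalize(manifest: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Expand shorthand component entries and apply manifest defaults."""
    defaults = manifest.get("defaults", {})
    result = []
    for entry in manifest.get("components", []):
        # Assumption: a bare string is shorthand for {"source": <string>}.
        fields = {"source": entry} if isinstance(entry, str) else dict(entry)
        # Explicit fields win over defaults.
        component = {**defaults, **fields}
        # Assumption: destination defaults to the source path.
        component.setdefault("destination", component["source"])
        result.append(component)
    return result


manifest = {
    "destination": {"root": "/srv/mediawiki"},
    "defaults": {"builder": "copy"},
    "components": [
        {"source": "mediawiki", "destination": "."},
        "extensions/3D",
        {"source": "extensions/Popups", "builder": "blubber",
         "configPath": ".pipeline/blubber.yaml", "variant": "distribution"},
    ],
}

for component in normalize(manifest):
    print(component)
```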

Other aspects of the manifest to note are:

  1. A base destination directory is defined as destination.root.
  2. A set of patterns for files and directories to exclude from the final image can also be given as destination.exclude.
  3. Default fields for all components can be given as defaults.[component field].
  4. Each component has a source directory path, a destination in the final image relative to the top-level destination.root, and a builder type of either copy or blubber (as previously explained). Components built with Blubber additionally specify a configPath (the Blubber configuration file) and a variant (the variant to compile into a subgraph).

Blubber based build steps

Only one component (extensions/Popups) currently has a build step (and builder: blubber) defined here, but there is a growing need for other extensions to have something similar at some point prior to deployment (see T199004: RFC: Add a frontend build step to skins/extensions to our deploy process).

To support this experiment, a patchset was submitted to extensions/Popups that defines a Node based build step using Blubber configuration. The Blubber config is copied below.

version: v4
variants:
  prep:
    lives:
      in: /srv/mediawiki
    base: docker-registry.wikimedia.org/nodejs10-devel
    node:
      requirements: [package.json, package-lock.json]
    builder:
      command: [npm, run-script, build]
      requirements:
        - src/
        - resources/
        - webpack.config.js

  distribution:
    lives:
      in: /dist
    copies:
      - from: prep
        source: /srv/mediawiki
        destination: /dist

For these components, flense compiles the Blubber variant (in this case, distribution) to its own LLB graph before aggregating all subgraphs.

Aggregation and final build graph

After aggregating component subgraphs and appending the destination copy operations to produce the final image, flense encodes and outputs the entire LLB build graph to protobuf, a format readable by buildctl, the frontend CLI for BuildKit. Output can be piped directly to buildctl to dispatch the build, assuming a buildkitd instance is accessible and BUILDKIT_HOST has been set.

Working example (following compilation/installation of flense from before):

git init src/wikimedia/production
cd src/wikimedia/production
git fetch "https://gerrit.wikimedia.org/r/wikimedia/production" refs/changes/65/642565/1 && git checkout FETCH_HEAD
git submodule update --init --recursive --jobs 10
docker run --rm -d --name buildkitd --privileged moby/buildkit:master
export BUILDKIT_HOST=docker-container://buildkitd
flense .pipeline/manifest.yaml | buildctl build --local context=.
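
The final pipeline can also be driven from a script, which makes the BUILDKIT_HOST precondition explicit. The following is a minimal Python sketch that uses only the two commands shown above; dispatch_commands and dispatch are hypothetical helper names, and running dispatch assumes flense and buildctl are on PATH.

```python
import os
import subprocess
from typing import List, Mapping, Tuple


def dispatch_commands(manifest: str,
                      env: Mapping[str, str]) -> Tuple[List[str], List[str]]:
    """Build argv lists for `flense <manifest> | buildctl build --local context=.`."""
    if not env.get("BUILDKIT_HOST"):
        raise RuntimeError("BUILDKIT_HOST must point at a running buildkitd")
    return ["flense", manifest], ["buildctl", "build", "--local", "context=."]


def dispatch(manifest: str) -> int:
    """Pipe the LLB graph from flense into buildctl; return a non-zero status on failure."""
    producer_cmd, consumer_cmd = dispatch_commands(manifest, os.environ)
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout)
    # Close our copy of the pipe so flense sees a broken pipe if buildctl exits early.
    producer.stdout.close()
    consumer_status = consumer.wait()
    producer_status = producer.wait()
    return producer_status or consumer_status
```

Calling dispatch(".pipeline/manifest.yaml") from the root of the build context would then be equivalent to the last line of the shell example.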

Results

Build graph structure

Delegation to Blubber configurations in component repos allows components to define their own build steps, and the subgraphs resulting from Blubber variant compilation are roots of the overall aggregate graph, allowing them to be cached and executed independently from other components. The structure of the build graph can be clearly seen using buildctl debug dump-llb -dot.

flense .pipeline/manifest.yaml | buildctl debug dump-llb -dot | dot -Tpng > graph.png

[Figure: graph.png, rendering of the aggregate LLB build graph]

Build execution

The resulting LLB graph is successfully dispatched by buildctl to buildkitd workers and results in a final image that contains all MediaWiki components.

$ flense .pipeline/manifest.yaml | buildctl build --local context=.
[+] Building 312.2s (19/19) FINISHED
 => [prep  1/10] FROM docker-registry.wikimedia.org/nodejs10-devel@sha256:ab1cba594bc26b230e888f490f4bcf1e01ebecf725f785e62a2173f85d2  19.9s
 => => resolve docker-registry.wikimedia.org/nodejs10-devel@sha256:ab1cba594bc26b230e888f490f4bcf1e01ebecf725f785e62a2173f85d237f86     1.7s
 => => sha256:0ea645d5a1950c64eb4f8725ebf17ff13efd4f1f098650b6ec66be05ac667926 26.23MB / 26.23MB                                        3.3s
 => => sha256:e590e63468908ac9727df3a67c671d64cc04df520b6469cea29dd1ff229234c2 2.55kB / 2.55kB                                          0.1s
 => => sha256:5c86276767f395395304e7c56cf42cb728b6ebb4f0c75894c555883276ddbd74 24.14MB / 24.14MB                                        3.1s
 => => extracting sha256:5c86276767f395395304e7c56cf42cb728b6ebb4f0c75894c555883276ddbd74                                               7.4s
 => => extracting sha256:e590e63468908ac9727df3a67c671d64cc04df520b6469cea29dd1ff229234c2                                               0.2s
 => => extracting sha256:0ea645d5a1950c64eb4f8725ebf17ff13efd4f1f098650b6ec66be05ac667926                                               8.9s
 => [internal] helper image for file operations                                                                                         1.2s
 => => resolve docker.io/docker/dockerfile-copy:v0.1.9@sha256:e8f159d3f00786604b93c675ee2783f8dc194bb565e61ca5788f6a6e9d304061          3.2s
 => => sha256:f7b6696c3fee7264ec4486cebe146a6a98aa8d1e46747843107ff473aada8d56 861.00kB / 861.00kB                                      0.3s
 => => sha256:df3b4bed1f63b36992540a09e0d10bd3f9d0b082d50810313841d745d7cce368 898.21kB / 898.21kB                                      0.5s
 => => extracting sha256:df3b4bed1f63b36992540a09e0d10bd3f9d0b082d50810313841d745d7cce368                                               0.4s
 => => extracting sha256:f7b6696c3fee7264ec4486cebe146a6a98aa8d1e46747843107ff473aada8d56                                               0.3s
 => transfer local context for (193) copy components                                                                                   91.1s
 => => transferring context: 944.35MB                                                                                                  90.4s
 => transfer local context for component (extensions/Popups)                                                                            1.3s
 => => transferring context: 4.65MB                                                                                                     1.0s
 => preparing component (extensions/Popups) for blubber build                                                                           0.5s
 => [prep  2/10] RUN groupadd -o -g "65533" -r "somebody" && useradd -l -o -m -d "/home/somebody" -r -g "somebody" -u "65533" "somebo  23.6s
 => [prep  3/10] RUN groupadd -o -g "900" -r "runuser" && useradd -l -o -m -d "/home/runuser" -r -g "runuser" -u "900" "runuser"        1.2s
 => [prep  4/10] COPY --chown=65533:65533 [package.json, package-lock.json, ./]                                                         0.9s
 => [prep  5/10] RUN npm install                                                                                                      152.3s
 => [prep  6/10] RUN mkdir -p "resources/" "src/"                                                                                       2.2s
 => [prep  7/10] COPY --chown=65533:65533 [webpack.config.js, ./]                                                                       0.2s
 => [prep  8/10] COPY --chown=65533:65533 [resources, resources/]                                                                       0.2s
 => [prep  9/10] COPY --chown=65533:65533 [src, src/]                                                                                   0.3s
 => [prep 10/10] RUN npm "run-script" "build"                                                                                           8.4s
 => [distribution 1/1] COPY --from=prep [/srv/mediawiki, /dist]                                                                        16.4s
 => copying built components to final image                                                                                            37.8s
 => copying built components to final image                                                                                            10.6s
 => copying built components to final image                                                                                             6.1s
 => copying built components to final image                                                                                            20.1s

On my local system, the build takes a little over 5 minutes with a cold cache. Transfer of the local build context to buildkitd takes a considerable share of that time (about 90 seconds), likely due to the tarring and untarring of around 1G of data for all of core/extensions/skins/vendor. This cost is inherent to the client/daemon architecture of BuildKit (and Docker), in which the client must send the build context to the daemon before steps that depend on it can run.
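
To put the log in proportion, the short Python calculation below relates the longest steps to the total wall-clock time. The durations are copied from the cold-cache log; the share helper is a hypothetical name. BuildKit can run independent steps concurrently, so the shares are not additive.

```python
def share(seconds: float, total: float) -> float:
    """Return a step's duration as a percentage of total wall-clock time."""
    return 100 * seconds / total


# Durations in seconds, copied from the cold-cache build log.
total = 312.2
steps = {
    "context transfer (copy components)": 91.1,
    "npm install": 152.3,
    "copy to final image (4 steps)": 37.8 + 10.6 + 6.1 + 20.1,
}

for name, seconds in steps.items():
    print(f"{name}: {seconds:.1f}s ({share(seconds, total):.1f}% of total)")
```

Context transfer comes to roughly 29% of the cold-cache build and npm install to roughly 49%, which suggests that once the Blubber steps are cached, context transfer becomes the dominant cost.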

Caching

Caching is done according to the structure of the build graph. As was hoped, structuring subgraphs for Blubber based components as roots allows caching of component build steps independent of each other and the rest of the build graph. This was verified by:

  1. Clearing the cache (buildctl prune).
  2. Executing (and caching) the build step of just extensions/Popups with blubber --format=llb.
  3. Executing flense again in the root of wikimedia/production to build the final image.

$ cd extensions/Popups
$ blubber --format=llb .pipeline/blubber.yaml distribution # the same config/variant specified in the flense manifest.yaml
[+] Building 129.2s (13/13) FINISHED
 => [prep  1/10] FROM docker-registry.wikimedia.org/nodejs10-devel@sha256:ab1cba594bc26b230e888f490f4bcf1e01ebecf725f785e62a2173f85d23  6.5s
 => => resolve docker-registry.wikimedia.org/nodejs10-devel@sha256:ab1cba594bc26b230e888f490f4bcf1e01ebecf725f785e62a2173f85d237f86     0.1s
 => => sha256:e590e63468908ac9727df3a67c671d64cc04df520b6469cea29dd1ff229234c2 2.55kB / 2.55kB                                          0.0s
 => => sha256:5c86276767f395395304e7c56cf42cb728b6ebb4f0c75894c555883276ddbd74 24.14MB / 24.14MB                                        2.7s
 => => sha256:0ea645d5a1950c64eb4f8725ebf17ff13efd4f1f098650b6ec66be05ac667926 26.23MB / 26.23MB                                        2.3s
 => => extracting sha256:5c86276767f395395304e7c56cf42cb728b6ebb4f0c75894c555883276ddbd74                                               1.7s
 => => extracting sha256:e590e63468908ac9727df3a67c671d64cc04df520b6469cea29dd1ff229234c2                                               0.0s
 => => extracting sha256:0ea645d5a1950c64eb4f8725ebf17ff13efd4f1f098650b6ec66be05ac667926                                               2.0s
 => [internal] helper image for file operations                                                                                         0.0s
 => => resolve docker.io/docker/dockerfile-copy:v0.1.9@sha256:e8f159d3f00786604b93c675ee2783f8dc194bb565e61ca5788f6a6e9d304061          0.0s
 => [internal] load build context                                                                                                       0.3s
 => => transferring context: 1.35MB                                                                                                     0.2s
 => [prep  2/10] RUN groupadd -o -g "65533" -r "somebody" && useradd -l -o -m -d "/home/somebody" -r -g "somebody" -u "65533" "somebod  7.5s
 => [prep  3/10] RUN groupadd -o -g "900" -r "runuser" && useradd -l -o -m -d "/home/runuser" -r -g "runuser" -u "900" "runuser"        0.2s
 => [prep  4/10] COPY --chown=65533:65533 [package.json, package-lock.json, ./]                                                         0.2s
 => [prep  5/10] RUN npm install                                                                                                       85.9s
 => [prep  6/10] RUN mkdir -p "resources/" "src/"                                                                                       1.7s
 => [prep  7/10] COPY --chown=65533:65533 [webpack.config.js, ./]                                                                       0.2s
 => [prep  8/10] COPY --chown=65533:65533 [resources, resources/]                                                                       0.2s
 => [prep  9/10] COPY --chown=65533:65533 [src, src/]                                                                                   0.2s
 => [prep 10/10] RUN npm "run-script" "build"                                                                                           6.5s
 => [distribution 1/1] COPY --from=prep [/srv/mediawiki, /dist]                                                                        17.1s
$ cd ../..
$ flense .pipeline/manifest.yaml | buildctl build --local context=.
[+] Building 192.6s (19/19) FINISHED
 => [prep  1/10] FROM docker-registry.wikimedia.org/nodejs10-devel@sha256:ab1cba594bc26b230e888f490f4bcf1e01ebecf725f785e62a2173f85d23  0.1s
 => => resolve docker-registry.wikimedia.org/nodejs10-devel@sha256:ab1cba594bc26b230e888f490f4bcf1e01ebecf725f785e62a2173f85d237f86     0.1s
 => transfer local context for (193) copy components                                                                                   90.5s
 => => transferring context: 944.35MB                                                                                                  90.2s
 => transfer local context for component (extensions/Popups)                                                                            0.9s
 => => transferring context: 4.65MB                                                                                                     0.8s
 => CACHED [internal] helper image for file operations                                                                                  0.0s
 => => resolve docker.io/docker/dockerfile-copy:v0.1.9@sha256:e8f159d3f00786604b93c675ee2783f8dc194bb565e61ca5788f6a6e9d304061          0.9s
 => preparing component (extensions/Popups) for blubber build                                                                           0.4s
 => CACHED [prep  2/10] RUN groupadd -o -g "65533" -r "somebody" && useradd -l -o -m -d "/home/somebody" -r -g "somebody" -u "65533" "  0.0s
 => CACHED [prep  3/10] RUN groupadd -o -g "900" -r "runuser" && useradd -l -o -m -d "/home/runuser" -r -g "runuser" -u "900" "runuser  0.0s
 => CACHED [prep  4/10] COPY --chown=65533:65533 [package.json, package-lock.json, ./]                                                  0.0s
 => CACHED [prep  5/10] RUN npm install                                                                                                 0.0s
 => CACHED [prep  6/10] RUN mkdir -p "resources/" "src/"                                                                                0.0s
 => CACHED [prep  7/10] COPY --chown=65533:65533 [webpack.config.js, ./]                                                                0.0s
 => CACHED [prep  8/10] COPY --chown=65533:65533 [resources, resources/]                                                                0.0s
 => CACHED [prep  9/10] COPY --chown=65533:65533 [src, src/]                                                                            0.0s
 => CACHED [prep 10/10] RUN npm "run-script" "build"                                                                                    0.0s
 => CACHED [distribution 1/1] COPY --from=prep [/srv/mediawiki, /dist]                                                                  0.0s
 => copying built components to final image                                                                                            40.4s
 => copying built components to final image                                                                                             9.7s
 => copying built components to final image                                                                                             5.4s
 => copying built components to final image                                                                                            18.9s
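
The cache hits in the second run are consistent with a content-addressed model in which a step's cache key depends only on its own operation and on the keys of its inputs. The Python sketch below is a deliberately simplified model of that idea, not BuildKit's actual cache key algorithm; the Graph type, the digest function, and the node names are all hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple

# node name -> (operation description, names of input nodes)
Graph = Dict[str, Tuple[str, List[str]]]


def digest(graph: Graph, name: str) -> str:
    """Hash a node's operation together with the digests of its inputs."""
    op, inputs = graph[name]
    h = hashlib.sha256(op.encode())
    for parent in inputs:
        h.update(digest(graph, parent).encode())
    return h.hexdigest()


# The Popups subgraph as built on its own.
popups: Graph = {
    "base": ("FROM nodejs10-devel", []),
    "install": ("RUN npm install", ["base"]),
    "build": ("RUN npm run-script build", ["install"]),
}

# The aggregate graph adds unrelated nodes but leaves the Popups nodes untouched.
aggregate: Graph = {
    **popups,
    "core": ("COPY mediawiki", []),
    "final": ("COPY results", ["core", "build"]),
}

# A change to an unrelated component only affects nodes downstream of it.
changed: Graph = {**aggregate, "core": ("COPY mediawiki (modified)", [])}

print(digest(popups, "build") == digest(aggregate, "build"))
print(digest(aggregate, "build") == digest(changed, "build"))
print(digest(aggregate, "final") == digest(changed, "final"))
```

In this model the key of the Popups build node is the same whether it is computed standalone or inside the aggregate graph, because nothing from other components feeds into it; only the final copy node, which depends on every component, changes when core changes.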

Conclusion

The flense tool as a proof of concept has potential, but its reliance on BuildKit also comes with some inherent risks.

Benefits

  1. The basic requirement of getting a single-version MediaWiki image built for production use was accomplished.
  2. Delegation to Blubber configurations in component repos means giving teams the ability to define their own build steps. (This also has risks.)
  3. Atomicity of subgraphs for Blubber based components means that teams can have their build steps executed (and cached) prior to the final image being built (post merge).
  4. The implementation is general enough to potentially satisfy other use cases (such as public MediaWiki image releases).
  5. BuildKit supports distributed caches and workers, which may be helpful in provisioning a build cluster.

Risks/drawbacks

  1. BuildKit is still quite new, and there are bugs. A bug that resulted in intermittent timeouts was discovered, filed, and worked around during this experiment.
  2. Transfer of the local build context is slow. However, this is likely to apply to Docker based builds as well.
  3. It's another home-grown tool to maintain.

Event Timeline

dduvall triaged this task as Medium priority. Nov 20 2020, 10:16 PM
dduvall created this task.

Change 642561 had a related patch set uploaded (by Dduvall; owner: Dduvall):
[mediawiki/extensions/Popups@master] experimental: Define a blubber variant for distribution build step

https://gerrit.wikimedia.org/r/642561

Change 642565 had a related patch set uploaded (by Dduvall; owner: Dduvall):
[wikimedia/production@master] experimental: Include blubber flense manifest

https://gerrit.wikimedia.org/r/642565

Change 640562 had a related patch set uploaded (by Dduvall; owner: Dduvall):
[blubber@master] experimental: flense command for aggregating blubber LLB graphs

https://gerrit.wikimedia.org/r/640562

Change 640562 abandoned by Dduvall:
[blubber@master] experimental: flense command for aggregating blubber LLB graphs

Reason:
See associated task for experiment summary.

https://gerrit.wikimedia.org/r/640562

Change 642565 abandoned by Dduvall:
[wikimedia/production@master] experimental: Include blubber flense manifest

Reason:
See associated task for experiment summary.

https://gerrit.wikimedia.org/r/642565

Change 642561 abandoned by Dduvall:
[mediawiki/extensions/Popups@master] experimental: Define a blubber variant for distribution build step

Reason:
See associated task for experiment summary.

https://gerrit.wikimedia.org/r/642561