
Sketch MediaWiki production image composition and workflows
Open, Medium, Public

Description

Provide a rough sketch of how container images might be composed (and how builds are triggered and by what events) for WMF production MediaWiki including its extensions, skins, vendor directory, l10n CDB files, and security patches.

Event Timeline

dduvall created this task.Aug 6 2020, 5:05 PM
Restricted Application added a subscriber: Aklapper.Aug 6 2020, 5:05 PM
dduvall claimed this task.Aug 6 2020, 5:14 PM
dduvall triaged this task as Medium priority.
dduvall added a subscriber: jeena.Aug 18 2020, 11:37 PM

@jeena and I met today to review the following rough sketch. It's a work in progress, but the basic ideas are:

  1. Rely on existing infrastructure and tooling to shorten the path to an MVP:
    1. Docker registry, although we'll need something private in addition to our existing registry, which is public.
    2. PipelineLib, which can already orchestrate image building, testing, and publishing.
    3. Blubber, which can help us produce efficient, security-conforming images for production.
  2. Facilitate build-step processes for MW core/extensions/skins during gate-and-submit that are needed by Platform Engineering (see T257582: Work with RelEng to add PipelineBot to Vector).
  3. Store the results of build-step operations for core/extensions/skins as intermediate scratch-based images that contain only what is needed at production runtime for each component, including build artifacts.
    1. Scratch images result in archives comparable in size to tarballs, can be read and written with existing Docker-based tooling, and will be cached on build machines in existing Docker image caches.
    2. Blubber can easily be modified to produce scratch images (see https://gerrit.wikimedia.org/r/c/blubber/+/619072)
    3. These images should contain only what is needed at production runtime.
    4. A convention could be adopted whereby the build script for each component is expected to put the desired production file hierarchy into a dist/ directory, which would be copied into the scratch image (a rough sketch follows this list).
  4. Integrate distribution images into a combined MediaWiki image according to the current weekly deploy model, but leave room for faster deploy cycles down the road.
    1. Produce both releasable and deployable images.
    2. Compile l10n updates during integration.
    3. Apply security updates to the deployable image only.
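
To make items 3 and 4 a bit more concrete, here's a very rough shell sketch of what the per-component build step could boil down to. The registry name, tag scheme, dist/ layout, and copied paths are all placeholders rather than decided conventions, and in practice PipelineLib/Blubber would drive this rather than hand-run docker commands:

# Build step inside a component repo (core, an extension, or a skin);
# whatever belongs in production ends up under dist/ (placeholder convention).
composer install --no-dev
mkdir -p dist
cp -r includes resources *.json dist/   # exact contents vary per component

# Wrap dist/ in a scratch-based image: roughly tarball-sized, but cacheable
# and distributable through a (restricted) Docker registry.
rev="$(git rev-parse --short HEAD)"
docker build -t "restricted-registry.example/mw/vector:${rev}" -f - . <<'EOF'
FROM scratch
COPY dist/ /dist/
EOF
docker push "restricted-registry.example/mw/vector:${rev}"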

Some open questions and follow-ups from our discussion include:

  • Where do security patches come from?
    • Currently live in Phabricator tasks as attached files
    • And on the deployment server
    • Might it be better to have them in additional/restricted Gerrit/Git branches for each project (based on wmf/ branch)?
      • The CI pipeline that builds the production variant could access these branches and integrate them into distribution images (rough sketch after this list)
      • Would require restricted registry.
      • If .pipeline/config.yaml is defining this process, it might allow for exposure of patches. Constrain the process?
  • It's unfortunate that l10n compilation has to be done post integration. Is this really the case? Let's confirm
    • Check whether it would help to compile localization for each language in parallel
  • Leaves out MW configuration. SRE is figuring that out.
  • Currently assumes a weekly branch cut or some degree of periodicity.
  • Let's work on a short presentation!
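
A similarly rough sketch of what the deployable-variant integration step could look like, covering both the restricted-branch and the l10n questions above. The branch name is hypothetical, and whether --threads gives us useful per-language parallelism is exactly the thing to confirm:

# Apply security patches by merging a restricted per-release branch
# (hypothetical branch name, based on the wmf/ branch) rather than
# carrying patch files around on the deployment server.
git fetch origin refs/heads/wmf/1.36.0-wmf.9-security
git merge --ff-only FETCH_HEAD

# Compile the l10n caches during integration. rebuildLocalisationCache.php
# already takes a --threads option, which may address the parallelism
# question, but that needs confirming.
php maintenance/rebuildLocalisationCache.php --threads=8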

Inviting @akosiaris and @Joe to this and related tasks for comment.

Naike moved this task from Backlog to In Progress on the MW-on-K8s board.Sep 8 2020, 3:58 PM
mmodell added a subscriber: mmodell.Sep 9 2020, 5:19 PM
Joe added a comment (edited).Sep 15 2020, 6:50 AM

Let me start by clarifying a couple things:

  • SRE is not working on the configuration of MediaWiki. AIUI, our expectation is that Release Engineering and Dev teams would work on it.
  • scratch images are ok only if it's just for intermediate build steps.

Apart from this, it seems to me we're reproducing, in the process above, the code structure and the procedures we already have. I think if we do that, we would be missing a huge opportunity.
More explicitly: I think trying to solve the problem of creating a MediaWiki container should only be tackled after we've reorganized the code that we release.

In the long run, I think what makes sense is that we start having the concept of a "release repository" - probably private - that gets automatically updated:

  • When a configuration that is still in code changes
  • When a new version is released

This means that current backports will need to bump the production version by a minor (so 1.36.0-wmf.31 would become 1.36.0-wmf.31-1 and so on).

So imagine that, upon marking a release version or merging a configuration change, a CI process commits all the changes necessary to this release repository. This release repository would be the one having the pipeline configuration for basing your image on top of the php-fpm one.

Thanks for the feedback, @Joe.

Let me start by clarifying a couple things:

  • SRE is not working on the configuration of MediaWiki. AIUI, our expectation is that Release Engineering and Dev teams would work on it.

A few of us in RelEng are likely game to work on it. Seems like between the MediaWiki on K8s doc you started and the discussion on mediawiki.org we have a good basis to start from.

  • scratch images are ok only if it's just for intermediate build steps.

Yes, that's the idea here. And the scratch images can be heavily garbage collected: if the HEAD of any MW repo for a subsequent release cycle is different from that of the previous cycle, we can delete the previous cycle's scratch images for that repo.
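
For example, a GC pass over the per-component scratch images could be as simple as the following (tag scheme and image name are hypothetical, and this only cleans a build host's local cache; registry-side deletion would go through the registry API or its own GC):

# In the component's repo checkout: keep only the scratch image whose tag
# matches the commit the new release branch points at; older commit-tagged
# images are no longer referenced and can be removed.
keep=$(git ls-remote origin refs/heads/wmf/1.36.0-wmf.10 | cut -c1-7)
docker images --format '{{.Repository}}:{{.Tag}}' restricted-registry.example/mw/vector \
  | grep -v ":${keep}$" \
  | xargs -r docker rmi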

Apart from this, it seems to me we're reproducing, in the process above, the code structure and the procedures we already have. I think if we do that, we would be missing a huge opportunity.
More explicitly: I think trying to solve the problem of creating a MediaWiki container should only be tackled after we've reorganized the code that we release.

Could you expand on what you mean by "reorganize the code that we release?" Is there a specific goal towards doing this, or some kind of concept you can point me to? The idea here is definitely to move us toward containerization using the code structure we already have for the sake of achievability in the short term. After we have a functional pipeline, we can always iteratively refactor the way we structure our codebase. But if there's a better way we can structure the deployable code and people are on board to do that, we can certainly adapt this idea to suit it.

I disagree that this is the same process we use now. It seems quite different to me in that it introduces a build step and seeks to minimize the distributable portions of each MW component after such a build step is completed. The integration workflow is also quite different in that it no longer depends on pulling in source trees at deploy time. There are similarities, however, to our weekly train process in the way security patches and l10n cache would be handled, but I'm totally open to better ideas there if we can solve the current constraints.

In the long run, I think what makes sense is that we start having the concept of a "release repository" - probably private - that gets automatically updated:

  • When a configuration that is still in code changes

A configuration that is still in code? Like something defined in mediawiki-config? Or elsewhere?

  • When a new version is released

This means that current backports will need to bump the production version by a minor (so 1.36.0-wmf.31 would become 1.36.0-wmf.31-1 and so on).

Yes, that's the idea here as well. It's starting with the cadence we use now but with the idea to greatly increase the frequency and incorporate BACON (backport) changes. I agree the versioning should support that requirement.

So imagine that, upon marking a release version or merging a configuration change, a CI process commits all the changes necessary to this release repository. This release repository would be the one having the pipeline configuration for basing your image on top of the php-fpm one.

Putting the pipeline configuration in mediawiki-config makes sense to me.

Joe added a comment.Sep 23 2020, 2:59 PM

To further clarify, I think the main question would be transforming a good portion of the workflow above into actual git commits to a repository. It's then trivial to compose a docker image that includes that git repo.

But more to the point: I'm also worried about the image size. How big would this image end up being? Do you have an order-of-magnitude size estimation?

I ask because currently /srv/mediawiki on an average appserver is ~ 27 GB, and that would definitely not be a container size that would make sense for us.

To further clarify, I think the main question would be transforming a good portion of the workflow above into actual git commits to a repository. It's then trivial to compose a docker image that includes that git repo.

Do you mean using git repos to store production runtime code and build artifacts as opposed to storing that in docker registries?

In my thinking, a binary store is in general more suitable for build artifacts, and using docker images can offer us some specific benefits such as caching of previously built layers, cutting down on build time within the pipeline and on data transfer during deployment. Git doesn't offer much benefit as an artifact store and comes with the overhead of tracking history, which doesn't seem necessary in this context. And if we delay building an image filesystem until we have a fully integrated MW filesystem in a git tree, we lose out on the possible benefits of cacheable image layers (though that exact benefit is only theoretical, not yet proven). :)

But more to the point: I'm also worried about the image size. How big would this image end up being? Do you have an order-of-magnitude size estimation?

I ask because currently /srv/mediawiki on an average appserver is ~ 27 GB, and that would definitely not be a container size that would make sense for us.

What are the biggest concerns about disk space? Is it with regard to registry storage or more about running pods? For the former, I'm imagining aggressive garbage collection FWIW (again, only theoretical at this point, which is why I've created subtasks for experimentation).

Looking at a single recent release (1.36.0-wmf.9), the size breakdown of the ten largest subdirectories (du sizes in 1 KiB blocks) is:

$ du -s /srv/mediawiki/php-1.36.0-wmf.9/* | sort -rnk 1 | head -n 10
5796756	/srv/mediawiki/php-1.36.0-wmf.9/cache
562820	/srv/mediawiki/php-1.36.0-wmf.9/extensions
418144	/srv/mediawiki/php-1.36.0-wmf.9/vendor
68508	/srv/mediawiki/php-1.36.0-wmf.9/languages
30460	/srv/mediawiki/php-1.36.0-wmf.9/includes
17972	/srv/mediawiki/php-1.36.0-wmf.9/skins
17864	/srv/mediawiki/php-1.36.0-wmf.9/tests
16832	/srv/mediawiki/php-1.36.0-wmf.9/resources
4596	/srv/mediawiki/php-1.36.0-wmf.9/maintenance
1128	/srv/mediawiki/php-1.36.0-wmf.9/HISTORY

The vast majority of the disk-space requirement comes from the l10n caches. As the workflow is currently defined, those would be included in the final image. I'm not sure if we can avoid those eventually being in the running pods (and thus taking up disk space) but if the concern is more about registry disk usage, they could definitely pile up. I think this is again where the benefits of image layer caching might come in handy, depending on how we compose the image dependency tree—thinking in terms of BuildKit LLB graph structures here, not flat Dockerfiles.
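
As a strawman for that composition (base image name, registry, and paths are placeholders, and whether these layers actually get reused between weekly builds is exactly the open question), the combined image could put each large piece in its own layer:

# Each COPY below becomes its own image layer; a layer is shared between
# images only when its content is byte-identical, so the l10n cache layer
# would still be ~5.8 GB of new data every time it changes.
docker build -t restricted-registry.example/mediawiki:1.36.0-wmf.9 -f - /srv/mediawiki <<'EOF'
# php-fpm-base is a placeholder for the php-fpm base image mentioned above
FROM php-fpm-base:latest
COPY php-1.36.0-wmf.9/vendor     /srv/mediawiki/vendor/
COPY php-1.36.0-wmf.9/extensions /srv/mediawiki/extensions/
COPY php-1.36.0-wmf.9/skins      /srv/mediawiki/skins/
# the l10n cache dominates (~5.8 GB per the du output above)
COPY php-1.36.0-wmf.9/cache      /srv/mediawiki/cache/
EOF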