
Perform l10n cache rebuild using initContainers instead of including it in the image
Closed, Declined · Public

Description

Background

The l10n cache makes up the largest portion of the published multiversion MediaWiki image. If we can avoid building the cache during the image build and instead do it at deploy time, we can avoid substantial computational cost at build time, registry storage requirements, and network I/O during k8s scheduling.

Proposal

To efficiently generate the cache at deploy time, we would need:

  1. A PersistentVolume defining local storage at a path on the node for the l10n caches that would be shared by all pods on each node. It would need to be large enough to keep 2-3 sets of caches, one for each version of MW deployed (which are 4+ GB each, but let's round way up to ~20 GB total).
  2. A PersistentVolumeClaim defined in the chart or elsewhere that claims this local l10n PV for MW deployments. Note that if the PV is statically provisioned and defined ReadWriteMany, only one PVC should be needed.
  3. An initContainer defined in the chart's pod template that mounts the local storage PV during scheduling and runs the rebuildLocalisationCache maintenance script for all wikis. A lock taken with flock on the PV would be used to ensure only one pod per node runs the rebuild.
  4. The PV is mounted by the main container, giving the MW runtime access to the l10n cache files (read-only if possible, though I ran into strange issues having the same volume r/w for the init container and r/o for the main container). A sketch of these resources follows after this list.
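
Roughly, items 2-4 might look something like the sketch below. Everything here is illustrative: the resource names, image reference, cache path, and the /rebuild-l10n.sh wrapper are placeholders, and a real chart would need to invoke rebuildLocalisationCache via the multiversion tooling for every deployed MW version.

```yaml
# Hypothetical PVC claiming the local l10n PV (item 2).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: l10n-cache
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: local-storage
  resources:
    requests:
      storage: 20Gi
---
# Hypothetical pod template fragment for items 3 and 4.
apiVersion: v1
kind: Pod
metadata:
  name: mediawiki
spec:
  volumes:
    - name: l10n
      persistentVolumeClaim:
        claimName: l10n-cache
  initContainers:
    - name: rebuild-l10n
      image: mediawiki-multiversion:latest   # placeholder image name
      command:
        - /bin/sh
        - -c
        # Serialize the rebuild per node with a lock held on the shared volume;
        # /rebuild-l10n.sh is a hypothetical wrapper around rebuildLocalisationCache.
        - flock /srv/mediawiki/cache/l10n/rebuild.lock /rebuild-l10n.sh
      volumeMounts:
        - name: l10n
          mountPath: /srv/mediawiki/cache/l10n
  containers:
    - name: mediawiki
      image: mediawiki-multiversion:latest   # placeholder image name
      volumeMounts:
        - name: l10n
          mountPath: /srv/mediawiki/cache/l10n
          readOnly: true   # r/o for the runtime, per item 4
```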

This idea is a rip off of what had been experimented with previously by @dancy and @jeena. However, the previous dependency on a shared hostPath volume between pods on the same node raised some security concerns which I believe this approach avoids.

Proof of Concept

I've developed a small proof of concept around this idea using a local k8s cluster. Please see https://gist.github.com/marxarelli/3719068d447503800565dccda3154bb2 for implementation.

Need for feedback

This proposal needs serviceops feedback as it would rely on them to manage/provision the local storage PV.

Event Timeline

I have one doubt about the idea of using persistent local volumes... that would mean tying pods to specific nodes, and I'm not 100% sure that's a great idea. Also, I need to verify if it's possible to mount a persistent volume into multiple pods.

I think the basic idea would be, similar to hostPath, to have one PV per node, so this would not restrict Pods to specific nodes. Anyway, I do think this is not currently possible, as local volumes only support the ReadWriteOnce access mode[1], and that would mean one PV per Pod.
Going by the comment in the quoted source, maybe it *is* possible to use RWX local volumes, but I could not find proof of that so far.

Are the security concerns of using hostPath outlined somewhere?
A concern I have regarding a PVC/PV solution is that we currently have this feature set disabled in kubernetes (thus we have no experience with it). I'm pretty sure there are some dark areas here (as there are always dark areas when it comes to persistence :)).

[1] https://github.com/kubernetes/kubernetes/blob/ca643a4d1f7bfe34773c74f79527be4afd95bf39/pkg/volume/local/local.go#L98-L103

The problem is that we'd be forced to mount hostPath as 'read-write' in all pods and allow the first one that gets there to recreate the l10n cache in the hostPath. And those are indeed considered a security liability, as they allow leakage of the host filesystem into the pod; see the considerations on hostPath security here: https://kubernetes.io/docs/concepts/storage/volumes/. But I don't think this specific case would cause enormous harm.

The whole system seems quite brittle to me though, as we won't be sure every host has populated the l10n cache before startup, and I'm not sure how we'll make sure that pods for a specific version won't be declared ready until the whole cache rebuild process is completed.

Probably using a presync hook in helmfile (or hooks in helm) to run a job that populates the cache before the release is more solid and would ensure it happens everywhere before the release is done. It would also allow us to mount the hostPath as read-only.
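
For illustration, a rough sketch of the helm-hook flavour of this (a Job annotated to run before install/upgrade); the job name, image, and paths are made up, and a single Job only runs on one node, so per-node scheduling would still need solving:

```yaml
# Hypothetical pre-install/pre-upgrade hook Job that rebuilds the l10n cache
# before the MediaWiki pods are (re)started.
apiVersion: batch/v1
kind: Job
metadata:
  name: rebuild-l10n-cache
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: rebuild-l10n
          image: mediawiki-multiversion:latest   # placeholder image name
          command: ["php", "maintenance/rebuildLocalisationCache.php", "--threads=8"]
          volumeMounts:
            - name: l10n
              mountPath: /srv/mediawiki/cache/l10n
      volumes:
        - name: l10n
          hostPath:
            path: /srv/l10n-cache        # placeholder node path
            type: DirectoryOrCreate
```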

But this is still enormously wasteful. I would really prefer us to build the l10n cache at build time and instead try to get free of multiversion, so that the single mediawiki images would be 1/2 or 1/3 of the current size.

I would agree that adding PV(C) stuff potentially makes things way more complicated than they would be using a hostPath.
initContainers could wait on the lock being released by the container that acquired it, so the actual mw-containers won't start until the cache is populated successfully. So that might be an option if we need to do that from inside kubernetes/the helm chart.
But as Joe said, this is pretty wasteful as we'd have to do that on every node per version instead of just once per version. Also, this leaves us with mediawiki images that can't actually be run without further interaction, which kind of breaks with the idea of the docker images being self-contained.

> I have one doubt about the idea of using persistent local volumes... that would mean tying pods to specific nodes, and I'm not 100% sure that's a great idea. Also, I need to verify if it's possible to mount a persistent volume into multiple pods.

> Going by the comment in the quoted source, maybe it *is* possible to use RWX local volumes, but I could not find proof of that so far.

In the proof of concept I linked to, I found that it is possible to get away with a single RWX PV and PVC as long as the PV matches all possible nodes in its nodeAffinity stanza. The pod template references the PVC by name and the volume(s) are considered relative to the node on which the pod is scheduled (verified in my tests).

The only somewhat weird complication here is that if the PVC is ever released the PV would need to be manually reclaimed. For that reason, we would either want to 1) have the PVC provisioned outside of the chart and long lived; or 2) verify that helm would not modify the resources unnecessarily upon chart upgrades (something @jeena is looking into).
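
For reference, such a PV would look roughly like the following (the path and names are illustrative, not copied verbatim from the gist):

```yaml
# Hypothetical single local PV as used in the proof of concept: ReadWriteMany,
# a nodeAffinity term that matches every node, and a Retain reclaim policy so
# an accidental PVC release does not delete the node-local data.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: l10n-cache
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /srv/l10n-cache              # placeholder directory present on every node
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: Exists         # i.e. matches all nodes
```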

> The problem is that we'd be forced to mount hostPath as 'read-write' in all pods and allow the first one that gets there to recreate the l10n cache in the hostPath. And those are indeed considered a security liability, as they allow leakage of the host filesystem into the pod; see the considerations on hostPath security here: https://kubernetes.io/docs/concepts/storage/volumes/. But I don't think this specific case would cause enormous harm.

There are other issues related to hostPath. For example, fsGroup (ensuring certain group ownership) is not implemented for it, so all files end up owned by root:root. (The only workaround I have found for this is to do chown in a privileged init container, which is of course terrible.) This was another one of the issues with the previous attempt to use initContainers to generate the caches. The local storage PVs do not have this issue. (See the mediawiki.yaml deployment in the proof of concept.)
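
To illustrate the fsGroup point, a hypothetical pod fragment (the gid, claim name, and paths are made up):

```yaml
# With a PVC-backed volume, the kubelet applies fsGroup ownership to the
# mounted files; for hostPath volumes it does not, so files stay root:root.
apiVersion: v1
kind: Pod
metadata:
  name: mediawiki
spec:
  securityContext:
    fsGroup: 33                        # e.g. www-data (placeholder gid)
  volumes:
    - name: l10n
      persistentVolumeClaim:
        claimName: l10n-cache
  containers:
    - name: mediawiki
      image: mediawiki-multiversion:latest   # placeholder image name
      volumeMounts:
        - name: l10n
          mountPath: /srv/mediawiki/cache/l10n
```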

> The whole system seems quite brittle to me though, as we won't be sure every host has populated the l10n cache before startup, and I'm not sure how we'll make sure that pods for a specific version won't be declared ready until the whole cache rebuild process is completed.

The proof of concept uses flock to ensure only a single process is generating the cache; the other pods' initContainers wait for completion and then briefly run rebuildLocalisationCache again to verify the generated files already exist.
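
A hypothetical sketch of that initContainer pattern (not quoted from the gist; the image and paths are placeholders):

```yaml
initContainers:
  - name: rebuild-l10n
    image: mediawiki-multiversion:latest   # placeholder image name
    command:
      - /bin/sh
      - -c
      - |
        # Blocks here if another pod on this node holds the lock; once the
        # lock is acquired, rebuildLocalisationCache finishes quickly if the
        # cache for this MW version has already been generated.
        flock /srv/mediawiki/cache/l10n/rebuild.lock \
          php maintenance/rebuildLocalisationCache.php --threads=8
    volumeMounts:
      - name: l10n
        mountPath: /srv/mediawiki/cache/l10n
```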

> Probably using a presync hook in helmfile (or hooks in helm) to run a job that populates the cache before the release is more solid and would ensure it happens everywhere before the release is done. It would also allow us to mount the hostPath as read-only.

I think the only issue I heard from my team on this was that it seemed more brittle than using an init container. :) After hearing them out, I think I agree, though I'm not opposed to looking into the job approach further. What I found right off after mentioning it was that there doesn't seem to be a built-in way to schedule a Job to all nodes, but you could do it with a DaemonSet hack, or there are third-party CRD implementations for it.
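
For the record, the DaemonSet hack would look roughly like this (all names, images, and paths are placeholders):

```yaml
# Hypothetical "DaemonSet hack": the per-node rebuild happens in an
# initContainer, and the main container just idles so the pod stays Running.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: rebuild-l10n-cache
spec:
  selector:
    matchLabels:
      app: rebuild-l10n-cache
  template:
    metadata:
      labels:
        app: rebuild-l10n-cache
    spec:
      initContainers:
        - name: rebuild
          image: mediawiki-multiversion:latest   # placeholder image name
          command: ["php", "maintenance/rebuildLocalisationCache.php", "--threads=8"]
          volumeMounts:
            - name: l10n
              mountPath: /srv/mediawiki/cache/l10n
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9       # keeps the pod alive once the rebuild is done
      volumes:
        - name: l10n
          hostPath:
            path: /srv/l10n-cache                # placeholder node path
            type: DirectoryOrCreate
```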

> But this is still enormously wasteful. I would really prefer us to build the l10n cache at build time and instead try to get free of multiversion, so that the single mediawiki images would be 1/2 or 1/3 of the current size.

> I would agree that adding PV(C) stuff potentially makes things way more complicated than they would be using a hostPath.
> initContainers could wait on the lock being released by the container that acquired it, so the actual mw-containers won't start until the cache is populated successfully. So that might be an option if we need to do that from inside kubernetes/the helm chart.
> But as Joe said, this is pretty wasteful as we'd have to do that on every node per version instead of just once per version. Also, this leaves us with mediawiki images that can't actually be run without further interaction, which kind of breaks with the idea of the docker images being self-contained.

It is computationally expensive only during the first deployment of a new MW version where l10n files for the new version don't exist yet (so once every Tuesday). On subsequent deployments involving the same MW versions, rebuildLocalisationCache shouldn't be doing much as the existing l10n cache files for the given versions would still exist.

A solution where a job is scheduled to each existing node would be the same computation-wise. And while this is a computationally expensive task that could be done once in a single place, it is less expensive in registry storage and network I/O (and deployment time).

I definitely want to be free of multiversion too, but I see this as a viable solution to the l10n problem to use in the interim. Also, it should be said that our current system of deploying l10n cache files is not great computation-wise. Scap generates l10n CDBs on the deploy host, then transforms them into JSON, then on every target host it turns them back into CDBs. We seem to handle that ok.

> It is computationally expensive only during the first deployment of a new MW version where l10n files for the new version don't exist yet (so once every Tuesday). On subsequent deployments involving the same MW versions, rebuildLocalisationCache shouldn't be doing much as the existing l10n cache files for the given versions would still exist.

> A solution where a job is scheduled to each existing node would be the same computation-wise. And while this is a computationally expensive task that could be done once in a single place, it is less expensive in registry storage and network I/O (and deployment time).

Sure, there is no difference in running initContainers vs. Jobs (computation-wise); I did not mean to argue in that direction.
As we're already actively working on the network I/O side of this: have you taken measurements of how much faster generating l10n before deployment is compared to transferring the pre-computed cache over the network? I'm trying to understand how big the gain is there.

> I definitely want to be free of multiversion too, but I see this as a viable solution to the l10n problem to use in the interim. Also, it should be said that our current system of deploying l10n cache files is not great computation-wise. Scap generates l10n CDBs on the deploy host, then transforms them into JSON, then on every target host it turns them back into CDBs. We seem to handle that ok.

Forgive me if I'm being ignorant here, but if we have a working solution for l10n distribution, why not use that to distribute to the k8s nodes as well and just hostPath-mount the directory? In that case, the files being owned by root / fsGroup not being implemented should not be an issue either, right? From my naive POV, adding a new/another way of generating l10n at deploy time is of no real benefit (or I fail to see it), but OTOH doing it would add more code to the helm charts, increase the complexity of the deploy process as well as of the k8s installation (we'd have to add/enable persistent volume support in general), and also potentially require deployers to debug l10n generation/distribution issues in a setup that is bound to go away anyway (whereas they potentially already know how to debug it in its current, scap-based form).
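
i.e. something as simple as the following fragment (paths and names made up), with scap continuing to populate the directory on each node:

```yaml
# Pod spec fragment: the node directory kept up to date by scap is mounted
# read-only into the MediaWiki container; no PV/PVC or init logic needed.
containers:
  - name: mediawiki
    image: mediawiki-multiversion:latest   # placeholder image name
    volumeMounts:
      - name: l10n
        mountPath: /srv/mediawiki/cache/l10n
        readOnly: true
volumes:
  - name: l10n
    hostPath:
      path: /srv/mediawiki/cache/l10n      # placeholder: directory populated by scap on the node
      type: Directory
```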

> As we're already actively working on the network I/O side of this: have you taken measurements of how much faster generating l10n before deployment is compared to transferring the pre-computed cache over the network? I'm trying to understand how big the gain is there.

I don't have exact figures for network transfer, but I will look into it. The l10n cache layer of the uncompressed image is roughly 2/3 of the total image size when there is only one MW version present, and a little more when there are two versions. I'll take a look at the compressed version as well.

Closing this out with relevant updates from the discussion during our RelEng/SRE IC meeting today.

  • The current images take approximately 5+ minutes to transfer to all nodes, which is unacceptable; scap-based deployment is/was seen as one possible workaround for now.
  • We have ideas for reducing the image size, one being the initContainer-based l10n cache rebuild.
  • However, the incremental image build process @dancy has developed (T286505) reduces the subsequent image layer sizes to such a large degree that it makes the l10n issue moot for now. There may be scenarios where incremental code changes result in a large subsequent l10n rebuild layer but these scenarios are likely rare and not worth optimizing at the moment. If they ever become a problem, we can revisit this or other proposals for deferring the l10n cache rebuild until deployment.
  • The incremental image build process should allow us to move forward with image based code deployments.