
In a k8s world: where does MediaWiki code live?
Closed, ResolvedPublic

Description

In the MediaWiki on k8s initial proposal document @Joe mentioned a phase of the transition during which MediaWiki's PHP code would be mounted to a php-fpm container running under k8s as a volume rather than built into the container.

There are a few unanswered questions and points needing clarification in the comment thread of the Google document; hence this task.

Some things needing clarity:

  • When a MediaWiki pipeline image becomes available, how easy will it be to transition, and is the intention to transition?
  • @dduvall and/or @jeena are you objecting to having the code mounted as a volume as a transition step? Or as a final step?
  • How long is this transition period?

Event Timeline

My concern is that this transition step becomes a permanent step.

From a delivery point of view, having an immutable artifact which represents what we intend to deploy is advantageous. Deployment is simpler, there are fewer Kubernetes resources required, and it’s easier to track what is running in production. I don’t see an advantage to not shipping the code in the image, and if it’s about avoiding tackling technical debt in regards to running build steps at deploy time, I think it’s better to do that hard work rather than settle.

My concern is that this transition step becomes a permanent step.

It's up to us to make sure that doesn't happen, but I have further doubts that I will clarify below. And yes, I made clear in a meeting with our managers that I have that concern too (after all, Starling's first law still holds), and they committed to repaying that debt once we've transitioned traffic.

From a delivery point of view, having an immutable artifact which represents what we intend to deploy is advantageous. Deployment is simpler, there are fewer Kubernetes resources required, and it’s easier to track what is running in production. I don’t see an advantage to not shipping the code in the image, and if it’s about avoiding tackling technical debt in regards to running build steps at deploy time, I think it’s better to do that hard work rather than settle.

There are a series of unsolved problems we need to solve before I'm comfortable with deciding to go with "code in the image":

  • We need to understand how large such an image would become, and what strain it would put on our infrastructure when hundreds of servers try to pull it at the same time. We might need to be creative about how we distribute it in production (BitTorrent comes to mind). This is tracked in T264209.
  • If that's not a problem, we would still have the disk space utilization issue on the k8s workers, but that seems like a minor concern in comparison. However, high IOPS would affect other services running on the same servers (the "noisy neighbours" problem).
  • How do we manage things that are automatic deployments right now? We would need to not only publish the new image to the registry from CI, but also run some script on a server to actually cause the deployment to happen. I'm not sure if that's the same thing that happens right now for l10nupdates, but this basically means we'd need a way to automatically merge changes to our deployment-charts repo. I'm not sure that's a great idea security-wise.
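Some back-of-the-envelope arithmetic gives a sense of the pull-strain question in the first bullet. All numbers here are hypothetical placeholders (a 6GB image, 300 workers, a 20 Gbit/s aggregate registry uplink), not measured values:

```python
# Rough estimate of registry strain when many k8s workers pull a large
# image at the same time. All figures are hypothetical placeholders.

IMAGE_SIZE_GB = 6        # e.g. a full MediaWiki image
WORKERS = 300            # number of workers pulling simultaneously
REGISTRY_GBIT_S = 20     # aggregate registry uplink (2 x 10 Gbit/s)

total_gbit = IMAGE_SIZE_GB * 8 * WORKERS    # data to serve, in Gbit
seconds = total_gbit / REGISTRY_GBIT_S      # best-case transfer time

print(f"{total_gbit} Gbit to serve, ~{seconds / 60:.0f} min at line rate")
# → 14400 Gbit to serve, ~12 min at line rate
```

Even in this idealized best case the pull takes minutes, which is why layer caching and alternative distribution mechanisms matter; the benchmarks in T264209 are what would replace these guesses with real numbers.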

but more importantly, we need to keep in mind the transition phase.

  • We will need to keep the experience of deploying to production non-miserable. Also, we want all traffic to be served by the same version of MediaWiki as much as possible. So we want neither two different deployment systems, nor to have to make sure CI ran the build-and-publish step after we've merged a change before we can deploy it.
  • It would be relatively easy to add a "roll restart k8s deployments" stage to scap so that it's mostly transparent to the deployer and also works when people just want to change one file.
  • I don't think the current model of the backport window would work well with a CI build step that will need to happen for each merge.
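As a rough sketch of the "roll restart" stage mentioned above, assuming kubectl access to the relevant clusters (the cluster contexts and deployment name are illustrative placeholders, not real production names):

```python
import subprocess

# Hypothetical scap stage: after syncing code, roll-restart the MediaWiki
# deployments on each k8s cluster so pods pick up the new code.
# Cluster contexts and the deployment name are placeholders.

CLUSTERS = ["eqiad", "codfw"]
DEPLOYMENT = "mediawiki-php-fpm"

def roll_restart(dry_run: bool = True) -> list[list[str]]:
    """Build (and optionally run) the rollout-restart commands."""
    commands = []
    for cluster in CLUSTERS:
        cmd = ["kubectl", "--context", cluster,
               "rollout", "restart", f"deployment/{DEPLOYMENT}"]
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands

# Dry run: show what would be executed without touching any cluster.
for cmd in roll_restart(dry_run=True):
    print(" ".join(cmd))
```

A real scap stage would of course integrate with scap's own logging and error handling; the point is only that the k8s side reduces to a per-cluster `kubectl rollout restart`.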

In conclusion, I think we should act as follows:

  • Figure out the expected size of a container with all the stuff needed to run MediaWiki
  • Answer the questions about its feasibility by running the benchmarks in T264209

If we get positive answers there, then we need to:

  • Figure out a better way to ship configuration that is k8s-compatible and doesn't require as many deployments in normal operating conditions
  • Finish the transition of live traffic to k8s
  • Switch to code-in-the-image once we're convinced we can convert all processes we have, automated or not

TL;DR: I see "code-in-the-image" as the last step of the migration, if all goes according to plan. Doing so will need quite some engineering effort, both on the deployment part and on the image-building part. There is one big feasibility question mark that will need to be answered first, and I think that while the transition is ongoing we should just distribute the code with scap for the time being.

Mentioned in SAL (#wikimedia-operations) [2020-10-14T14:01:29Z] <akosiaris> push a 6GB image, named docker-registry.discovery.wmnet/mwcachedir:0.0.1, containing the cache/ dir of a mediawiki installation to the registry. T265183

From a delivery point of view, having an immutable artifact which represents what we intend to deploy is advantageous. Deployment is simpler, there are fewer Kubernetes resources required, and it’s easier to track what is running in production. I don’t see an advantage to not shipping the code in the image, and if it’s about avoiding tackling technical debt in regards to running build steps at deploy time, I think it’s better to do that hard work rather than settle.

There are a series of unsolved problems we need to solve before I'm comfortable with deciding to go with "code in the image":

  • We need to understand how large such an image would become, and what strain it would put on our infrastructure when hundreds of servers try to pull it at the same time. We might need to be creative about how we distribute it in production (BitTorrent comes to mind). This is tracked in T264209.
  • If that's not a problem, we would still have the disk space utilization issue on the k8s workers, but that seems like a minor concern in comparison. However, high IOPS would affect other services running on the same servers (the "noisy neighbours" problem).

I'm hoping the results of T260828: Experiment with PipelineLib/Blubber driven MediaWiki container image pipeline will also help to answer some of these specific questions around overall image size. It's important to note that overall disk utilization and network IOPS will be greatly impacted by the way images are composed—the local cache-ability of intermediate image layers. We can't really know what this will look like without experimenting.

  • How do we manage things that are automatic deployments right now? We would need to not only publish the new image to the registry from CI, but also run some script on a server to actually cause the deployment to happen. I'm not sure if that's the same thing that happens right now for l10nupdates, but this basically means we'd need a way to automatically merge changes to our deployment-charts repo. I'm not sure that's a great idea security-wise.

I'm not aware of anything other than l10n-updates, which has been disabled for a while now with no real plans to reinstate it. Are there others?

  • We will need to keep the experience of deploying to production non-miserable. Also, we want all traffic to be served by the same version of MediaWiki as much as possible. So we want neither two different deployment systems, nor to have to make sure CI ran the build-and-publish step after we've merged a change before we can deploy it.

I think MediaWiki version consistency is crucial. I'm glad you're highlighting that. I think the requirement @jeena brought up about having code come from an "immutable artifact" is also about achieving determinism and consistency to a great extent.

But aren't we talking about introducing a second deployment system either way? If we're not going to package code into images, where does it come from? Where is it prepared? How does it get transferred to shared volume(s) mounted by running pods? Unless the answer to all of these questions is the system we have now (release tools, deployment server, scap prep, scap sync) we're setting up a parallel one. Why not make the future deployment system congruent with native k8s things and hack the old system to serve as the transitionary one?

To propose one alternative straw-man solution for a transitionary deployment system: We could adapt scap prep to unpack MediaWiki container images into /srv/mediawiki-staging/php-{version} on the deployment host. Since images are just tarballs with metadata, it's not that complicated a task, and we'd end up with a filesystem tree similar to what we have now (minus the .git objects), and more importantly one that scap sync and friends already know how to deploy to mw servers. This kind of transitionary system would require some changes to human workflows (BACON deploys come to mind), but it would also yield some important benefits:

  1. The MediaWiki code we'd be deploying would come from a single immutable source (an image).
  2. Since the transitionary cruft is all in scap, there's a clear(er) path to deprecation.
  3. The distribution format (images) is more congruent with the system we want to be using and supporting long-term (k8s).
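A minimal sketch of the unpacking idea, assuming a `docker save`-style tarball whose `manifest.json` lists the layer tars in application order (the staging path matches the current layout; the function name and version argument are illustrative):

```python
import json
import tarfile
from pathlib import Path

# Sketch: unpack a saved MediaWiki image (docker save format) into the
# staging tree that scap sync already knows how to distribute.
# A real implementation would also need to honour OCI whiteout files.

def unpack_image(image_tar: str, version: str,
                 staging: str = "/srv/mediawiki-staging") -> Path:
    dest = Path(staging) / f"php-{version}"
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(image_tar) as img:
        # manifest.json lists the layer tarballs in application order.
        manifest = json.load(img.extractfile("manifest.json"))
        for layer in manifest[0]["Layers"]:
            with tarfile.open(fileobj=img.extractfile(layer)) as layer_tar:
                layer_tar.extractall(dest)  # later layers overwrite earlier
    return dest
```

Applying the layers in order reproduces the image's final filesystem, which is exactly the tree scap would then rsync out to the mw servers during the transition.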
  • It would be relatively easy to add a "roll restart k8s deployments" stage to scap so that it's mostly transparent to the deployer and also works when people just want to change one file.
  • I don't think the current model of the backport window would work well with a CI build step that will need to happen for each merge.

Adapting scap to initiate k8s deployments during transition makes sense to me as well, and it could work well in tandem with a scap prep that unpacks images.

Changes to the backport deployment process would definitely be needed. I'm not sure I agree that it wouldn't work well with a build step. It really depends on how distributed and cacheable those intermediate build steps can be, and we won't know that without running some experiments—this is another question I hope to answer in T260828: Experiment with PipelineLib/Blubber driven MediaWiki container image pipeline.

TL;DR: I see "code-in-the-image" as the last step of the migration, if all goes according to plan. Doing so will need quite some engineering effort, both on the deployment part and on the image-building part. There is one big feasibility question mark that will need to be answered first, and I think that while the transition is ongoing we should just distribute the code with scap for the time being.

There seems to be disagreement about priority, but I also think it stems from a shared goal—in other words, we're working from opposite (but complementary) sides of the same problem. Solving the build and deployment parts of the equation is squarely within our concerns and goals for the coming quarters. It's a large puzzle with many unknowns, so the more we can leave room for experimentation and interpretation of results, the better.

My concern is that this transition step becomes a permanent step.

It's up to us to make sure that doesn't happen, but I have further doubts that I will clarify below. And yes, I made clear in a meeting with our managers that I have that concern too (after all, Starling's first law still holds), and they committed to repaying that debt once we've transitioned traffic.

I have the same concern, and the question is how we are going to repay that debt. If we don't have a plan already, even an incomplete one, I fear there will be little incentive to work on one later on, and I am willing to bet that the resources devoted to repaying that debt will be less than the resources allocated now.

From a delivery point of view, having an immutable artifact which represents what we intend to deploy is advantageous. Deployment is simpler, there are fewer Kubernetes resources required, and it’s easier to track what is running in production. I don’t see an advantage to not shipping the code in the image, and if it’s about avoiding tackling technical debt in regards to running build steps at deploy time, I think it’s better to do that hard work rather than settle.

There are a series of unsolved problems we need to solve before I'm comfortable with deciding to go with "code in the image":

  • We need to understand how large such an image would become, and what strain it would put on our infrastructure when hundreds of servers try to pull it at the same time. We might need to be creative about how we distribute it in production (BitTorrent comes to mind). This is tracked in T264209.

Yeah, already on it; I am updating that task. We found some bottlenecks already, but they are not too hard to address for now.

  • If that's not a problem, we would still have the disk space utilization issue on the k8s workers, but that seems like a minor concern in comparison. However, high IOPS would affect other services running on the same servers (the "noisy neighbours" problem).

I am not so worried about the space issue, to be honest. With the Docker layered approach, the GC of images done by the kubelet, and the fact that nodes have >800GB, we are probably OK. The IOPS side might indeed show up, but I am willing to say that this is a bridge to cross when we get there.

  • We will need to keep the experience of deploying to production non-miserable. Also, we want all traffic to be served by the same version of MediaWiki as much as possible. So we want neither two different deployment systems, nor to have to make sure CI ran the build-and-publish step after we've merged a change before we can deploy it.

Agreed on the UX part; however, I think we should have two different deployment systems that, for the duration of the transition, are somewhat hidden under scap. This ties into Dan's proposal.

  • It would be relatively easy to add a "roll restart k8s deployments" stage to scap so that it's mostly transparent to the deployer and also works when people just want to change one file.

Changing just one file is a pattern that we should revisit, I think, especially if we want our artifacts to mean anything version-wise.

  • I don't think the current model of the backport window would work well with a CI build step that will need to happen for each merge.

Agreed. In fact, my idea is more of a time-based rather than a per-change CI: build a set of images once every X hours (12? 24? TBD) and use those instead.
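The time-based scheme could key every image to a fixed-length build window rather than to a commit, so all merges within a window map to one image. A small sketch (the 12-hour window stands in for the undecided X above; the tag format is illustrative):

```python
from datetime import datetime, timezone

# Sketch: derive a deterministic image tag from a fixed-length build
# window, so every merge within the same window maps to the same image.
# The 12-hour window length stands in for the undecided X above.

WINDOW_HOURS = 12

def window_tag(now: datetime) -> str:
    """Tag for the build window containing `now` (UTC)."""
    window_start_hour = (now.hour // WINDOW_HOURS) * WINDOW_HOURS
    return f"{now:%Y%m%d}-{window_start_hour:02d}"

print(window_tag(datetime(2020, 10, 14, 14, 1, tzinfo=timezone.utc)))
# → 20201014-12
```

A deterministic tag like this makes the "which image is running?" question easy to answer, though backports landing mid-window would still need an out-of-band rebuild path.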

In conclusion, I think we should act as follows:

  • Figure out the expected size of a container with all the stuff needed to run MediaWiki
  • Answer the questions about its feasibility by running the benchmarks in T264209

If we get positive answers there, then we need to:

  • Figure out a better way to ship configuration that is k8s-compatible and doesn't require as many deployments in normal operating conditions

Yes. +1

  • Finish the transition of live traffic to k8s

I am not sure this specific step should come before the step below, though. Quite the contrary.

  • Switch to code-in-the-image once we're convinced we can convert all processes we have, automated or not

TL;DR: I see "code-in-the-image" as the last step of the migration, if all goes according to plan. Doing so will need quite some engineering effort, both on the deployment part and on the image-building part. There is one big feasibility question mark that will need to be answered first, and I think that while the transition is ongoing we should just distribute the code with scap for the time being.

To propose one alternative straw-man solution for a transitionary deployment system: We could adapt scap prep to unpack MediaWiki container images into /srv/mediawiki-staging/php-{version} on the deployment host. Since images are just tarballs with metadata, it's not that complicated a task, and we'd end up with a filesystem tree similar to what we have now (minus the .git objects), and more importantly one that scap sync and friends already know how to deploy to mw servers. This kind of transitionary system would require some changes to human workflows (BACON deploys come to mind), but it would also yield some important benefits:

I like this idea. That could allow us to move toward the code-in-the-image approach while maintaining the current UX.

@dduvall I like the idea of using scap prep to extract the code from the images, I didn't think of inverting the logic like that but it's surely workable.

I have one doubt though: we need the image to support all current versions of the code, at least for now, and to include mediawiki-config as well. Your comment seems to imply the images will have just one version of MediaWiki, and that might not be possible, at least initially: we're not planning to separate group0/1/2 at the moment, as it would require quite a lot of work on possibly multiple levels.

Apart from that, the wonderful work @akosiaris has done on the registry scaling task answered some questions we had: we will need to work on the infrastructure before we can push this to 100s of k8s workers, but there are ways to do so.

@dduvall I like the idea of using scap prep to extract the code from the images, I didn't think of inverting the logic like that but it's surely workable.

Glad to hear that! It seemed a bit crazy in my head, so I was surprised when you and others said it sounded viable. :)

I have one doubt though: we need the image to support all current versions of the code, at least for now, and to include mediawiki-config as well. Your comment seems to imply the images will have just one version of MediaWiki, and that might not be possible, at least initially: we're not planning to separate group0/1/2 at the moment, as it would require quite a lot of work on possibly multiple levels.

In my experimenting with MW image builds, I'm limiting the scope to a single MW version. That was done in part to focus the experiment, but I also think it makes sense from an integration perspective whereby smaller MW components are incrementally built and then combined to form a larger and larger installation.

I think if we have multiple single-version images being kept up-to-date, we can then combine them into an aggregate image just before deployment. This seems like it would be an efficient build step, in theory at least, as it would only involve a few COPY --from instructions. However, it's yet another thing that we'd want to research to answer questions around cache-ability and network/disk overhead. Diagramming a sequence and build graph of how this would work seems like a helpful step too. I'll try to draw up something for the next pipeline meeting.
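The aggregation step could reduce to generating a short multi-stage Dockerfile from the set of single-version images. A sketch of that generation (the registry, image names, tags, and paths are all hypothetical):

```python
# Sketch: generate the aggregate-image Dockerfile from a set of
# single-version MediaWiki images. Registry, image names, tags, and
# paths are hypothetical placeholders.

VERSIONS = {
    "1.36.0-wmf.12": "docker-registry.example/mediawiki:1.36.0-wmf.12",
    "1.36.0-wmf.13": "docker-registry.example/mediawiki:1.36.0-wmf.13",
}

def aggregate_dockerfile(base: str = "docker-registry.example/php-fpm:latest") -> str:
    lines = [f"FROM {base}"]
    for version, image in sorted(VERSIONS.items()):
        # Each COPY --from pulls one prepared version tree into place.
        lines.append(
            f"COPY --from={image} /srv/mediawiki /srv/mediawiki/php-{version}"
        )
    return "\n".join(lines)

print(aggregate_dockerfile())
```

Because each COPY --from references an already-built layer, the aggregate build itself should be cheap; whether the resulting layers cache well across deployments is exactly the open question noted above.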

Apart from that, the wonderful work @akosiaris has done on the registry scaling task answered some questions we had: we will need to work on the infrastructure before we can push this to 100s of k8s workers, but there are ways to do so.

That's awesome to hear! I'm excited to dig more into what @akosiaris did there; I've only had a chance to look at it briefly.

Joe claimed this task.