
Improve performance of deployment to mw on k8s
Closed, Resolved (Public)

Description

A full image build (which should usually only happen for train deployments and/or l10n updates) will take a long time, and I think that is generally acceptable. Most of the time, however, even a simple backport deployment can take up to 20 minutes on k8s at the moment.

Specifically:

  • the image build/publish is fast (< 1 minute)
  • the deployments vary randomly between very fast (< 1 minute) and slow (~ 8 minutes)

The reason for the occasional slowness is that sometimes we allocate a pod to a k8s node that hasn't downloaded the "fat" base image before and needs to download and extract a 7 GB set of image layers. That takes a long time. Clearly this slowness is unacceptable for a deployer.

I see two paths to try to ease the problem:

Pre-pulling images

With every scap deployment, we distribute the command to all k8s nodes to pull the newest mediawiki multiversion image (and maybe prune old ones up to N versions before), then do a forced redeployment of the pods using some varying annotation, which should happen "as fast as possible" at that point.

The pre-pulling should only take a relatively long time when running on an image rebuilt from scratch, so I guess that means when running in the automated train presync process.
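The two steps described above (pull on every node, then force a redeployment by changing an annotation) can be sketched roughly as follows. This is only an illustration, not how scap implements it: it assumes the nodes use a Docker runtime (so `docker pull` is the pull command), and the annotation key `example.org/deploy-id`, the node names and the image reference are all hypothetical. The sketch only builds the commands and the patch; it does not execute anything.

```python
import json
from typing import Dict, List, Tuple


def build_pull_command(image: str) -> List[str]:
    """Command a node would run to fetch the image (assumes a Docker runtime)."""
    return ["docker", "pull", image]


def build_redeploy_patch(deploy_id: str) -> Dict:
    """Patch that changes a pod-template annotation.

    Any change to the pod template makes Kubernetes roll out new pods,
    so a value that varies per deployment forces a redeployment.
    """
    return {
        "spec": {
            "template": {
                "metadata": {
                    # Hypothetical annotation key, chosen for illustration.
                    "annotations": {"example.org/deploy-id": deploy_id}
                }
            }
        }
    }


def plan_deployment(
    nodes: List[str], image: str, deploy_id: str
) -> Tuple[Dict[str, List[str]], str]:
    """Return the per-node pull commands and the JSON patch to apply afterwards."""
    pulls = {node: build_pull_command(image) for node in nodes}
    patch = json.dumps(build_redeploy_patch(deploy_id), sort_keys=True)
    return pulls, patch


if __name__ == "__main__":
    # Hypothetical node names and image reference.
    pulls, patch = plan_deployment(
        ["node-a", "node-b"],
        "registry.example.org/mediawiki-multiversion:some-tag",
        "deploy-0001",
    )
    for node, cmd in pulls.items():
        print(node, " ".join(cmd))
    print(patch)
```

The patch string is in the shape accepted by `kubectl patch deployment <name> -p '<json>'`. Since the pull has already happened on every node by the time the patch is applied, the rollout no longer waits on image downloads.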

Use shared volumes on the k8s nodes, distribute code via scap

This is self-explanatory, I think: basically the idea would be to give up on building the code into the docker images, and just mount it as a local read-only volume in k8s.

Given that this second option wasn't preferred by most people, I'll try to explore the first option as a starter.

Event Timeline

Joe triaged this task as High priority. Nov 18 2022, 6:43 AM
Joe created this task.

Change 858543 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] role::kubernetes::wroker: allow scap to pre-pull mediawiki images

https://gerrit.wikimedia.org/r/858543

Please make sure not to pre-pull images on tainted nodes (which are masters and kask/sessionstore currently).
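One way to honour this is to filter the node list before building the pre-pull target list. The sketch below is an assumption about how that could look, not the approach taken in the patches: it expects node objects in the shape returned by `kubectl get nodes -o json` (the `items` list), where taints live under `spec.taints`, and it skips any node carrying at least one taint. The node names and taint key in the example are hypothetical.

```python
from typing import Any, Dict, List


def untainted_node_names(nodes: List[Dict[str, Any]]) -> List[str]:
    """Return the names of nodes that carry no taints at all."""
    names = []
    for node in nodes:
        taints = node.get("spec", {}).get("taints") or []
        if not taints:
            names.append(node["metadata"]["name"])
    return names


if __name__ == "__main__":
    # Hypothetical node objects, trimmed to the fields used above.
    example_nodes = [
        {"metadata": {"name": "worker-1"}, "spec": {}},
        {
            "metadata": {"name": "dedicated-1"},
            "spec": {
                "taints": [
                    {"key": "dedicated", "value": "kask", "effect": "NoSchedule"}
                ]
            },
        },
    ]
    print(untainted_node_names(example_nodes))
```

Skipping every tainted node is the simplest rule; a stricter variant could compare taints against the tolerations of the mediawiki pods, which is not attempted here.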

Change 858987 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] scap::dsh: add kubernetes-workers dsh list

https://gerrit.wikimedia.org/r/858987

Change 858988 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] scap: add mw on k8s dsh list

https://gerrit.wikimedia.org/r/858988

Change 858543 merged by Giuseppe Lavagetto:

[operations/puppet@production] role::kubernetes::wroker: allow scap to pre-pull mediawiki images

https://gerrit.wikimedia.org/r/858543

Change 858987 merged by Giuseppe Lavagetto:

[operations/puppet@production] scap::dsh: add kubernetes-workers dsh list

https://gerrit.wikimedia.org/r/858987

Change 858988 merged by Giuseppe Lavagetto:

[operations/puppet@production] scap: add mw on k8s dsh list

https://gerrit.wikimedia.org/r/858988

Status update: with pre-pulling activated, deployment times for a small patch are on the order of 2-3 minutes on k8s, plus maybe another minute to build the image.

We've turned on image building for all sync operations on the deployment hosts; next week we'll also start deploying automatically to k8s with every scap deployment.

We can consider this task resolved.