Improve performance of deployment to mw on k8s
Closed, Resolved, Public

Assigned To

Authored By

	Joe
	Nov 18 2022, 6:43 AM

Description

A full image build (which should usually happen only for train deployments and/or l10n updates) will take a long time, and I think that is generally acceptable. Most of the time, though, even a simple backport deployment can take up to 20 minutes on k8s at the moment.

Specifically:

The image build/publish is fast (< 1 minute).
The deployments vary randomly between very fast (< 1 minute) and slow (~ 8 minutes).

The reason for the occasional slowness is that sometimes we allocate a pod to a k8s node that hasn't downloaded the "fat" base image before and needs to download and extract a 7 GB set of image layers. That takes a long time. Clearly this slowness is unacceptable for a deployer.

I see two paths to try to ease the problem:

Pre-pulling images

With every scap deployment, we distribute a command to all k8s nodes to pull the newest mediawiki multiversion image (and maybe prune old ones, keeping up to N previous versions), then force a redeployment of the pods using some varying annotation, which should happen "as fast as possible" at that point.

Pre-pulling should take a relatively long time only when the image has been rebuilt from scratch, which I guess means during the automated train presync process.
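
As an illustration of the two pieces described above, here is a minimal Python sketch. It is not the scap implementation: the function names and the annotation key `example.org/deploy-id` are hypothetical, and "prune" is assumed to mean "keep the N newest tags and drop the rest". The only Kubernetes behaviour it relies on is that changing anything under `spec.template` of a Deployment causes the pods to be rolled.

```python
from typing import Dict, List


def images_to_prune(tags_newest_first: List[str], keep: int) -> List[str]:
    """Return the image tags that fall outside the `keep` newest versions.

    Assumes the caller already sorted the tags from newest to oldest.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    return tags_newest_first[keep:]


def redeploy_patch(deploy_id: str) -> Dict:
    """Build a Deployment patch that only changes a pod-template annotation.

    A different annotation value on each deployment changes the pod
    template, so the pods are recreated on nodes that (ideally) already
    hold the pre-pulled image.
    """
    return {
        "spec": {
            "template": {
                "metadata": {
                    # Hypothetical annotation key, chosen for this example.
                    "annotations": {"example.org/deploy-id": deploy_id}
                }
            }
        }
    }
```

With five tags on a node and `keep=3`, `images_to_prune` returns the two oldest tags; the patch dictionary could then be serialized and applied with whatever tooling performs the deployment.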

Use shared volumes on the k8s nodes, distribute code via scap

This is self-explanatory, I think: the idea would be to give up on building the code into the docker images and just mount it as a local read-only volume in k8s.
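
For reference, a rough Python sketch of what such a mount could look like in a pod spec. The helper name, the volume name and the `/srv/mediawiki` path are assumptions made for this example; the `hostPath` and `readOnly` fields are the standard Kubernetes ones.

```python
from typing import Dict, Tuple


def code_volume(host_path: str, mount_path: str,
                name: str = "mediawiki-code") -> Tuple[Dict, Dict]:
    """Return a (volume, volumeMount) pair for a read-only hostPath mount.

    The volume entry goes under spec.volumes of the pod, the mount entry
    under volumeMounts of the container that needs the code.
    """
    volume = {
        "name": name,
        # The directory must already exist on the node (e.g. synced by scap).
        "hostPath": {"path": host_path, "type": "Directory"},
    }
    mount = {"name": name, "mountPath": mount_path, "readOnly": True}
    return volume, mount


# Hypothetical paths, for illustration only.
volume, mount = code_volume("/srv/mediawiki", "/srv/mediawiki")
```

The trade-off is that the code on disk and the running image are then versioned separately, which is part of why this option is listed as the alternative.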

Given this second option wasn't preferred by most people, I'll try to explore the first option as a starter.

Details

Subject	Repo	Branch	Lines +/-
scap: add mw on k8s dsh list	operations/puppet	production	+7 -4
scap::dsh: add kubernetes-workers dsh list	operations/puppet	production	+2 -0
role::kubernetes::wroker: allow scap to pre-pull mediawiki images	operations/puppet	production	+62 -0


	Title	Reference	Author	Source Branch	Dest Branch
	kubernetes: allow pre-pulling the multiversion image on k8s nodes	repos/releng/scap!30	oblivian	pre-pull	master
	kubernetes: allow pre-pulling the multiversion image on k8s nodes	repos/releng/scap!27	oblivian	pre-pull	master


Related Objects

Status	Assigned	Task
Stalled	None	T255792 Quibble runs core:unit tests twice!
Open	None	T328919 Upgrade to PHPUnit 10
Open	None	T338103 Micro-optimize ApiResult::isMetadataKey with str_starts_with once we support PHP8+
Open	None	T328921 Drop PHP 7.4 support from MediaWiki
Stalled	None	T334726 Use return type `never` in Wikibase
Open	None	T328922 Drop PHP 8.0 support from MediaWiki
Stalled	None	T319055 Upgrade to psr/container 2.x
Stalled	Krinkle	T319432 Migrate WMF production from PHP 7.4 to PHP 8.1
Open	None	T291916 Tracking task for Bullseye migrations in production
Stalled	None	T356293 Migrate MW appservers' base images to bullseye
Open	None	T290536 Serve production traffic via Kubernetes
Resolved	Joe	T323349 Improve performance of deployment to mw on k8s

Event Timeline

Joe triaged this task as High priority. Nov 18 2022, 6:43 AM

Joe created this task.

Joe moved this task from Incoming 🐫 to Doing 😎 on the serviceops board. Nov 18 2022, 6:47 AM

Joe claimed this task. Nov 18 2022, 8:14 AM

Change 858543 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] role::kubernetes::wroker: allow scap to pre-pull mediawiki images

https://gerrit.wikimedia.org/r/858543

gerritbot added a project: Patch-For-Review. Nov 18 2022, 8:40 AM

Please make sure not to pre-pull images on tainted nodes (which are masters and kask/sessionstore currently).
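
One way to express that constraint, as a hedged Python sketch: it assumes node objects in the JSON shape returned by the Kubernetes API (taints under `spec.taints`) and treats any taint as a reason to skip the node. The function name is hypothetical, and this is not necessarily how the target list is built in the patches on this task.

```python
from typing import Dict, List


def prepull_targets(nodes: List[Dict]) -> List[str]:
    """Return the names of nodes that carry no taints at all.

    A missing or empty spec.taints list means the node is untainted.
    """
    return [
        node["metadata"]["name"]
        for node in nodes
        if not node.get("spec", {}).get("taints")
    ]
```

A stricter variant could look only at taints whose effect is `NoSchedule` or `NoExecute`, if soft taints should still receive the image.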

Change 858987 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] scap::dsh: add kubernetes-workers dsh list

https://gerrit.wikimedia.org/r/858987

Change 858988 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] scap: add mw on k8s dsh list

https://gerrit.wikimedia.org/r/858988

Change 858543 merged by Giuseppe Lavagetto:

[operations/puppet@production] role::kubernetes::wroker: allow scap to pre-pull mediawiki images

https://gerrit.wikimedia.org/r/858543

Change 858987 merged by Giuseppe Lavagetto:

[operations/puppet@production] scap::dsh: add kubernetes-workers dsh list

https://gerrit.wikimedia.org/r/858987

Change 858988 merged by Giuseppe Lavagetto:

[operations/puppet@production] scap: add mw on k8s dsh list

https://gerrit.wikimedia.org/r/858988

Maintenance_bot removed a project: Patch-For-Review. Nov 22 2022, 10:30 AM

Status update: with pre-pulling activated, deployment times for a small patch are on the order of 2-3 minutes on k8s, plus maybe another minute to build the image.

We've turned on image building for all sync operations on the deployment hosts; next week we'll also start deploying automatically to k8s with every scap deployment.

We can consider this task resolved.
