
Helm deployment of MediaWiki now takes 6 minutes
Closed, Declined · Public

Description

While conducting the MediaWiki train last week and running some backports this week, I noticed that it now takes 10+ minutes to deploy a change. I initially thought building the image, building the localization cache, or pulling the images might be the cause of the slowness, but after looking at the log of a deployment I did this morning, I think most of the time is spent in the helm deployment:

08:48:54 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-wikifunctions (duration: 00m 24s)
08:49:13 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-wikifunctions (duration: 00m 44s)
08:52:30 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-jobrunner (duration: 04m 00s)
08:52:30 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid (duration: 04m 00s)
08:52:42 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-int (duration: 04m 13s)
08:52:45 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext (duration: 04m 15s)
08:52:48 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-web (duration: 04m 18s)
08:54:05 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid (duration: 05m 35s)
08:54:21 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-jobrunner (duration: 05m 52s)
08:54:23 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-int (duration: 05m 53s)
08:54:36 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext (duration: 06m 07s)
08:54:37 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-web (duration: 06m 07s)

That is up to 6 minutes, which leads me to wonder what might be happening under the hood.
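A quick way to rank the slow services is to pull the per-service durations straight out of that scap output. This is a minimal sketch, assuming only the "Finished Running … (duration: MMm SSs)" line format shown above:

import re
import sys

# Matches the "Finished Running helmfile ... (duration: MMm SSs)" lines that
# scap prints, as in the log excerpt above.
LINE_RE = re.compile(
    r"Finished Running helmfile -e (?P<dc>\S+) .* in "
    r"/srv/deployment-charts/helmfile\.d/services/(?P<service>\S+) "
    r"\(duration: (?P<min>\d+)m (?P<sec>\d+)s\)"
)

def durations(log_text):
    """Yield (datacenter, service, seconds) for every helmfile run in the log."""
    for m in LINE_RE.finditer(log_text):
        yield m["dc"], m["service"], int(m["min"]) * 60 + int(m["sec"])

if __name__ == "__main__":
    for dc, service, seconds in sorted(durations(sys.stdin.read()), key=lambda r: -r[2]):
        print(f"{seconds:4d}s  {dc:5s}  {service}")

Feeding it the excerpt above would put the codfw mw-web and mw-api-ext runs at the top of the list.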

Since scap logs timing data, we can graph in Kibana how long one of those commands took over a period of time. Taking roughly the last 3 months for a single command:

Dashboard view (time in nanoseconds):

scap_helm_timing.png (482×920 px, 66 KB)

The mean (yellow) starts at about 1m10s; the most recent value is at about 5 minutes.

Some bumps are noticeable on January 24th, February 15th, March 4th and March 6th, and they accumulate.
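For reference, the graph above can be reproduced outside Kibana with a date-histogram aggregation over the scap timing events. The sketch below is only illustrative: the endpoint, index pattern and field names ("duration" in nanoseconds, the command name in "message") are assumptions, not the actual logstash schema.

from elasticsearch import Elasticsearch

# Assumed endpoint, index pattern and field names; adjust to whatever
# scap's timing events actually use.
es = Elasticsearch("https://logstash.example.org:9200")

resp = es.search(
    index="logstash-*",                        # assumed index pattern
    size=0,
    query={"match": {"message": "helmfile"}},  # assumed way to select one command
    aggs={
        "per_day": {
            "date_histogram": {"field": "@timestamp", "calendar_interval": "1d"},
            "aggs": {"mean_ns": {"avg": {"field": "duration"}}},  # duration in ns
        }
    },
)

for bucket in resp["aggregations"]["per_day"]["buckets"]:
    mean_ns = bucket["mean_ns"]["value"] or 0
    print(bucket["key_as_string"], f"{mean_ns / 1e9 / 60:.1f} min")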

This task is merely to capture my observation; it does not need to be acted on anytime soon given we are still doing the migration.

Event Timeline

I wanted to point out that as the migration progresses and the size of MediaWiki deployments in WikiKube increases, it is inevitable that the deployment times for MW-on-K8s will increase too. Right now, we upgrade to each new version in chunks of 3% (16d6e717a7a) of the total. This is a relatively recent development; in the past we upgraded in larger chunks, since the overall size of each deployment was smaller. I expect those numbers to increase further, but I also expect the numbers for scap deploying to the "legacy" infrastructure to decrease. Not proportionally, of course.

I wouldn't worry about the actual numbers per deployment too much. What's more important to me, UX-wise, is that the wall-clock time of a scap deploy remains acceptable.

I used scap backport to deploy a mediawiki-config change today. The sync part of the operation took 15 minutes to complete. As an occasional user of scap, I felt this was "too long": it limits us to 4 backports per hour. I think under 10 minutes is a reasonable target. Anyway, here's a manual time profile of the operation:

sync-prod-k8s:                   06m 17s
php-fpm-restarts:                02m 51s (bm)
sync-canaries-k8s:               00m 54s
php-fpm-restarts (canaries):     00m 48s (bm)
check-testservers:               00m 39s
build-and-push-container-images: 00m 30s
sync-apaches:                    00m 30s (bm)
sync-testservers-k8s:            00m 27s
canary traffic wait:             00m 20s
sync-testservers:                00m 16s (bm)
sync-masters:                    00m 11s
sync-proxies:                    00m 07s (bm)

Items marked (bm) are bare-metal-only operations that should disappear eventually. That's 4m32s of time.
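As a quick sanity check on that 4m32s figure, summing the (bm) rows:

# Quick arithmetic check of the "(bm)" total quoted above.
bare_metal = {
    "php-fpm-restarts":            (2, 51),
    "php-fpm-restarts (canaries)": (0, 48),
    "sync-apaches":                (0, 30),
    "sync-testservers":            (0, 16),
    "sync-proxies":                (0, 7),
}

total = sum(m * 60 + s for m, s in bare_metal.values())
print(f"{total // 60}m {total % 60:02d}s")   # -> 4m 32s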

Moving this to our radar, as I don't think Release-Engineering-Team can do anything about this directly right now; this is just how long it takes to deploy, and it sounds like it's expected during the migration to mw-on-k8s. Speeding this up will still be important as the migration stabilizes, though.

Also tagging in serviceops in case there's anything that can be done in the near term.

Moving it to our radar too, as we intend to revisit various parts of all of this (e.g. how we do MultiVersion once we are no longer constrained by the legacy infra), but we don't have anything concrete right now.

That being said, there is one thing that struck me as weird, and it's probably a historical artifact of how scap does rollbacks, which can hopefully be fixed later on.

I used scap backport to deploy a mediawiki-config change today.
sync-prod-k8s: 06m 17s

Arguably, this ^ shouldn't take so long. A rollback can be just a signal to the platform to reuse the previous helm release (that's what helm rollback does, by the way), not an entire new code deployment. I suspect that:

  • there is some entanglement with the train's group0, group1, group2 (at least in some cases) that scap is aware of, which makes no-code rollbacks a tad more difficult
  • we lack the necessary glue code (and maybe some state) to just make scap rollback switch to the previous image.

Note that due to the use of helmfile, which doesn't natively support rollback even though helm does, we might end up having to be creative. We apparently already have been, a bit, in scap: I can see code that does helm rollback in case of failed pending-upgrade states.
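For illustration only, an image-level rollback along those lines could boil down to a couple of helm calls. The wrapper below is a hypothetical sketch (the function names and the release/namespace arguments are made up), not scap's actual rollback code:

import json
import subprocess

def helm(*args):
    """Run a helm command and return its stdout."""
    return subprocess.run(
        ["helm", *args], check=True, capture_output=True, text=True
    ).stdout

def rollback_if_needed(release, namespace):
    """Hypothetical sketch: roll a release back to its previous revision
    instead of re-deploying the code, e.g. after a failed upgrade."""
    status = json.loads(helm("status", release, "-n", namespace, "-o", "json"))
    if status["info"]["status"] in ("failed", "pending-upgrade"):
        # `helm rollback <release>` with no revision targets the previous one.
        helm("rollback", release, "-n", namespace, "--wait")
        return True
    return False

helm rollback with no revision argument targets the previous revision, which is essentially the "signal to the platform" semantics described above; the helmfile-level creativity would mostly be in mapping the helmfile release selectors onto plain helm release names.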

I am declining the task given that it is on both teams' radar and a 6-minute Kubernetes deployment seems to be the expected duration (due to the 3% chunks and the CPU usage pressure on the cluster, as explained above in T360403#9641095).