Provide some feedback in scap whilst waiting for helmfile deploys to complete
Open, Needs Triage; Public

Description

Now that the majority (and soon, all) of the 'deploy' work of scap is for MW-on-k8s, the long pause between the helmfile start and end logs with no intervening updates is rather unnerving:

20:14:35 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid
… long 5m wait, deployer worries that prod has crashed, etc.
20:19:27 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid (duration: 04m 52s)

There's also T325530 (scap: hide helmfile operations behind a progress bar), which is about grouping the clusters together. That would also be nice, but this task is more about getting some feedback while the helmfile operation is running.

Details

Title: Simple k8s deployment progress reporting
Reference: repos/releng/scap!292
Author: dancy
Source Branch: master-I609482aa2fed02a864543b5c3987d0dad91de254
Dest Branch: master

Event Timeline

@Dreamy_Jazz and I were also thinking something similar the other day:

<Lucas_WMDE> 	would be cool to have more visibility into it though, like the in-flight / ok / fail / left numbers for bare-metal steps 
…
<Dreamy_Jazz> 	It would be nice to have more visual output on the command that does the k8s restarts (i.e. a progress bar of some kind)

For comparison, bare-metal deploys show a live number of “in-flight”, “ok”, “fail” and “left” servers:

php-fpm-restart: 100% (in-flight: 0; ok: 141; fail: 0; left: 0)

It looks like it’s possible to get similar information out of Kubernetes:

lucaswerkmeister-wmde@deploy1002 ~ $ kube_env mw-web eqiad
lucaswerkmeister-wmde@deploy1002 ~ $ kubectl get deployments
NAME                  READY     UP-TO-DATE   AVAILABLE   AGE
mw-web.eqiad.canary   7/7       7            7           319d
mw-web.eqiad.main     208/211   142          208         319d

I’m guessing that in this snapshot, “in-flight” would be 3 (211-208 – 211 pods in total, 208 of which are ready, so the other three must be the ones currently being restarted); “ok” would be 142 (the number of pods with an up-to-date image); “fail” might not have a direct equivalent(?); and “left” would be 69 (211-142 – all the pods that don’t have an up-to-date image yet) or 66 (208-142 – if the “in-flight” ones don’t count as “left”).
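
The mapping guessed at above can be written down as a small sketch. It assumes the default column layout of `kubectl get deployments` shown in the snapshot (NAME, READY as ready/total, UP-TO-DATE, AVAILABLE, AGE). The names `Progress`, `parse_row` and `format_progress` are made up for illustration and are not part of scap. "fail" is left out because it has no obvious equivalent, and "left" uses the 211-142 reading.

```python
from typing import NamedTuple


class Progress(NamedTuple):
    """Hypothetical bare-metal-style counters for one k8s deployment."""
    name: str
    total: int
    in_flight: int
    ok: int
    left: int


def parse_row(row: str) -> Progress:
    """Parse one data row of the default `kubectl get deployments` output.

    Assumes exactly five whitespace-separated columns:
    NAME, READY (ready/total), UP-TO-DATE, AVAILABLE, AGE.
    """
    name, ready_total, up_to_date, _available, _age = row.split()
    ready, total = (int(n) for n in ready_total.split("/"))
    ok = int(up_to_date)
    return Progress(
        name=name,
        total=total,
        in_flight=total - ready,  # pods that exist but are not ready yet
        ok=ok,                    # pods already on the up-to-date image
        left=total - ok,          # pods still on the old image
    )


def format_progress(p: Progress) -> str:
    """Render the counters in the style of the bare-metal progress line."""
    percent = 100 * p.ok // p.total if p.total else 100
    return f"{p.name}: {percent}% (in-flight: {p.in_flight}; ok: {p.ok}; left: {p.left})"


row = "mw-web.eqiad.main     208/211   142          208         319d"
print(format_progress(parse_row(row)))
# mw-web.eqiad.main: 67% (in-flight: 3; ok: 142; left: 69)
```

Whether "in-flight" pods should also be subtracted from "left" (giving 66 instead of 69) is a presentation choice; the sketch picks the simpler variant.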

But I don’t really have any idea how to usefully present this information when all the helmfile commands are running in parallel :S

It looks like it’s possible to get similar information out of Kubernetes:

Slightly refined version until something better comes along:

lucaswerkmeister-wmde@deploy1002 ~ $ headers=; for dc in eqiad codfw; do for service in mw-web mw-api-ext mw-api-int mw-jobrunner mw-parsoid mw-wikifunctions; do kube_env $service $dc; kubectl get deployment $service.$dc.main $headers; headers=--no-headers; done; done | sed -E --unbuffered 's/ +/\t/g' | expand -t30,40,52,64
NAME                          READY     UP-TO-DATE  AVAILABLE   AGE
mw-web.eqiad.main             223/223   223         223         329d
mw-api-ext.eqiad.main         140/140   140         140         329d
mw-api-int.eqiad.main         250/250   250         250         329d
mw-jobrunner.eqiad.main       180/180   180         180         329d
mw-parsoid.eqiad.main         141/141   141         141         58d
mw-wikifunctions.eqiad.main   2/2       2           2           190d
mw-web.codfw.main             223/223   223         223         329d
mw-api-ext.codfw.main         140/140   140         140         329d
mw-api-int.codfw.main         250/250   250         250         329d
mw-jobrunner.codfw.main       180/180   180         180         329d
mw-parsoid.codfw.main         141/141   141         141         57d
mw-wikifunctions.codfw.main   2/2       2           2           190d
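
One possible way to cope with several deployments rolling out in parallel is to sum their counts into a single line in the style of the bare-metal output. The following is only a sketch under that assumption. `summarize` and the `k8s-deploy` label are hypothetical, it reads the plain-text table format shown above rather than talking to the Kubernetes API, and it is not how scap reports progress.

```python
def summarize(table: str, label: str = "k8s-deploy") -> str:
    """Sum the rows of `kubectl get deployment` text output into one line.

    `label` is a made-up prefix for illustration. Header rows (starting
    with NAME) and blank lines are skipped, so the output of several
    concatenated kubectl calls can be passed in as-is.
    """
    total = ready = updated = 0
    for line in table.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        ready_count, total_count = fields[1].split("/")  # READY column
        ready += int(ready_count)
        total += int(total_count)
        updated += int(fields[2])                        # UP-TO-DATE column
    percent = 100 * updated // total if total else 100
    return (
        f"{label}: {percent}% "
        f"(in-flight: {total - ready}; ok: {updated}; left: {total - updated})"
    )


# Sample input: the mid-rollout snapshot quoted earlier in this task.
snapshot = """\
NAME                  READY     UP-TO-DATE   AVAILABLE   AGE
mw-web.eqiad.canary   7/7       7            7           319d
mw-web.eqiad.main     208/211   142          208         319d
"""
print(summarize(snapshot))
# k8s-deploy: 68% (in-flight: 3; ok: 149; left: 69)
```

Summing hides which service is lagging, so a real implementation might prefer one line per service or per datacenter instead.
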
dancy changed the task status from Open to In Progress. (Fri, Apr 26, 5:00 PM)
dancy claimed this task.
dancy triaged this task as Medium priority.
dancy changed the task status from In Progress to Open. (Fri, Apr 26, 5:54 PM)
dancy removed dancy as the assignee of this task.
dancy raised the priority of this task from Medium to Needs Triage.
dancy subscribed.

I completed related work in T325530 and I'm inclined to leave it at that for now.