
[builds-builder,builds-api] After the upgrade build actions take >10s
Closed, Resolved · Public

Description

On toolforge at least there's a noticeable increase in the delay for any build action, investigate why and solve if possible.

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2024-10-08T12:27:13Z] <dcaro> upgrade finished, build actions have become slower than usual (T376710), running tests and investigating

dcaro triaged this task as High priority. Tue, Oct 8, 12:44 PM

Mentioned in SAL (#wikimedia-cloud-feed) [2024-10-08T12:52:46Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component toolforge-weld (T376710)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-10-08T12:52:50Z] <dcaro@cloudcumin1001> END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component toolforge-weld (T376710)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-10-08T12:55:30Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component toolforge-weld (T376710)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-10-08T12:59:39Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component toolforge-weld (T376710)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-10-08T13:27:23Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component toolforge-weld (T376710)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-10-08T13:34:22Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component toolforge-weld (T376710)

Done a bit of investigating, the issue seems to be the way the conversion webhook works in combination with k8s filtering.

When getting all the pipelineruns for a user, we do something similar to:

kubectl get pipelinerun -n image-build -l user=<tool>

As all the pipelineruns live in the image-build namespace, the API server has to:

  • Convert the pipelinerun to the latest version
  • Then check for the label

For *every* pipelinerun, *every* time we do a request; that is currently 762 requests to the conversion webhook for every single request to the builds-api :/
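
As a rough cross-check of where that number comes from, counting every pipelinerun stored in the namespace should roughly match the number of conversion calls a single list triggers (a sketch only; the exact count drifts as builds get created and cleaned up):

kubectl get pipelineruns -n image-build -o name | wc -l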

If we were able to filter before the webhook, that amount would be reduced considerably:

tools.wm-lol@tools-bastion-13:~$ time kubectl get pipelineruns -n image-build  -o name -l user=wm-lol
pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-76xwf
pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-9fj6l
pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-9m6nv
pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-dk4gj
pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-ftwfw
pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-fw6kf
pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-plmlq

real    0m11.210s
user    0m0.193s
sys     0m0.052s


VS

tools.wm-lol@tools-bastion-13:~$ time kubectl get -n image-build pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-76xwf pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-9fj6l pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-9m6nv pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-dk4gj pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-ftwfw pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-fw6kf pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-plmlq
NAME                                  SUCCEEDED   REASON      STARTTIME   COMPLETIONTIME
wm-lol-buildpacks-pipelinerun-76xwf   True        Succeeded   20h         20h
wm-lol-buildpacks-pipelinerun-9fj6l   True        Succeeded   57d         57d
wm-lol-buildpacks-pipelinerun-9m6nv   True        Succeeded   43d         43d
wm-lol-buildpacks-pipelinerun-dk4gj   False       Failed      176d        176d
wm-lol-buildpacks-pipelinerun-ftwfw   True        Succeeded   71d         71d
wm-lol-buildpacks-pipelinerun-fw6kf   True        Succeeded   77d         77d
wm-lol-buildpacks-pipelinerun-plmlq   False       Failed      176d        176d

real    0m0.654s
user    0m0.251s
sys     0m0.117s

Looking into ways of minimizing that. In the meantime, I've scaled up the webhook; while it does not seem to have a measurable effect on the time to complete a single request, it does allow parallel requests to be served without problems.
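
For the record, scaling the webhook up is just a deployment scale; a minimal sketch, assuming the default Tekton deployment and namespace names (they may differ on our clusters) and an arbitrary replica count:

kubectl scale deployment tekton-pipelines-webhook -n tekton-pipelines --replicas=3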

This would also fix itself over time, as the old builds get cleaned up (so the stored version and the served version become the same), or we can try migrating the stored objects to the newer version, maybe using https://github.com/kubernetes-sigs/kube-storage-version-migrator (mentioned in https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/#upgrade-existing-objects-to-a-new-stored-version).
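
A minimal sketch of the manual route described in the linked Kubernetes docs (re-writing each object persists it again at the current storage version; the CRD name is taken from Tekton, and the patch step needs a kubectl recent enough to support --subresource):

# re-persist every pipelinerun so it gets stored at the current (v1) storage version
kubectl get pipelineruns -n image-build -o json | kubectl replace -f -
# once no objects are stored at the old version any more, drop it from the CRD status
kubectl patch crd pipelineruns.tekton.dev --subresource=status --type=merge \
  -p '{"status":{"storedVersions":["v1"]}}'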

Mentioned in SAL (#wikimedia-cloud) [2024-10-14T09:14:12Z] <dcaro> migrating pipelineruns stored versions to v1 (T376710)

dcaro moved this task from Next Up to Done on the Toolforge (Toolforge iteration 15) board.

All the pipelineruns were upgraded to the v1 storage version; we are back to <2s response times.
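
For future reference, a quick way to double-check which versions are still recorded as stored for the CRD (after the migration this should show only v1):

kubectl get crd pipelineruns.tekton.dev -o jsonpath='{.status.storedVersions}'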