On Toolforge, at least, there is a noticeable increase in the delay for any build action. Investigate why and, if possible, fix it.
Description
Status | Subtype | Assigned | Task
---|---|---|---
In Progress | | Raymond_Ndibe | T362867 [infra,k8s] Upgrade Toolforge Kubernetes to version 1.28
Resolved | | dcaro | T374908 [builds-builder,builds-api] Upgrade tekton
Resolved | | dcaro | T376710 [builds-builder,builds-api] After the upgrade build actions take >10s
Event Timeline
Mentioned in SAL (#wikimedia-cloud) [2024-10-08T12:27:13Z] <dcaro> upgrade finished, build actions have become slower than usual (T376710), running tests and investigating
dcaro opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/59
api_client: increase timeout to 20s
dcaro merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/59
api_client: increase timeout to 20s
dcaro opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/60
d/changelog: bump to 1.6.4
Mentioned in SAL (#wikimedia-cloud-feed) [2024-10-08T12:52:46Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component toolforge-weld (T376710)
Mentioned in SAL (#wikimedia-cloud-feed) [2024-10-08T12:52:50Z] <dcaro@cloudcumin1001> END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component toolforge-weld (T376710)
dcaro closed https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/60
d/changelog: bump to 1.6.4
dcaro opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/61
d/changelog: bump to 1.6.4
Mentioned in SAL (#wikimedia-cloud-feed) [2024-10-08T12:55:30Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component toolforge-weld (T376710)
Mentioned in SAL (#wikimedia-cloud-feed) [2024-10-08T12:59:39Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component toolforge-weld (T376710)
Mentioned in SAL (#wikimedia-cloud-feed) [2024-10-08T13:27:23Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component toolforge-weld (T376710)
Mentioned in SAL (#wikimedia-cloud-feed) [2024-10-08T13:34:22Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component toolforge-weld (T376710)
dcaro merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/61
d/changelog: bump to 1.6.4
Did a bit of investigating; the issue seems to be the way the conversion webhook works in combination with Kubernetes filtering.
When getting all the pipelineruns for a user, we do something similar to:
kubectl get pipelinerun -n image-build -l user=<tool>
As all the pipelineruns live in the image-build namespace, the API server has to:
- Convert each pipelinerun to the latest version
- Then check it against the label selector

It does this for *every* pipelinerun on *every* request; that is currently 762 requests to the conversion webhook for each request to the builds-api :/
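The conversions happen because old objects are persisted at a stale storage version. Which versions are actually stored can be inspected on the CRD itself (a sketch, assuming the standard Tekton CRD name; the example output is illustrative):

```shell
# List the API versions present among stored PipelineRun objects.
# Any object stored at a version other than the one being served must
# pass through the conversion webhook on every list/get.
kubectl get crd pipelineruns.tekton.dev \
  -o jsonpath='{.status.storedVersions}{"\n"}'
# e.g. ["v1beta1","v1"] means some objects still need conversion
```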
If we were able to filter before the webhook, that amount would be reduced considerably:
tools.wm-lol@tools-bastion-13:~$ time kubectl get pipelineruns -n image-build -o name -l user=wm-lol
pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-76xwf
pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-9fj6l
pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-9m6nv
pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-dk4gj
pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-ftwfw
pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-fw6kf
pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-plmlq

real	0m11.210s
user	0m0.193s
sys	0m0.052s

### VS

tools.wm-lol@tools-bastion-13:~$ time kubectl get -n image-build \
    pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-76xwf \
    pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-9fj6l \
    pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-9m6nv \
    pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-dk4gj \
    pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-ftwfw \
    pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-fw6kf \
    pipelinerun.tekton.dev/wm-lol-buildpacks-pipelinerun-plmlq
NAME                                  SUCCEEDED   REASON      STARTTIME   COMPLETIONTIME
wm-lol-buildpacks-pipelinerun-76xwf   True        Succeeded   20h         20h
wm-lol-buildpacks-pipelinerun-9fj6l   True        Succeeded   57d         57d
wm-lol-buildpacks-pipelinerun-9m6nv   True        Succeeded   43d         43d
wm-lol-buildpacks-pipelinerun-dk4gj   False       Failed      176d        176d
wm-lol-buildpacks-pipelinerun-ftwfw   True        Succeeded   71d         71d
wm-lol-buildpacks-pipelinerun-fw6kf   True        Succeeded   77d         77d
wm-lol-buildpacks-pipelinerun-plmlq   False       Failed      176d        176d

real	0m0.654s
user	0m0.251s
sys	0m0.117s
Looking into ways of minimizing that. In the meantime I've scaled up the webhook; though it does not seem to have a measurable effect on the time to complete a single request, it allows parallel requests to be served without problems.
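The scale-up amounts to running more webhook replicas; something along these lines (deployment and namespace names assumed from a default Tekton install, replica count illustrative):

```shell
# Run additional replicas of the Tekton webhook so concurrent
# conversion requests are spread across pods instead of queueing
# on a single one.
kubectl scale deployment tekton-pipelines-webhook \
  -n tekton-pipelines --replicas=3
```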
This will be fixed over time as the old builds get cleaned up (so the stored version and the served version become the same), or we can try upgrading the stored objects to the newer version, maybe using https://github.com/kubernetes-sigs/kube-storage-version-migrator (mentioned in https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/#upgrade-existing-objects-to-a-new-stored-version).
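The manual migration the linked docs describe boils down to a no-op read/write of every object (reading serves the new version, writing persists it), then trimming the CRD's stored-version list. A sketch, assuming the standard Tekton CRD name and a recent enough kubectl for `--subresource`:

```shell
# Rewrite every PipelineRun so it is persisted at the current
# storage version: kubectl reads back the served (new) version
# and replace writes that version to etcd.
kubectl get pipelineruns -n image-build -o yaml | kubectl replace -f -

# Once no old-version objects remain, drop the stale entry from
# status.storedVersions so future upgrades can remove it entirely.
kubectl patch crd pipelineruns.tekton.dev --subresource=status \
  --type=merge -p '{"status":{"storedVersions":["v1"]}}'
```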
dcaro opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/554
builds-builder: add script to migrate pipelineruns from v1beta1 to v1
Mentioned in SAL (#wikimedia-cloud) [2024-10-14T09:14:12Z] <dcaro> migrating pipelineruns stored versions to v1 (T376710)
dcaro merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/554
builds-builder: add script to migrate pipelineruns from v1beta1 to v1
All the pipelineruns were upgraded to the v1 storage version; we are back to <2s response times.