[components-api] Queue builds when the build queue is full
Open, In Progress, High, Public

Description

Currently the components-api just fails the build if the build queue is full. Instead, it should wait for the queue to shrink and retry triggering the build. The error it currently gives is:

Got exception: Some builds failed to start: export-statistics(error:409 Client Error: Conflict for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/builds/v1/tool/cluebotng-review/builds) grafana-alloy(error:409 Client Error: Conflict for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/builds/v1/tool/cluebotng-review/builds) grant-review-access-from-wikipedia-rights(error:409 Client Error: Conflict for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/builds/v1/tool/cluebotng-review/builds) import-training-data(error:409 Client Error: Conflict for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/builds/v1/tool/cluebotng-review/builds) irc-relay(error:409 Client Error: Conflict for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/builds/v1/tool/cluebotng-review/builds) mark-edits-as-deleted(error:409 Client Error: Conflict for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/builds/v1/tool/cluebotng-review/builds) mark-edits-as-having-data(error:409 Client Error: Conflict for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/builds/v1/tool/cluebotng-review/builds) update-edit-classifications(error:409 Client Error: Conflict for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/builds/v1/tool/cluebotng-review/builds)

Event Timeline

dcaro triaged this task as High priority.

raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/131

[deploy_task, tool_handlers] queue deployments to allow creation of multiple deployments at once

Raymond_Ndibe changed the task status from Open to In Progress. Sep 12 2025, 6:44 PM

raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/143

[build, api] support build queueing beyond max_parallel build config

Raymond_Ndibe changed the task status from In Progress to Stalled. Oct 22 2025, 2:41 AM

Hit this again today while bumping releases on nearly everything:

Deployment ID: 20251111-152523-4chqpjcf8c
Created: 20251111-152523
Status: failed
Long status: 
  Got exception: Some builds failed to start: redis(error:409 Client Error: Conflict for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/builds/v1/tool/cluebotng/builds) report-interface(error:409 Client Error: Conflict for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/builds/v1/tool/cluebotng/builds)

Builds:
  backup-database(pending): id:cluebotng-buildpacks-pipelinerun-tb29g Not started yet
  bot(skipped): id:cluebotng-buildpacks-pipelinerun-pbgmc Reusing existing build
  core(pending): id:cluebotng-buildpacks-pipelinerun-6wnh9 Not started yet
  grafana-alloy(pending): id:cluebotng-buildpacks-pipelinerun-cjd5t Not started yet
  irc-relay(pending): id:cluebotng-buildpacks-pipelinerun-f2t68 Not started yet
  prune-backups(skipped): id:no-build-needed Component re-uses build from backup-database
  pushgateway(skipped): id:cluebotng-buildpacks-pipelinerun-xggpd Reusing existing build
  redis(failed): id:no-id-yet Got too many builds running (4 out of 4 max), cancel some or wait for them to finish
  report-interface(failed): id:no-id-yet Got too many builds running (4 out of 4 max), cancel some or wait for them to finish
  report-review-import(skipped): id:no-build-needed Component re-uses build from report-interface
  report-send-whitelisted-to-review(skipped): id:no-build-needed Component re-uses build from report-interface

Interestingly, we don't cancel the builds when one fails to be created. That is handy, because it means re-running the deployment after the pending builds have finished works around this, but perhaps the more correct behaviour is to abort the builds when the deployment aborts early (this would also apply to queued builds).
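
To make the "abort the builds when the deployment aborts early" idea concrete, here is a minimal sketch. It is not the components-api code: `start_build` and `cancel_build` are hypothetical callables standing in for whatever client talks to builds-api.

```python
from typing import Callable, Dict, Iterable


def start_builds_or_cancel(
    components: Iterable[str],
    start_build: Callable[[str], str],    # hypothetical: returns a build id
    cancel_build: Callable[[str], None],  # hypothetical: cancels by build id
) -> Dict[str, str]:
    """Start one build per component; cancel the started ones if any fails."""
    started: Dict[str, str] = {}
    try:
        for name in components:
            started[name] = start_build(name)
    except Exception:
        # The deployment is about to abort: clean up the builds that were
        # already created so they do not keep occupying the queue.
        for build_id in started.values():
            cancel_build(build_id)
        raise
    return started
```

Note that cancelling like this would also take away the workaround of re-running the deployment, since a re-run could no longer reuse the pending builds.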

This task needs to be fleshed out a bit more regarding the options we have to implement this. Some early suggestions:

  • On the components-api side, when the build queue is full, retry with a delay until a spot frees up.
  • On builds-api, extend the queue limits.
  • On builds-api, use a queue system (rabbitmq, etc.) with workers to do the queued builds.
  • On builds-api, use a built-in queue system (e.g. using CRDs on k8s) with workers to do the queued builds (see https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/143).
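
As a rough illustration of the "queue with workers" options above, the sketch below accepts any number of builds and runs at most `max_parallel` of them at a time, so the extra ones wait instead of being rejected with a 409. It is only an in-process stand-in under stated assumptions: `run_build` is a hypothetical callable, and a real implementation would need a persistent queue (rabbitmq, CRDs, ...) rather than a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def run_builds_queued(
    components: Iterable[str],
    run_build: Callable[[str], str],  # hypothetical: runs one build, returns its id
    max_parallel: int = 4,            # matches the "4 out of 4 max" in the error
) -> Dict[str, str]:
    """Run every build, never more than max_parallel at once.

    Builds beyond the limit wait in the executor's internal queue
    until a worker is free.
    """
    names = list(components)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map() submits everything up front; results come back in input order.
        results = pool.map(run_build, names)
        return dict(zip(names, results))
```

With this shape the caller never sees a "queue full" error; the cost is that a deployment can sit waiting for as long as the builds ahead of it take.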

dcaro changed the task status from Stalled to In Progress. Mar 3 2026, 3:19 PM

In the last few weeks this has caused an outage on ClueBot NG twice (as dependency updates happen).

The latest was this morning, when the container got restarted and then could not pull the image (which was trashed as part of the failed build).

tools.cluebotng@tools-bastion-15:~$ kubectl get pods
NAME                               READY   STATUS             RESTARTS        AGE
backup-database-29628885-jwjxm     0/1     ImagePullBackOff   0               40h
bot-b8dc6cbbd-cggsf                0/1     ImagePullBackOff   1 (4h32m ago)   10d
core-69df6577f4-7k9hl              1/1     Running            0               10d
grafana-alloy-57576dfb47-4dxt7     1/1     Running            0               10d
irc-relay-77449dc8f9-xx6v8         1/1     Running            0               9d
prune-backups-29629770-zrqkp       0/1     ImagePullBackOff   0               25h
pushgateway-5b5bf95974-bdvtn       1/1     Running            0               10d
redis-659bf7b88d-8qmfw             1/1     Running            0               10d
report-interface-9bf7b88d5-lgg4v   1/1     Running            0               10d
report-interface-9bf7b88d5-x47th   1/1     Running            0               10d
test-58f9df5bc7-7pwzs              1/1     Running            0               10d
tools.cluebotng@tools-bastion-15:~$ toolforge components deployment show
Deployment ID: 20260502-125637-v8it3e8dqr
Created: 20260502-125637
Status: failed
Long status: 
  Got exception: Some builds failed to start: irc-relay(error:409 Client Error: Conflict for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/builds/v1/tool/cluebotng/builds)

Builds:
  backup-database(pending): id:cluebotng-buildpacks-pipelinerun-pmfjk Not started yet
  bot(pending): id:cluebotng-buildpacks-pipelinerun-x6lzb Not started yet
  core(pending): id:cluebotng-buildpacks-pipelinerun-bltnw Not started yet
  grafana-alloy(pending): id:cluebotng-buildpacks-pipelinerun-c64cm Not started yet
  irc-relay(failed): id:no-id-yet Got too many builds running (4 out of 4 max), cancel some or wait for them to finish
  prune-backups(skipped): id:no-build-needed Component re-uses build from backup-database
  pushgateway(skipped): id:cluebotng-buildpacks-pipelinerun-g4ptv Reusing existing build
  redis(skipped): id:cluebotng-buildpacks-pipelinerun-ld457 Reusing existing build
  report-interface(skipped): id:cluebotng-buildpacks-pipelinerun-thqdk Reusing existing build
  report-review-import(skipped): id:no-build-needed Component re-uses build from report-interface
  report-send-whitelisted-to-review(skipped): id:no-build-needed Component re-uses build from report-interface

The usual workaround fixes it, but it takes a bunch of maintainer work that shouldn't be necessary:

tools.cluebotng@tools-bastion-15:~$ toolforge components deployment show
Deployment ID: 20260504-065111-w6j3gaucuf
Created: 20260504-065111
Status: successful
Long status: 
  Finished at 2026-05-04 06:53:00.851009

Builds:
  backup-database(skipped): id:cluebotng-buildpacks-pipelinerun-pmfjk Reusing existing build
  bot(skipped): id:cluebotng-buildpacks-pipelinerun-x6lzb Reusing existing build
  core(skipped): id:cluebotng-buildpacks-pipelinerun-bltnw Reusing existing build
  grafana-alloy(skipped): id:cluebotng-buildpacks-pipelinerun-c64cm Reusing existing build
  irc-relay(successful): id:cluebotng-buildpacks-pipelinerun-s6b2q You can see the logs with `toolforge build logs cluebotng-buildpacks-pipelinerun-s6b2q`
  prune-backups(skipped): id:no-build-needed Component re-uses build from backup-database
  pushgateway(successful): id:cluebotng-buildpacks-pipelinerun-5798z You can see the logs with `toolforge build logs cluebotng-buildpacks-pipelinerun-5798z`
  redis(successful): id:cluebotng-buildpacks-pipelinerun-gx6dz You can see the logs with `toolforge build logs cluebotng-buildpacks-pipelinerun-gx6dz`
  report-interface(successful): id:cluebotng-buildpacks-pipelinerun-4g4fm You can see the logs with `toolforge build logs cluebotng-buildpacks-pipelinerun-4g4fm`
  report-review-import(skipped): id:no-build-needed Component re-uses build from report-interface
  report-send-whitelisted-to-review(skipped): id:no-build-needed Component re-uses build from report-interface

I've added some retry logic to our deployment script, which unfortunately increases the general load on tooling: cluebotng takes 3 deployment attempts before all the images are built.
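
For reference, client-side retry logic of this kind might look roughly like the sketch below. This is not the actual cluebotng script: `run_deployment` is a hypothetical callable that triggers one deployment, waits for it, and returns the status string shown by `toolforge components deployment show` ("successful" or "failed"). The attempt limit and delay are made-up defaults.

```python
import time
from typing import Callable


def deploy_until_successful(
    run_deployment: Callable[[], str],  # hypothetical: returns the final status
    max_attempts: int = 5,
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Re-run the deployment until it reports "successful".

    Returns the number of attempts used; raises RuntimeError if the
    deployment is still failing after max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        if run_deployment() == "successful":
            return attempt
        if attempt < max_attempts:
            # Give the pending builds time to finish; the next attempt can
            # then reuse them and only start the builds that were rejected.
            sleep(delay_seconds)
    raise RuntimeError(f"deployment still failing after {max_attempts} attempts")
```

Each extra attempt is a full deployment request, which is where the added load on the tooling comes from.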

raymond-ndibe closed https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/143

[build, api] support build queueing beyond max_parallel build config

raymond-ndibe closed https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/131

[deploy_task, tool_handlers] queue deployments to allow creation of multiple deployments at once