
[components-api] failing deployment 422 from jobs-api
Closed, Resolved | Public | BUG REPORT

Description

The previous build failed because of the Harbor quota, which left the jobs stuck.

After clearing out the builds and jobs, a new deployment fails when creating the jobs.

tools.cluebotng-trainer@tools-bastion-15:~$ toolforge components deployment show
Deployment ID: 20260408-220715-f39kf7ufyt
Created: 20260408-220715
Status: failed
Long status: 
  Got exception: Failed run for component grafana-alloy: 422 Client Error: Unprocessable Entity for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/jobs/v1/tool/cluebotng-trainer/jobs/ (422): no details

Builds:
  coordinator(successful): id:cluebotng-trainer-buildpacks-pipelinerun-6kxpf You can see the logs with `toolforge build logs cluebotng-trainer-buildpacks-pipelinerun-6kxpf`
  file-api(successful): id:cluebotng-trainer-buildpacks-pipelinerun-fpb2w You can see the logs with `toolforge build logs cluebotng-trainer-buildpacks-pipelinerun-fpb2w`
  grafana-alloy(successful): id:cluebotng-trainer-buildpacks-pipelinerun-zqzvk You can see the logs with `toolforge build logs cluebotng-trainer-buildpacks-pipelinerun-zqzvk`
  pushgateway(successful): id:cluebotng-trainer-buildpacks-pipelinerun-k8d79 You can see the logs with `toolforge build logs cluebotng-trainer-buildpacks-pipelinerun-k8d79`

Runs:
  coordinator(successful): created or updated job coordinator, [info](Job coordinator was updated in storage and runtime)
  file-api(successful): created or updated job file-api, [info](Job file-api was updated in storage and runtime)
  grafana-alloy(failed): 422 Client Error: Unprocessable Entity for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/jobs/v1/tool/cluebotng-trainer/jobs/ (422): no details
  pushgateway(skipped): Skipped due to previous failure

Tool config:
  components:
    coordinator:
      build:
        ref: refs/tags/v1.4.16
        repository: https://github.com/cluebotng/trainer.git
        use_latest_versions: true
      run:
        command: launcher ./deployment/entrypoint.sh run-edit-sets
        mount: all
        schedule: 13 13 * * *
    file-api:
      build:
        ref: refs/tags/v1.2.4
        repository: https://github.com/cluebotng/fileserver.git
        use_latest_versions: true
      run:
        command: run-fileserver
        mount: all
        port: 8000
    grafana-alloy:
      build:
        ref: refs/tags/v0.4.8
        repository: https://github.com/cluebotng/external-grafana-alloy.git
        use_latest_versions: true
      run:
        command: run-alloy
        health_check_http: /health
        mount: all
        port: 8118
    pushgateway:
      build:
        ref: refs/tags/v0.0.4
        repository: https://github.com/cluebotng/external-pushgateway.git
        use_latest_versions: true
      run:
        command: run-pushgateway
        health_check_http: /-/healthy
        mount: all
        port: 9091
tools.cluebotng-trainer@tools-bastion-15:~$ toolforge components deployment show
Deployment ID: 20260408-221003-y37iloyuxw
Created: 20260408-221003
Status: failed
Long status: 
  Got exception: Failed run for component grafana-alloy: 422 Client Error: Unprocessable Entity for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/jobs/v1/tool/cluebotng-trainer/jobs/ (422): no details

Builds:
  coordinator(skipped): id:cluebotng-trainer-buildpacks-pipelinerun-6kxpf Reusing existing build
  file-api(skipped): id:cluebotng-trainer-buildpacks-pipelinerun-fpb2w Reusing existing build
  grafana-alloy(skipped): id:cluebotng-trainer-buildpacks-pipelinerun-zqzvk Reusing existing build
  pushgateway(skipped): id:cluebotng-trainer-buildpacks-pipelinerun-k8d79 Reusing existing build

Runs:
  coordinator(successful): job coordinator is already up to date, [info](Job coordinator is already up to date)
  file-api(successful): job file-api is already up to date, [info](Job file-api is already up to date)
  grafana-alloy(failed): 422 Client Error: Unprocessable Entity for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/jobs/v1/tool/cluebotng-trainer/jobs/ (422): no details
  pushgateway(skipped): Skipped due to previous failure

Tool config:
  components:
    coordinator:
      build:
        ref: refs/tags/v1.4.16
        repository: https://github.com/cluebotng/trainer.git
        use_latest_versions: true
      run:
        command: launcher ./deployment/entrypoint.sh run-edit-sets
        mount: all
        schedule: 13 13 * * *
    file-api:
      build:
        ref: refs/tags/v1.2.4
        repository: https://github.com/cluebotng/fileserver.git
        use_latest_versions: true
      run:
        command: run-fileserver
        mount: all
        port: 8000
    grafana-alloy:
      build:
        ref: refs/tags/v0.4.8
        repository: https://github.com/cluebotng/external-grafana-alloy.git
        use_latest_versions: true
      run:
        command: run-alloy
        health_check_http: /health
        mount: all
        port: 8118
    pushgateway:
      build:
        ref: refs/tags/v0.0.4
        repository: https://github.com/cluebotng/external-pushgateway.git
        use_latest_versions: true
      run:
        command: run-pushgateway
        health_check_http: /-/healthy
        mount: all
        port: 9091

Trying with the previous config (file-api was changed to try and fix the 422, but then grafana started failing)

tools.cluebotng-trainer@tools-bastion-15:~$ toolforge components deployment show
Deployment ID: 20260408-221101-b5es1o04q3
Created: 20260408-221101
Status: failed
Long status: 
  Got exception: Failed run for component file-api: 422 Client Error: Unprocessable Entity for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/jobs/v1/tool/cluebotng-trainer/jobs/ (422): no details

Builds:
  coordinator(skipped): id:cluebotng-trainer-buildpacks-pipelinerun-6kxpf Reusing existing build
  file-api(skipped): id:cluebotng-trainer-buildpacks-pipelinerun-fpb2w Reusing existing build
  grafana-alloy(skipped): id:cluebotng-trainer-buildpacks-pipelinerun-zqzvk Reusing existing build
  pushgateway(skipped): id:cluebotng-trainer-buildpacks-pipelinerun-k8d79 Reusing existing build

Runs:
  coordinator(successful): job coordinator is already up to date, [info](Job coordinator is already up to date)
  file-api(failed): 422 Client Error: Unprocessable Entity for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/jobs/v1/tool/cluebotng-trainer/jobs/ (422): no details
  grafana-alloy(skipped): Skipped due to previous failure
  pushgateway(skipped): Skipped due to previous failure

Tool config:
  components:
    coordinator:
      build:
        ref: refs/tags/v1.4.16
        repository: https://github.com/cluebotng/trainer.git
        use_latest_versions: true
      run:
        command: launcher ./deployment/entrypoint.sh run-edit-sets
        mount: all
        schedule: 13 13 * * *
    file-api:
      build:
        ref: refs/tags/v1.2.4
        repository: https://github.com/cluebotng/fileserver.git
        use_latest_versions: true
      run:
        command: run-fileserver
        health_check_http: /_/health
        mount: all
        port: 8000
        replicas: 2
    grafana-alloy:
      build:
        ref: refs/tags/v0.4.8
        repository: https://github.com/cluebotng/external-grafana-alloy.git
        use_latest_versions: true
      run:
        command: run-alloy
        health_check_http: /health
        mount: all
        port: 8118
    pushgateway:
      build:
        ref: refs/tags/v0.0.4
        repository: https://github.com/cluebotng/external-pushgateway.git
        use_latest_versions: true
      run:
        command: run-pushgateway
        health_check_http: /-/healthy
        mount: all
        port: 9091

Jobs are missing as expected

tools.cluebotng-trainer@tools-bastion-15:~$ toolforge jobs list
+-------------+-----------------------+----------------------------+
|  Job name:  |       Job type:       |          Status:           |
+-------------+-----------------------+----------------------------+
| coordinator | schedule: 13 13 * * * | Waiting for scheduled time |
|  file-api   |      continuous       |          Running           |
+-------------+-----------------------+----------------------------+

Someone needs to check the logs to see why this is failing; nothing has changed aside from a new image build.

Details

Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
openapi - fix health check discriminator | repos/cloud/toolforge/jobs-api!285 | damian | feature/fix-http-health-check-type-mapping | main

Event Timeline

dcaro subscribed.

I'm able to reproduce in lima-kilo with:

$ toolforge components config generate | toolforge components config create
$ toolforge components deployment create
# ... wait for it to finish
$ toolforge build delete --all
$ toolforge build clean
$ toolforge components deployment create

The jobs-api log does not really give any extra info:

INFO:     127.0.0.1:42416 - "PATCH /v1/tool/tf-test2/jobs/ HTTP/1.0" 422 Unprocessable Entity

Might have to add some extra logging
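
As a rough illustration of the kind of detail that could be surfaced instead of "no details", here is a minimal sketch of a helper that summarizes a 422 response body. It assumes the body follows the FastAPI-style `{"detail": [{"loc": [...], "msg": "..."}]}` shape, which is an assumption and not confirmed for jobs-api; the function name `summarize_validation_error`, the `health_check` location and the sample body are hypothetical.

```python
import json


def summarize_validation_error(body: str) -> str:
    """Return a one-line summary of a 422 response body.

    Assumes (not confirmed for jobs-api) a FastAPI-style body:
    {"detail": [{"loc": [...], "msg": "..."}]}.
    Falls back to the raw body, or "no details", for any other shape.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return body.strip() or "no details"

    details = payload.get("detail") if isinstance(payload, dict) else None
    if not isinstance(details, list):
        return body.strip() or "no details"

    parts = []
    for item in details:
        if not isinstance(item, dict):
            continue
        location = ".".join(str(piece) for piece in item.get("loc", []))
        message = item.get("msg", "unknown error")
        parts.append(f"{location}: {message}" if location else message)
    return "; ".join(parts) or "no details"


# Hypothetical body, only shaped like a typical validation error.
example_body = json.dumps(
    {
        "detail": [
            {
                "loc": ["body", "health_check"],
                "msg": "Input tag 'path' does not match any of the expected tags",
            }
        ]
    }
)
print(summarize_validation_error(example_body))
```

Logging a summary like this next to the 422 status would make the failing field visible without having to replay the request by hand.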

Found something using restish to try to patch the same way components did:

msg: "Input tag 'path' found using 'health_check_type' | 'type' does not match any of the expected tags: <HealthCheckType.SCRIPT: 'script'>, <HealthCheckType.HTTP: 'http'>"

I think that we are sending the wrong data for the healthcheck

Hmmm... I think that there might be some issue when generating the models from the toolforge openapi spec on components-api, as it generated this:

class JobsHttpHealthCheck(BaseModel):
    type: Literal["path"]
    path: str

Whereas the OpenAPI spec says:

dcaro@acme$ jq '.components.schemas.jobsHttpHealthCheck' openapi.json 
{
  "type": "object",
  "description": "Health check implemented as an HTTP request inside the container",
  "required": [
    "type",
    "path"
  ],
  "properties": {
    "type": {
      "$ref": "#/components/schemas/jobsHealthCheckType"
    },
    "path": {
      "type": "string"
    }
  }
}

dcaro@acme$ jq '.components.schemas.jobsHealthCheckType' openapi.json 
{
  "type": "string",
  "enum": [
    "script",
    "http"
  ]
}

So yep, `type` should be one of the enum values ('script' or 'http'), not the literal 'path'.
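
For comparison, a minimal standard-library sketch of a model that matches the two schemas quoted from `openapi.json`. This is not the generated components-api code; the class names `HealthCheckType` and `HttpHealthCheck` and the helper methods are illustrative assumptions. It only shows that an HTTP health check should carry the tag `http`, and that a payload tagged `path` is rejected.

```python
from dataclasses import dataclass
from enum import Enum


class HealthCheckType(str, Enum):
    """Mirrors the jobsHealthCheckType enum from the spec."""

    SCRIPT = "script"
    HTTP = "http"


@dataclass(frozen=True)
class HttpHealthCheck:
    """Mirrors jobsHttpHealthCheck: the payload always has `type` and `path`."""

    path: str
    # Defaulting the tag is a convenience of this sketch; the spec lists it as required.
    type: HealthCheckType = HealthCheckType.HTTP

    def to_payload(self) -> dict:
        """Serialize to the shape described by the spec."""
        return {"type": self.type.value, "path": self.path}

    @classmethod
    def from_payload(cls, payload: dict) -> "HttpHealthCheck":
        """Parse a payload, raising ValueError for an unknown or wrong tag."""
        # HealthCheckType(...) raises ValueError for values such as "path".
        tag = HealthCheckType(payload["type"])
        if tag is not HealthCheckType.HTTP:
            raise ValueError(f"expected an http health check, got {tag.value!r}")
        return cls(path=payload["path"], type=tag)


print(HttpHealthCheck(path="/health").to_payload())
```

Calling `HttpHealthCheck.from_payload({"type": "path", "path": "/health"})` raises `ValueError`, which is the same class of mismatch the jobs API reported.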

I was going to try to reproduce this on staging (removing the health check), but it now seems to be stuck waiting for builds to start (20260409-180511-p2a131uce6) =\

Actually the builds did run, but the deployment status never updated from 'pending'. Previously, the started builds would get their status updated even if one failed before they finished (parallel quota hit); perhaps that was just unintended behaviour.

Just as an experiment, I tried re-deploying another tool:

tools.cluebotng-staging@tools-bastion-15:~$ toolforge components deployment show
Deployment ID: 20260327-174440-uvst3iae15
Created: 20260327-174440
Status: successful
Long status:
  Finished at 2026-03-27 17:45:53.015566

Builds:
  bot(skipped): id:cluebotng-staging-buildpacks-pipelinerun-ntwth Reusing existing build
  core(successful): id:cluebotng-staging-buildpacks-pipelinerun-hjhwk You can see the logs with `toolforge build logs cluebotng-staging-buildpacks-pipelinerun-hjhwk`
  grafana-alloy(skipped): id:cluebotng-staging-buildpacks-pipelinerun-ns565 Reusing existing build
  redis(successful): id:cluebotng-staging-buildpacks-pipelinerun-48bkg You can see the logs with `toolforge build logs cluebotng-staging-buildpacks-pipelinerun-48bkg`
  report-interface(successful): id:cluebotng-staging-buildpacks-pipelinerun-472v2 You can see the logs with `toolforge build logs cluebotng-staging-buildpacks-pipelinerun-472v2`

Runs:
  bot(successful): job bot is already up to date, [info](Job bot is already up to date)
  core(successful): created or updated job core, [info](Job core was updated in storage and runtime)
  grafana-alloy(successful): job grafana-alloy is already up to date, [info](Job grafana-alloy is already up to date)
  redis(successful): created or updated job redis, [info](Job redis was updated in storage and runtime)
  report-interface(successful): created or updated job report-interface, [info](Job report-interface was updated in storage and runtime)

Tool config:
  components:
    bot:
      build:
        ref: refs/tags/v0.0.40
        repository: https://github.com/cluebotng/botng.git
        use_latest_versions: true
      run:
        command: botng --irc-relay --debug
        health_check_http: /health
        port: 8118
    core:
      build:
        ref: refs/tags/v0.0.10
        repository: https://github.com/cluebotng/external-core.git
        use_latest_versions: true
      run:
        command: ./cluebotng -l -m live_run
        port: 3565
        replicas: 2
    grafana-alloy:
      build:
        ref: refs/tags/v0.4.8
        repository: https://github.com/cluebotng/external-grafana-alloy.git
        use_latest_versions: true
      run:
        command: run-alloy
        health_check_http: /health
        mount: all
        port: 8118
    redis:
      build:
        ref: refs/tags/v1.0.6
        repository: https://github.com/cluebotng/external-redis.git
        use_latest_versions: true
      run:
        command: redis-server
        port: 6379
    report-interface:
      build:
        ref: refs/tags/v1.9.1
        repository: https://github.com/cluebotng/report.git
        use_latest_versions: true
      run:
        command: web
        health_check_http: /api/?action=health.check
        port: 8000
        replicas: 2
tools.cluebotng-staging@tools-bastion-15:~$ toolforge jobs dump
WARNING: unexpected key 'job_type' in job, omitting it from dump
WARNING: unexpected key 'job_type' in job, omitting it from dump
WARNING: unexpected key 'job_type' in job, omitting it from dump
WARNING: unexpected key 'job_type' in job, omitting it from dump
WARNING: unexpected key 'job_type' in job, omitting it from dump
- command: botng --irc-relay --debug
  continuous: true
  health-check-http: /health
  image: tool-cluebotng-staging/bot:latest@sha256:4123d094a4b42962404a4fba518c8bf2ab64ac1adaff57bfdaa8a2a6060621a0
  name: bot
  port: '8118'
- command: ./cluebotng -l -m live_run
  continuous: true
  image: tool-cluebotng-staging/core:latest@sha256:ab39ea540e3c65f90d9be86ae7e072dd536bdca13c55d6704c628dda9689aafc
  name: core
  port: '3565'
  replicas: 2
- command: run-alloy
  continuous: true
  health-check-http: /health
  image: tool-cluebotng-staging/grafana-alloy:latest@sha256:ce7ee26d7e1eb422059a21450c3c4e0dabc3de64b22cb3667a5c3706470551a7
  mount: all
  name: grafana-alloy
  port: '8118'
- command: redis-server
  continuous: true
  image: tool-cluebotng-staging/redis:latest@sha256:1b6b972d6baad6f7dcfc4997272e484e8b5eaba96eaf166cc9f6ff6eff21d4bb
  name: redis
  port: '6379'
- command: web
  continuous: true
  health-check-http: /api/?action=health.check
  image: tool-cluebotng-staging/report-interface:latest@sha256:c4fb7bd25058b230cfdb6ee68c0a30b50c803a2a324bf22e9b53f47cc5cbbfbf
  name: report-interface
  port: '8000'
  replicas: 2
tools.cluebotng-staging@tools-bastion-15:~$ toolforge jobs flush
tools.cluebotng-staging@tools-bastion-15:~$ toolforge components deployment create
Deployment for cluebotng-staging created successfully.
Deployment ID: 20260409-180511-p2a131uce6
Created: 20260409-180511

Builds:
  bot(pending): no-id-yet
  core(pending): no-id-yet
  grafana-alloy(pending): no-id-yet
  redis(pending): no-id-yet
  report-interface(pending): no-id-yet

Runs:
  bot(pending):
  core(pending):
  grafana-alloy(pending):
  redis(pending):
  report-interface(pending):

Results in a failure as expected:

tools.cluebotng-staging@tools-bastion-15:~$ toolforge components deployment show
Deployment ID: 20260409-180924-8xaiykgmii
Created: 20260409-180924
Status: failed
Long status:
  Got exception: Failed run for component bot: 422 Client Error: Unprocessable Entity for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/jobs/v1/tool/cluebotng-staging/jobs/ (422): no details

Builds:
  bot(skipped): id:cluebotng-staging-buildpacks-pipelinerun-h2zhl Reusing existing build
  core(skipped): id:cluebotng-staging-buildpacks-pipelinerun-ww7dk Reusing existing build
  grafana-alloy(skipped): id:cluebotng-staging-buildpacks-pipelinerun-sxtpf Reusing existing build
  redis(skipped): id:cluebotng-staging-buildpacks-pipelinerun-4qdlw Reusing existing build
  report-interface(successful): id:cluebotng-staging-buildpacks-pipelinerun-cb7l4 You can see the logs with `toolforge build logs cluebotng-staging-buildpacks-pipelinerun-cb7l4`

Runs:
  bot(failed): 422 Client Error: Unprocessable Entity for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/jobs/v1/tool/cluebotng-staging/jobs/ (422): no details
  core(skipped): Skipped due to previous failure
  grafana-alloy(skipped): Skipped due to previous failure
  redis(skipped): Skipped due to previous failure
  report-interface(skipped): Skipped due to previous failure

Tool config:
  components:
    bot:
      build:
        ref: refs/tags/v0.0.40
        repository: https://github.com/cluebotng/botng.git
        use_latest_versions: true
      run:
        command: botng --irc-relay --debug
        health_check_http: /health
        port: 8118
    core:
      build:
        ref: refs/tags/v0.0.10
        repository: https://github.com/cluebotng/external-core.git
        use_latest_versions: true
      run:
        command: ./cluebotng -l -m live_run
        port: 3565
        replicas: 2
    grafana-alloy:
      build:
        ref: refs/tags/v0.4.8
        repository: https://github.com/cluebotng/external-grafana-alloy.git
        use_latest_versions: true
      run:
        command: run-alloy
        health_check_http: /health
        mount: all
        port: 8118
    redis:
      build:
        ref: refs/tags/v1.0.6
        repository: https://github.com/cluebotng/external-redis.git
        use_latest_versions: true
      run:
        command: redis-server
        port: 6379
    report-interface:
      build:
        ref: refs/tags/v1.9.1
        repository: https://github.com/cluebotng/report.git
        use_latest_versions: true
      run:
        command: web
        health_check_http: /api/?action=health.check
        port: 8000
        replicas: 2

Removing the health check creates a successful deployment:

tools.cluebotng-staging@tools-bastion-15:~$ toolforge components deployment show
Deployment ID: 20260409-181209-ig4hd1w23k
Created: 20260409-181209
Status: successful
Long status: 
  Finished at 2026-04-09 18:12:26.926490

Builds:
  bot(skipped): id:cluebotng-staging-buildpacks-pipelinerun-h2zhl Reusing existing build
  core(skipped): id:cluebotng-staging-buildpacks-pipelinerun-ww7dk Reusing existing build
  grafana-alloy(skipped): id:cluebotng-staging-buildpacks-pipelinerun-sxtpf Reusing existing build
  redis(skipped): id:cluebotng-staging-buildpacks-pipelinerun-4qdlw Reusing existing build
  report-interface(skipped): id:cluebotng-staging-buildpacks-pipelinerun-cb7l4 Reusing existing build

Runs:
  bot(successful): created or updated job bot, [info](Job bot created in storage and runtime)
  core(successful): created or updated job core, [info](Job core created in storage and runtime)
  grafana-alloy(successful): created or updated job grafana-alloy, [info](Job grafana-alloy created in storage and runtime)
  redis(successful): created or updated job redis, [info](Job redis created in storage and runtime)
  report-interface(successful): created or updated job report-interface, [info](Job report-interface created in storage and runtime)

Tool config:
  components:
    bot:
      build:
        ref: refs/tags/v0.0.40
        repository: https://github.com/cluebotng/botng.git
        use_latest_versions: true
      run:
        command: botng --irc-relay --debug
        port: 8118
    core:
      build:
        ref: refs/tags/v0.0.10
        repository: https://github.com/cluebotng/external-core.git
        use_latest_versions: true
      run:
        command: ./cluebotng -l -m live_run
        port: 3565
        replicas: 2
    grafana-alloy:
      build:
        ref: refs/tags/v0.4.8
        repository: https://github.com/cluebotng/external-grafana-alloy.git
        use_latest_versions: true
      run:
        command: run-alloy
        mount: all
        port: 8118
    redis:
      build:
        ref: refs/tags/v1.0.6
        repository: https://github.com/cluebotng/external-redis.git
        use_latest_versions: true
      run:
        command: redis-server
        port: 6379
    report-interface:
      build:
        ref: refs/tags/v1.9.1
        repository: https://github.com/cluebotng/report.git
        use_latest_versions: true
      run:
        command: web
        port: 8000
        replicas: 2

So this does impact existing deployments, effectively blocking new releases (and breaking the service if the in-use build has expired).

Disabled http health checking in cluebot stuff (https://github.com/cluebotng/component-configs/commit/a97bfe791582e24f1c696f1bd89b965ea233c253) until this is fixed.
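
A minimal sketch of that workaround applied to a config in the shape shown by `toolforge components deployment show`. The linked commit may well have done this differently; the helper name `strip_http_health_checks` is hypothetical, and the sample config is a trimmed copy of the staging one above.

```python
import copy


def strip_http_health_checks(config: dict) -> dict:
    """Return a copy of the config with every run.health_check_http removed."""
    stripped = copy.deepcopy(config)
    for component in stripped.get("components", {}).values():
        component.get("run", {}).pop("health_check_http", None)
    return stripped


# Trimmed copy of the staging tool config from this task.
config = {
    "components": {
        "bot": {
            "build": {
                "ref": "refs/tags/v0.0.40",
                "repository": "https://github.com/cluebotng/botng.git",
                "use_latest_versions": True,
            },
            "run": {
                "command": "botng --irc-relay --debug",
                "health_check_http": "/health",
                "port": 8118,
            },
        },
        "redis": {
            "build": {
                "ref": "refs/tags/v1.0.6",
                "repository": "https://github.com/cluebotng/external-redis.git",
                "use_latest_versions": True,
            },
            "run": {"command": "redis-server", "port": 6379},
        },
    }
}

stripped = strip_http_health_checks(config)
print(stripped["components"]["bot"]["run"])
```

The original config is left untouched, so the health checks can be restored once the fix is deployed.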

group_203_bot_f4d95069bb2675e4ce1fff090c1c1620 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1220

jobs-api: bump to 0.0.483-20260410112303-b93e57f2

DamianZaremba claimed this task.
DamianZaremba added a subscriber: Raymond_Ndibe.

Confirming my tools are now working (with http health checking) as expected.

Thanks @dcaro and @Raymond_Ndibe for sorting this quickly.