
[jobs-api] Allow configuring health check timeout
Open, Medium, Public, Feature

Description

Feature summary (what you would like to be able to do and where):

Specify the interval, timeout, rise and fall for a continuous job's health check.

Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution):

With a deployed job:

tools.cluebot3@tools-bastion-13:~$ toolforge jobs dump
- command: run-bot
  continuous: true
  cpu: '1.0'
  health-check-script: health-check
  image: tool-cluebot3/reviewer:latest
  mem: 1.0Gi
  name: cluebot3
  replicas: 1

A liveness probe is configured in Kubernetes, based on health-check-script:

tools.cluebot3@tools-bastion-13:~$ kubectl get pod cluebot3-795c89584-hzctd -o json | jq '.spec.containers[0].livenessProbe'
{
  "exec": {
    "command": [
      "/bin/sh",
      "-c",
      "health-check"
    ]
  },
  "failureThreshold": 3,
  "periodSeconds": 10,
  "successThreshold": 1,
  "timeoutSeconds": 5
}

The settings associated with the command (threshold, period, timeout) are not user-configurable.
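
For reference, the probe values shown above can be turned into a rough time-to-restart figure. The following is only a sketch: the function name is hypothetical, and it assumes that every probe in the run hits its timeout and that the timeout is no longer than the period, so probes start exactly one period apart.

```python
def seconds_until_restart(period_seconds: int, timeout_seconds: int,
                          failure_threshold: int) -> int:
    """Rough time from the start of the first failing probe until the
    failure threshold is reached.

    Assumes each probe fails by timing out and timeout <= period.
    """
    if timeout_seconds > period_seconds:
        raise ValueError("this sketch assumes timeout_seconds <= period_seconds")
    # Probes start one period apart; the last one fails when its timeout elapses.
    return (failure_threshold - 1) * period_seconds + timeout_seconds


# Values taken from the livenessProbe output above.
print(seconds_until_restart(period_seconds=10, timeout_seconds=5,
                            failure_threshold=3))  # 25
```

Under those assumptions, roughly 25 seconds of slow responses from the health check is enough to trigger a restart.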

The health check logic queries data from enwiki's API, which apparently sometimes takes longer than 5 seconds to respond three times in a row (matching the failureThreshold of 3).

This causes the health check to time out and the bot to be restarted:

tools.cluebot3@tools-bastion-13:~$ kubectl events
LAST SEEN               TYPE      REASON      OBJECT                         MESSAGE
59m (x637 over 7d15h)   Warning   Unhealthy   Pod/cluebot3-795c89584-hzctd   Liveness probe failed: command "/bin/sh -c health-check" timed out
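
To put the repeat count in the event above into perspective, it can be converted into an approximate failure rate. This is a small sketch with hypothetical helper names. It assumes the "(xN over DURATION)" text has the form shown above, and that the count reflects individual probe failures rather than some aggregated figure.

```python
import re

UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def parse_repeat(last_seen: str) -> tuple[int, int]:
    """Return (count, window_seconds) from text like '59m (x637 over 7d15h)'."""
    match = re.search(r"\(x(\d+) over ((?:\d+[dhms])+)\)", last_seen)
    if match is None:
        raise ValueError(f"no repeat information in {last_seen!r}")
    count = int(match.group(1))
    # Sum each "<number><unit>" component of the duration, e.g. 7d + 15h.
    window = sum(int(number) * UNIT_SECONDS[unit]
                 for number, unit in re.findall(r"(\d+)([dhms])", match.group(2)))
    return count, window


count, window = parse_repeat("59m (x637 over 7d15h)")
per_hour = count / (window / 3600)
print(count, window, round(per_hour, 2))  # 637 658800 3.48
```

That works out to roughly three to four timed-out probes per hour over the 7d15h window.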

It would be perfectly adequate to run this check at an hourly interval with a timeout of at least one minute.
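
As an illustration of the request, the four knobs could map onto the Kubernetes probe fields as sketched below. This is not how jobs-api builds its probes. The function name and the parameter names (interval, timeout, rise, fall) are hypothetical, and the defaults simply mirror the values in the livenessProbe output above. One caveat: Kubernetes requires successThreshold to be 1 for liveness probes, so "rise" could not actually be changed for this probe type.

```python
def build_liveness_probe(script: str, interval: int = 10, timeout: int = 5,
                         fall: int = 3, rise: int = 1) -> dict:
    """Build an exec livenessProbe dict from user-facing settings (sketch)."""
    if rise != 1:
        # Kubernetes only accepts successThreshold == 1 for liveness probes.
        raise ValueError("rise must be 1 for a liveness probe")
    if min(interval, timeout, fall) < 1:
        raise ValueError("interval, timeout and fall must be at least 1")
    return {
        "exec": {"command": ["/bin/sh", "-c", script]},
        "failureThreshold": fall,
        "periodSeconds": interval,
        "successThreshold": rise,
        "timeoutSeconds": timeout,
    }


# The settings suggested above: hourly interval, one-minute timeout.
probe = build_liveness_probe("health-check", interval=3600, timeout=60)
print(probe["periodSeconds"], probe["timeoutSeconds"])  # 3600 60
```

Called with only the script name, the sketch reproduces the probe currently configured, so existing jobs would keep their behaviour unless they opt in to different values.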

Benefits (why should this be implemented?):

  • Reduce resource usage within Toolforge
  • Correctly apply health checking based on the application behaviour

Event Timeline

Restricted Application added a subscriber: Aklapper.

As a data point, this is also happening on ClueBot NG, where it is more impactful:

Events:
  Type     Reason     Age                      From     Message
  ----     ------     ----                     ----     -------
  Warning  Unhealthy  50m                      kubelet  Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "4393d9d217932f852b7a9a6e313461a7ff3014ebed6cd0ddfcfc2d7afca3b64e": OCI runtime exec failed: exec failed: unable to start container process: error executing setns process: exit status 1: unknown
  Normal   Pulled     38m (x22 over 6d10h)     kubelet  Successfully pulled image "tools-harbor.wmcloud.org/tool-cluebotng/bot:latest" in 248ms (248ms including waiting)
  Warning  Unhealthy  15m (x42 over 6d8h)      kubelet  Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: container is in CONTAINER_EXITED state
  Warning  Unhealthy  4m (x811 over 6d11h)     kubelet  Liveness probe failed: command "/bin/sh -c health-check" timed out
  Normal   Pulling    3m59s (x784 over 6d12h)  kubelet  Pulling image "tools-harbor.wmcloud.org/tool-cluebotng/bot:latest"

fnegri triaged this task as Medium priority. (Sep 5 2025, 8:37 AM)