Feature summary (what you would like to be able to do and where):
Specify the interval, timeout, rise and fall for a continuous job's health check.
Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution):
With a deployed job:
tools.cluebot3@tools-bastion-13:~$ toolforge jobs dump
- command: run-bot
  continuous: true
  cpu: '1.0'
  health-check-script: health-check
  image: tool-cluebot3/reviewer:latest
  mem: 1.0Gi
  name: cluebot3
  replicas: 1
A liveness probe is configured in Kubernetes, based on health-check-script:
tools.cluebot3@tools-bastion-13:~$ kubectl get pod cluebot3-795c89584-hzctd -o json | jq '.spec.containers[0].livenessProbe'
{
  "exec": {
    "command": [
      "/bin/sh",
      "-c",
      "health-check"
    ]
  },
  "failureThreshold": 3,
  "periodSeconds": 10,
  "successThreshold": 1,
  "timeoutSeconds": 5
}
The settings associated with the command (threshold, period, timeout) are not user configurable.
The health check logic queries data from enwiki's API, which apparently sometimes takes longer than the 5-second timeout to respond on 3 consecutive checks.
This causes the health check to time out and the bot to be restarted:
tools.cluebot3@tools-bastion-13:~$ kubectl events
LAST SEEN               TYPE      REASON      OBJECT                         MESSAGE
59m (x637 over 7d15h)   Warning   Unhealthy   Pod/cluebot3-795c89584-hzctd   Liveness probe failed: command "/bin/sh -c health-check" timed out
It would be perfectly adequate to run this check at an hourly interval with a timeout of at least one minute.
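As a rough illustration of the request, the sketch below maps the four requested settings onto the livenessProbe fields shown above. The `HealthCheckSettings` class and its field names are hypothetical and are not an existing Toolforge option; the defaults are taken from the probe dump. One caveat: Kubernetes requires successThreshold to be 1 for liveness probes, so "rise" could not actually be tuned without a different probe type. The restart estimate assumes every probe times out and is only an approximation of kubelet behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheckSettings:
    """Hypothetical user-facing settings (not an existing Toolforge option).

    Defaults mirror the values currently seen in the generated probe.
    """
    interval_seconds: int = 10
    timeout_seconds: int = 5
    rise: int = 1
    fall: int = 3


def to_liveness_probe(script: str, settings: HealthCheckSettings) -> dict:
    """Build a livenessProbe mapping in the same shape as the kubectl output."""
    if settings.rise != 1:
        # Kubernetes only accepts successThreshold == 1 for liveness probes.
        raise ValueError("rise must be 1 for a liveness probe")
    return {
        "exec": {"command": ["/bin/sh", "-c", script]},
        "failureThreshold": settings.fall,
        "periodSeconds": settings.interval_seconds,
        "successThreshold": settings.rise,
        "timeoutSeconds": settings.timeout_seconds,
    }


def seconds_until_restart(settings: HealthCheckSettings) -> int:
    """Approximate time from the start of the first failing probe until the
    failure threshold is reached, assuming every probe runs into the timeout."""
    return (settings.fall - 1) * settings.interval_seconds + settings.timeout_seconds


current = HealthCheckSettings()
requested = HealthCheckSettings(interval_seconds=3600, timeout_seconds=60)

print(seconds_until_restart(current))    # 25
print(seconds_until_restart(requested))  # 7260
```

With the current defaults, a slow API only needs to persist for roughly 25 seconds to trigger a restart; with an hourly interval and a one-minute timeout, the probe would run far less often and tolerate much slower responses.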
Benefits (why should this be implemented?):
- Reduce resource usage within Toolforge
- Correctly apply health checking based on the application behaviour