[jobs-api,jobs-cli] Support job health checks
Closed, Resolved / Public / Feature

Description

Sometimes jobs get stuck or stall for no particular reason, and Kubernetes may not notice if there are no liveness checks.

We could explore how to introduce some simple (and optional) liveness check for jobs that developers can use in their jobs to prevent this.

Event Timeline

bd808 changed the subtype of this task from "Task" to "Feature Request".
bd808 added a subscriber: taavi.
taavi renamed this task from "Toolforge jobs: consider having a way for jobs to report their liveness status to kubernetes" to "Support job health checks". Feb 28 2024, 12:09 PM
Raymond_Ndibe changed the task status from Open to In Progress. Mar 4 2024, 5:30 PM
dcaro triaged this task as Medium priority. Mar 5 2024, 9:36 AM
dcaro moved this task from In Review to In Progress on the Toolforge (Toolforge iteration 06) board.
dcaro renamed this task from "Support job health checks" to "[jobs-api,jobs-cli] Support job health checks". Mar 11 2024, 11:47 AM
dcaro removed a project: Toolforge Jobs framework.

project_1317_bot_df3177307bed93c3f34e421e26c86e38 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/231

jobs-api: bump to 0.0.271-20240403154350-2940c48f

@Raymond_Ndibe I think this feature deserves a section on https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework and an email to cloud@ letting folks know it is possible now and any special things that they should look out for as they try to write their own check script.

Questions I have right now are where does the script actually end up running (inside the live Pod or elsewhere?) and can the script somehow see the state of things inside the live Pod including processes, envvars, and file system things?

Re-opening since I think documentation needs to be added to https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework for this to be considered complete.

Also, T348755: [jobs-api,webservice] Run webservices via the jobs framework will need an HTTP probe. Should I file a separate task for that?

👍

I think it's implicit, as in we have to support everything that is currently supported (including the http probe).
It might be implemented before though, when we get services in the continuous jobs.

hey, I just noticed the dump operation now produces invalid YAML that cannot be loaded back:

local.tf-test@lima-lima-kilo:~$ toolforge jobs dump
- command: ./test-cmd.sh
  continuous: true
  health_check: null
  image: bookworm
  name: test
  no-filelog: 'true'

Note the health_check: null entry. This needs fixing soon-ish.
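
Until the dump is fixed, one possible stopgap is to drop the offending line before loading the file back. This is only a sketch: the helper name strip_null_health_check is made up for illustration, and it assumes the entry always appears on its own line exactly as in the dump above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the toolforge CLI): read a jobs dump on
# stdin and write it to stdout without any "health_check: null" lines.
# Assumes the key sits on a line of its own, as in the dump shown above.
strip_null_health_check() {
    sed '/^[[:space:]]*health_check: null[[:space:]]*$/d'
}

# Possible usage (output file name is an example):
#   toolforge jobs dump | strip_null_health_check > jobs.yaml
```

Jobs that do define a health check are left untouched, since only the literal null entry is matched.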

It seems to be a problem only for jobs with a health check (the others just complain but get created anyhow).

Looks like a rebase issue (two branches merged one after the other without being properly rebased on top of each other).

dcaro changed the task status from Open to In Progress. Apr 8 2024, 12:33 PM
dcaro moved this task from Done to In Progress on the Toolforge (Toolforge iteration 08) board.

@bd808 the script is executed inside the pod. You can either provide an inline script (--health-check-script "echo this-is-a-script") or create a script file and pass --health-check-script ./script.sh (it goes without saying, but the script must be executable). In both cases the running script can see the state of things inside the pod (processes, envvars, etc.).
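
As a rough illustration of what such a check script could contain, here is a sketch built around a heartbeat file. Everything in it is an assumption for the example: the job is imagined to touch a file regularly, the helper name heartbeat_is_fresh and the path are invented, and stat -c assumes GNU coreutils (as on Debian-based images). The only general fact relied on is that a command-based check counts as healthy when it exits 0 and unhealthy otherwise.

```shell
#!/bin/sh
# heartbeat_is_fresh FILE MAX_AGE_SECONDS
# Hypothetical helper: succeeds (returns 0) only if FILE exists and was
# modified at most MAX_AGE_SECONDS ago. Assumes GNU stat (stat -c %Y).
heartbeat_is_fresh() {
    file=$1
    max_age=$2
    [ -f "$file" ] || return 1
    now=$(date +%s)
    mtime=$(stat -c %Y "$file") || return 1
    [ $((now - mtime)) -le "$max_age" ]
}

# In a real check script the last command would be the call itself, so that
# the script's exit status reports the result, for example:
#   heartbeat_is_fresh "$HOME/heartbeat" 120
```

The job would then need to touch the heartbeat file from its main loop; a job that stalls stops touching it, and the check starts failing.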

I think the idea is to wait until webservice is part of the jobs framework before adding TCP and HTTP health checks.

Thanks for these details @Raymond_Ndibe. If you use a script file, can that script file be part of the job's custom build service managed image, or does the file need to be readable by the toolforge jobs script on the bastion when the job is configured?

The script doesn't need to be readable by toolforge jobs. The only requirement is that when the kubelet runs the check inside the container, it finds a script with the provided name and the right permissions, so yes, being part of the image is enough.
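
A quick way to verify those two requirements from a shell inside the container (or while building the image) might look like the sketch below. The function name check_script_ready is invented for this example and is not part of any Toolforge tooling.

```shell
#!/bin/sh
# check_script_ready PATH
# Hypothetical pre-flight check: returns 0 if PATH is a regular file with
# execute permission, 1 if it is missing, 2 if it is not executable.
check_script_ready() {
    script=$1
    if [ ! -f "$script" ]; then
        echo "missing: $script" >&2
        return 1
    fi
    if [ ! -x "$script" ]; then
        echo "not executable: $script (try chmod +x)" >&2
        return 2
    fi
    return 0
}

# Possible usage, with the same relative path that is passed to
# --health-check-script:
#   check_script_ready ./script.sh
```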

I just updated https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework and sent out an announcement email to cloud-announce (though the email is waiting for moderator approval, judging by the feedback email I got).

I think we can mark this as resolved now @taavi

marking as resolved. We can open it again if anyone disagrees