Some bots on Toolforge, such as eranbot (which powers CopyPatrol), run for a long time and can get "stuck". Our solution, admittedly hacky, has been to set a maximum run time on the cron job with jsub … -l h_rt=4:05:00 -once …. This way the job is killed automatically after 4 hours and 5 minutes, and 5 minutes later it starts up again.
This feature appears to be missing in the new Toolforge jobs framework. When I asked on IRC, I was told that Kubernetes has an option for this (.spec.activeDeadlineSeconds), which could possibly be exposed in the wrapper API.
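To illustrate what exposing this could amount to, here is a minimal sketch that converts a grid-style h_rt value into seconds and places it in the activeDeadlineSeconds field of a plain Kubernetes batch/v1 Job manifest. It is not how the Toolforge jobs framework builds its jobs. The helper names, the job name, the image and the command are hypothetical, and the sketch assumes h_rt is always given as H:MM:SS.

```python
import json


def h_rt_to_seconds(h_rt: str) -> int:
    """Convert a grid-style run time limit (assumed H:MM:SS) to seconds."""
    hours, minutes, seconds = (int(part) for part in h_rt.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def job_manifest(name: str, image: str, command: list, h_rt: str) -> dict:
    """Build a batch/v1 Job manifest with a hard run time limit.

    Kubernetes terminates the Job's pods once the Job has been active
    for longer than spec.activeDeadlineSeconds.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "activeDeadlineSeconds": h_rt_to_seconds(h_rt),
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": command}
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical job name, image and command, for illustration only.
    manifest = job_manifest(
        "eranbot", "example/image:latest", ["./run-bot.sh"], "4:05:00"
    )
    print(json.dumps(manifest, indent=2))
```

With the limit from the jsub example above, 4:05:00 becomes an activeDeadlineSeconds of 14700.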
We also need something similar to the -once flag. Apparently the .spec.parallelism option is set to 1 by default, so this may effectively already be the case with the Toolforge jobs framework. However, some may wish to increase this number or to change the concurrency policy, so that may be worth implementing in the wrapper API as well.
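For reference, a rough sketch of where these settings live in a plain Kubernetes batch/v1 CronJob manifest. In Kubernetes, parallelism controls how many pods a single Job runs at once, while concurrencyPolicy on the CronJob (Allow, Forbid or Replace) controls whether a new run may start while the previous one is still active. Forbid is probably the closest match to -once, but that mapping is my assumption. The function name, schedule, job name, image and command are hypothetical, and this is not how the Toolforge jobs framework generates its objects.

```python
import json


def cronjob_manifest(
    name: str,
    image: str,
    command: list,
    schedule: str,
    concurrency_policy: str = "Forbid",
    parallelism: int = 1,
) -> dict:
    """Build a batch/v1 CronJob manifest that does not overlap its own runs."""
    if concurrency_policy not in ("Allow", "Forbid", "Replace"):
        raise ValueError(f"unknown concurrencyPolicy: {concurrency_policy}")
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            # Forbid: skip a scheduled run if the previous one is still active.
            "concurrencyPolicy": concurrency_policy,
            "jobTemplate": {
                "spec": {
                    # Number of pods a single run may execute in parallel.
                    "parallelism": parallelism,
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [
                                {"name": name, "image": image, "command": command}
                            ],
                        }
                    },
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    manifest = cronjob_manifest(
        "eranbot", "example/image:latest", ["./run-bot.sh"], "*/10 * * * *"
    )
    print(json.dumps(manifest, indent=2))
```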
See T376099: --timeout flag for mwscript-k8s for a similar feature request in a different Kubernetes environment. There, a timeout option using activeDeadlineSeconds in Kubernetes was implemented. An equivalent option is currently missing from Toolforge job submission.
I used to rely on this a lot on the grid to have stuck jobs killed automatically. Having this option in job submission would make that possible again.