Some bots on Toolforge, such as eranbot (which powers CopyPatrol), run for a long time and can get "stuck". Our admittedly hacky solution has been to set a maximum run time on the cron job with jsub … -l h_rt=4:05:00 -once …. This way the job is killed automatically after 4 hours and 5 minutes, and it starts up again 5 minutes later.
This feature appears to be missing from the new Toolforge jobs framework. When I asked on IRC, I was told that Kubernetes has an option for this (.spec.activeDeadlineSeconds), which could possibly be exposed in the wrapper API.
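For reference, Kubernetes expresses this deadline in whole seconds, so the grid engine value would need converting. A minimal sketch of that conversion, assuming the H:MM:SS form used in the jsub call above (the helper name `h_rt_to_seconds` is made up for illustration and is not part of any Toolforge tooling):

```python
def h_rt_to_seconds(h_rt: str) -> int:
    """Convert a grid engine h_rt value in H:MM:SS form to seconds."""
    hours, minutes, seconds = (int(part) for part in h_rt.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Candidate value for .spec.activeDeadlineSeconds on the Kubernetes Job
active_deadline_seconds = h_rt_to_seconds("4:05:00")
print(active_deadline_seconds)  # 14700
```

So the current 4 hour 5 minute limit would correspond to a deadline of 14700 seconds.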
We also need something similar to the -once flag. Apparently the .spec.parallelism option is set to 1 by default, so this may effectively already be the case with the Toolforge jobs framework. However, some may wish to increase this number or change the concurrency policy, so that may be worth implementing in the wrapper API as well.
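To make the request concrete, here is a rough sketch of the kind of CronJob manifest the wrapper API might end up generating if these options were exposed. The field names are standard Kubernetes ones, but everything else is an assumption for illustration: the `build_cronjob` helper, the schedule, the image and the command are placeholders, and this is not how the jobs framework actually builds its manifests. My understanding is that `concurrencyPolicy: Forbid` is the closest match to -once, since it skips a new run while the previous one is still active.

```python
import json


def build_cronjob(name, schedule, image, command, deadline_seconds,
                  parallelism=1, concurrency_policy="Forbid"):
    """Return a Kubernetes batch/v1 CronJob manifest as a plain dict.

    Hypothetical helper for illustration only.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            # Allow, Forbid or Replace; Forbid skips a run if one is active
            "concurrencyPolicy": concurrency_policy,
            "jobTemplate": {
                "spec": {
                    # Kubernetes terminates the Job once this many seconds pass
                    "activeDeadlineSeconds": deadline_seconds,
                    # Number of pods allowed to run at once for this Job
                    "parallelism": parallelism,
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [
                                {"name": name, "image": image, "command": command}
                            ],
                        }
                    },
                }
            },
        },
    }


# Placeholder values; 14700 seconds is the 4:05:00 limit from the jsub call
manifest = build_cronjob(
    name="example-bot",
    schedule="*/10 * * * *",
    image="example/image:latest",
    command=["./run-bot.sh"],
    deadline_seconds=14700,
)
print(json.dumps(manifest, indent=2))
```

With something like this, the maximum run time, the parallelism and the concurrency policy would each map to a single option that the wrapper API could expose.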