It already cleans up old job objects for cron jobs, but not for individual jobs. Leaving them hanging around would likely create problems with Kubernetes storage, so let's not do that.
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | JHedden | T251027 "signatures" tool has failed job pods on Kubernetes cluster
Resolved | | aborrero | T251917 Design the Jobs service in k8s
Resolved | | aborrero | T283238 Toolforge: develop jobs-framework-api
Resolved | | aborrero | T285944 Toolforge: beta phase for the new jobs framework
Resolved | BUG REPORT | aborrero | T286108 toolforge-jobs: Clean up old individual job objects
Event Timeline
Looks like the Kubernetes feature to delete jobs with ttlSecondsAfterFinished is in Alpha in 1.18 (our current k8s version), so it's not enabled by default. It graduated to Beta and was enabled by default in 1.21.
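For reference, a minimal sketch of a Job that uses this field, built as a plain Python dict. The job name, image, command and the 30-second default TTL are illustrative assumptions, not values taken from jobs-framework-api.

```python
import json


def build_job_manifest(name, image, command, ttl_seconds=30):
    """Return a batch/v1 Job manifest that asks Kubernetes to delete the
    Job (and its pods) ttl_seconds after it finishes.

    On 1.18 the field is only honoured when the TTLAfterFinished feature
    gate is enabled. name, image and command are hypothetical inputs.
    """
    if ttl_seconds < 0:
        raise ValueError("ttl_seconds must be non-negative")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # The TTL controller removes the Job this long after completion.
            "ttlSecondsAfterFinished": ttl_seconds,
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": command}
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = build_job_manifest("example-job", "example-image:latest", ["true"])
    print(json.dumps(manifest, indent=2))
```

Without the feature gate the API server accepts the manifest but drops the field, so the Job stays around until someone deletes it.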
Change 704958 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] kubeadm: kubelet: enable TTLAfterFinished feature gate
I wasn't uncomfortable with having to delete each job individually after completion. It allowed me to review execution results and status.
But I totally understand the desire for them to be auto-cleaned up.
I've been reading the docs pointed to by @Majavah. I'm not a fan of having to modify kubelet arguments for this (see patch). I'd prefer to wait until 1.21, where this is enabled by default, rather than modify the kubelet config (which is in turn managed by kubeadm, I believe).
I suspect we'll be on k8s 1.21 before we leave the beta phase for this.
How long do you expect the beta phase to last? I've just finished the 1.18 upgrade and, based on that, each upgrade takes a fair bit of time to perform. On top of that, 1.20 removes a significant feature (Pod presets), and replacing it takes time too.
I don't know. We would like it to be as short as possible. My current estimate ranges from 3 months to 1 year, depending on what we want to see accomplished.
This can be an even bigger issue with failed jobs. See T251027: "signatures" tool has failed job pods on Kubernetes cluster
The garbage collector should protect the control plane from E_TOO_MANY_PODS, but it confuses users. Perhaps we should start recording all usage of non-GA APIs in our cluster on a wiki page so that we have an easier time looking for deprecations on upgrades. After all, since the policy decision on betas (https://kubernetes.io/blog/2020/08/21/moving-forward-from-beta/#avoiding-permanent-beta), nothing in k8s is guaranteed except GA APIs.
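One way such an inventory could be bootstrapped is by classifying the apiVersion of each object in use. The sketch below assumes the objects are already available as parsed manifests (dicts); the function names are made up for illustration.

```python
import re

# Kubernetes API versions look like "v1", "v1beta1" or "v2alpha1",
# optionally prefixed by an API group, e.g. "batch/v1beta1".
_VERSION_RE = re.compile(r"^v\d+(?:(alpha|beta)\d+)?$")


def stability(api_version):
    """Return "GA", "beta" or "alpha" for an apiVersion string."""
    version = api_version.rsplit("/", 1)[-1]
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognised apiVersion: {api_version!r}")
    return match.group(1) or "GA"


def non_ga_usage(objects):
    """Yield (apiVersion, kind, name) for every object not on a GA API."""
    for obj in objects:
        if stability(obj["apiVersion"]) != "GA":
            yield obj["apiVersion"], obj["kind"], obj["metadata"]["name"]


if __name__ == "__main__":
    # Hypothetical sample objects, not an inventory of the real cluster.
    sample = [
        {"apiVersion": "batch/v1", "kind": "Job", "metadata": {"name": "a"}},
        {"apiVersion": "batch/v1beta1", "kind": "CronJob", "metadata": {"name": "b"}},
    ]
    for row in non_ga_usage(sample):
        print(*row)
```

This only looks at API versions. A feature-gated field inside a GA API, such as ttlSecondsAfterFinished in batch/v1, would not be flagged and would still need to be listed by hand.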
Mentioned in SAL (#wikimedia-cloud) [2021-07-21T10:47:13Z] <arturo> enabling TTLAfterFinished feature gate on kubeadm live configmap (T286108)
Mentioned in SAL (#wikimedia-cloud) [2021-07-21T10:51:13Z] <arturo> enabling TTLAfterFinished feature gate on static pod manifests on /etc/kubernetes/manifests/kube-{apiserver,controller-manager}.yaml in all 3 control nodes (T286108)
Change 704958 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] kubeadm: enable TTLAfterFinished feature gate
Mentioned in SAL (#wikimedia-cloud) [2021-07-21T11:01:45Z] <arturo> enabling TTLAfterFinished feature gate on static pod manifests on /etc/kubernetes/manifests/kube-{apiserver,controller-manager}.yaml in all 3 control nodes (T286108)
Mentioned in SAL (#wikimedia-cloud) [2021-07-21T11:04:57Z] <arturo> enabling TTLAfterFinished feature gate on kubeadm live configmap (T286108)
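For readers unfamiliar with the flag syntax: kube-apiserver and kube-controller-manager take feature gates as a comma-separated `--feature-gates=Name=bool` argument. The helper below is only an illustration of that syntax on a container command list. It is not how the change was applied here (that went through puppet and kubeadm), and the function name is made up.

```python
def enable_feature_gate(command, gate):
    """Return a copy of a container command list with `gate` set to true
    in its --feature-gates argument, adding the argument if it is missing."""
    prefix = "--feature-gates="
    result = []
    found = False
    for arg in command:
        if arg.startswith(prefix):
            found = True
            # Parse "A=true,B=false" into an ordered mapping.
            gates = dict(
                item.split("=", 1)
                for item in arg[len(prefix):].split(",")
                if item
            )
            gates[gate] = "true"
            arg = prefix + ",".join(f"{k}={v}" for k, v in gates.items())
        result.append(arg)
    if not found:
        result.append(f"{prefix}{gate}=true")
    return result


if __name__ == "__main__":
    # The --secure-port value is just a placeholder argument.
    print(enable_feature_gate(["kube-apiserver", "--secure-port=6443"], "TTLAfterFinished"))
```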
Change 705864 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[cloud/toolforge/jobs-framework-api@main] devel/README.md: document TTLAfterFinished feature gate
Change 705865 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[cloud/toolforge/jobs-framework-api@main] jobs: adjust garbage collection
Change 705866 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[cloud/toolforge/jobs-framework-cli@master] wait: if the job doesn't exists it means it was already pruned by k8s
Change 705866 merged by jenkins-bot:
[cloud/toolforge/jobs-framework-cli@master] wait: if the job doesn't exists it means it was already pruned by k8s
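The idea behind that CLI change can be sketched as a polling loop that treats "not found" as a terminal state. This is not the actual jobs-framework-cli code: `fetch_status`, `JobNotFound` and the status strings are assumptions made for the example.

```python
import time


class JobNotFound(Exception):
    """Raised by the (hypothetical) fetch function when the API returns 404."""


def wait_for_job(fetch_status, name, timeout=300.0, interval=1.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until the job finishes, is pruned, or the timeout expires.

    fetch_status(name) is assumed to return "running", "succeeded" or
    "failed", and to raise JobNotFound if the job object no longer exists.
    """
    deadline = clock() + timeout
    while True:
        try:
            status = fetch_status(name)
        except JobNotFound:
            # With a TTL set, a missing job most likely finished and was
            # already deleted by Kubernetes, so stop waiting.
            return "pruned"
        if status in ("succeeded", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job {name!r} did not finish in {timeout}s")
        sleep(interval)
```

One caveat of this approach is that a job that never existed (for example, a mistyped name) is indistinguishable from one that was pruned.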
Change 705864 merged by jenkins-bot:
[cloud/toolforge/jobs-framework-api@main] devel/README.md: document TTLAfterFinished feature gate
Change 705865 merged by jenkins-bot:
[cloud/toolforge/jobs-framework-api@main] jobs: adjust garbage collection
Mentioned in SAL (#wikimedia-cloud) [2021-07-21T11:58:38Z] <arturo> deploying jobs-framework-api 07346d715d17585db9c16dd152cc91ef0bea33c3 (T286108)
Mentioned in SAL (#wikimedia-cloud) [2021-07-21T11:59:12Z] <arturo> deploying jobs-framework-api 07346d715d17585db9c16dd152cc91ef0bea33c3 (T286108)