toolforge-jobs: Clean up old individual job objects
Closed, Resolved | Public | BUG REPORT

Description

In T285944#7194862, @Majavah wrote:
  • toolforge-jobs -h describes list and delete in terms of running jobs, but it looks like jobs stay in toolforge-jobs list after completing; are we expected to toolforge-jobs delete each job when we no longer care about it (or run toolforge-jobs flush)?

Sounds like a bug, since leaving them hanging around indefinitely will create problems for the cluster as a whole.

The jobs framework already cleans up old job objects for cron jobs, but not for individual jobs. Leaving them hanging around would likely create problems with Kubernetes storage, so let's not do that.

Event Timeline

Jobs are already created with .spec.ttlSecondsAfterFinished = 0.

Looks like the Kubernetes feature to delete jobs with ttlSecondsAfterFinished is in Alpha in 1.18 (our current k8s version), so it's not enabled by default. It graduated to Beta and was enabled by default in 1.21.
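For reference, here is a minimal sketch of where that field sits in a Job object. The `batch/v1` Job kind and `spec.ttlSecondsAfterFinished` are real Kubernetes API fields; the helper name `build_job_manifest` and the example name, image, and command are hypothetical and are not taken from the jobs framework code.

```python
import json


def build_job_manifest(name, image, command, ttl_seconds=0):
    """Return a batch/v1 Job manifest as a plain dict.

    ttl_seconds maps to .spec.ttlSecondsAfterFinished. A value of 0 asks
    Kubernetes to delete the Job as soon as it finishes, but only when the
    TTL-after-finished controller is enabled in the cluster.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "ttlSecondsAfterFinished": ttl_seconds,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        # Hypothetical container definition, for illustration only.
                        {"name": name, "image": image, "command": command},
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = build_job_manifest("myjob", "example-image:latest", ["./myjob.sh"])
    print(json.dumps(manifest, indent=2))
```

On a cluster where the feature gate is off, the field is accepted but has no effect, which matches the behaviour reported in this task.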

Change 704958 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] kubeadm: kubelet: enable TTLAfterFinished feature gate

https://gerrit.wikimedia.org/r/704958

I wasn't uncomfortable with having to delete each job individually after completion. It allowed me to review execution results and status.
But I totally understand the desire for them to be cleaned up automatically.

I've been reading the docs pointed out by @Majavah. I'm not a fan of having to modify kubelet arguments for this (see patch). I would prefer to wait until 1.21, when this is enabled by default, rather than modify the kubelet config (which is in turn managed by kubeadm, I believe).

I suspect we'll be on k8s 1.21 before we leave the beta phase for this.

> I suspect we'll be on k8s 1.21 before we leave the beta phase for this.

How long do you expect the beta phase to last? I've just finished the 1.18 upgrade, and based on that, each upgrade takes a fair bit of time to perform. On top of that, 1.20 removes a significant feature (Pod presets), and replacing it will take time too.

In T286108#7217298, @Majavah wrote:

> > I suspect we'll be on k8s 1.21 before we leave the beta phase for this.
>
> How long do you expect the beta phase to last?

I don't know. We would like it to be as short as possible. My current estimate ranges from 3 months to 1 year, depending on what we want to see accomplished.

This can be an even bigger issue with failed jobs. See T251027: "signatures" tool has failed job pods on Kubernetes cluster

The garbage collector should protect the control plane from E_TOO_MANY_PODS, but it confuses users. Perhaps we should start recording all usage of non-GA APIs in our cluster on a wiki page, so that we have an easier time looking for deprecations on upgrades. After all, since the policy decision on betas (https://kubernetes.io/blog/2020/08/21/moving-forward-from-beta/#avoiding-permanent-beta), nothing in k8s is guaranteed except GA APIs.

Mentioned in SAL (#wikimedia-cloud) [2021-07-21T10:47:13Z] <arturo> enabling TTLAfterFinished feature gate on kubeadm live configmap (T286108)

Mentioned in SAL (#wikimedia-cloud) [2021-07-21T10:51:13Z] <arturo> enabling TTLAfterFinished feature gate on static pod manifests on /etc/kubernetes/manifests/kube-{apiserver,controller-manager}.yaml in all 3 control nodes (T286108)

Change 704958 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] kubeadm: enable TTLAfterFinished feature gate

https://gerrit.wikimedia.org/r/704958

Mentioned in SAL (#wikimedia-cloud) [2021-07-21T11:01:45Z] <arturo> enabling TTLAfterFinished feature gate on static pod manifests on /etc/kubernetes/manifests/kube-{apiserver,controller-manager}.yaml in all 3 control nodes (T286108)

Mentioned in SAL (#wikimedia-cloud) [2021-07-21T11:04:57Z] <arturo> enabling TTLAfterFinished feature gate on kubeadm live configmap (T286108)
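As a rough illustration of the manifest change logged above, the sketch below adds a gate to the `command` argument list of a control plane component. `--feature-gates` is the real flag accepted by kube-apiserver and kube-controller-manager; the helper `enable_feature_gate` and the sample argument lists are hypothetical and do not reproduce the actual puppet or kubeadm change.

```python
def enable_feature_gate(command, gate):
    """Return a copy of a container command list with `gate` set to true.

    If a --feature-gates=... argument already exists, the gate is merged
    into it; otherwise a new argument is appended.
    """
    prefix = "--feature-gates="
    entry = f"{gate}=true"
    result = []
    found = False
    for arg in command:
        if arg.startswith(prefix):
            found = True
            gates = [g for g in arg[len(prefix):].split(",") if g]
            # Drop any earlier setting for this gate before enabling it.
            gates = [g for g in gates if g.split("=", 1)[0] != gate]
            gates.append(entry)
            arg = prefix + ",".join(gates)
        result.append(arg)
    if not found:
        result.append(prefix + entry)
    return result


if __name__ == "__main__":
    # Hypothetical, heavily shortened command list.
    print(enable_feature_gate(["kube-apiserver", "--secure-port=6443"], "TTLAfterFinished"))
```

The function returns a new list instead of mutating its input, so the same helper can be applied to both manifests without side effects.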

Change 705864 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-api@main] devel/README.md: document TTLAfterFinished feature gate

https://gerrit.wikimedia.org/r/705864

Change 705865 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-api@main] jobs: adjust garbage collection

https://gerrit.wikimedia.org/r/705865

Change 705866 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-cli@master] wait: if the job doesn't exists it means it was already pruned by k8s

https://gerrit.wikimedia.org/r/705866

Change 705866 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-cli@master] wait: if the job doesn't exists it means it was already pruned by k8s

https://gerrit.wikimedia.org/r/705866
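The idea in the commit title above can be sketched as a polling loop that treats a missing job as finished. This is not the actual jobs-framework-cli code: `wait_for_job`, the `fetch_status` callback, and the status strings are all hypothetical, and `fetch_status` is assumed to return None when the API reports that the job no longer exists.

```python
import time


def wait_for_job(fetch_status, timeout=30.0, interval=1.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until the job finishes, disappears, or times out.

    fetch_status is a hypothetical callable returning a status string, or
    None when the job object does not exist. With ttlSecondsAfterFinished = 0
    a missing job most likely means Kubernetes already pruned it after
    completion, so None ends the wait instead of raising an error.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status is None:
            return "pruned"
        if status in ("succeeded", "failed"):
            return status
        if clock() >= deadline:
            return "timeout"
        sleep(interval)


if __name__ == "__main__":
    # Simulated API: the job runs for two polls, then is gone.
    responses = iter(["running", "running", None])
    print(wait_for_job(lambda: next(responses), sleep=lambda seconds: None))
```

One trade-off of this approach is that a job that never existed is indistinguishable from one that finished and was pruned.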

Change 705864 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-api@main] devel/README.md: document TTLAfterFinished feature gate

https://gerrit.wikimedia.org/r/705864

Change 705865 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-api@main] jobs: adjust garbage collection

https://gerrit.wikimedia.org/r/705865

aborrero claimed this task.