
"signatures" tool has failed job pods on Kubernetes cluster
Closed, Resolved · Public

Description

$ kubectl get pod -o wide | awk '!/Running|Completed/ {n=split($8, host, "."); print $4, host[1], $2, $1}'
RESTARTS NOMINATED READY NAME
0 <none> 0/1 signatures.sigprobs-5glgs
0 <none> 0/1 signatures.sigprobs-cfsnl
0 <none> 0/1 signatures.sigprobs-hdvqx
0 <none> 0/1 signatures.sigprobs-ltbnj
0 <none> 0/1 signatures.sigprobs-nggqc
$ kubectl logs signatures.sigprobs-5glgs
Traceback (most recent call last):
  File "/data/project/signatures/signatures/src/sigprobs.py", line 461, in <module>
    handle_args()
  File "/data/project/signatures/signatures/src/sigprobs.py", line 430, in handle_args
    with output_file(output, hostname, args.overwrite) as f:
  File "/data/project/signatures/signatures/src/sigprobs.py", line 453, in output_file
    return out_file.open("w") if overwrite else out_file.open("x")
  File "/usr/lib/python3.7/pathlib.py", line 1186, in open
    opener=self._opener)
  File "/usr/lib/python3.7/pathlib.py", line 1039, in _opener
    return self._accessor.open(self, flags, mode)
FileExistsError: [Errno 17] File exists: '/data/project/signatures/signatures/data/en.wikipedia.org.json'

It looks like the job description that fired off these pods has been deleted, so maybe there is nothing else to do here? I'm not sure at what point Kubernetes will clean up the failed pods however.
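If the leftover pods do need manual cleanup, something along these lines should work. This is only a sketch, assuming the standard kubernetes Python client and with the namespace name as a placeholder; a plain kubectl delete pod would do the same thing from the command line.

# Hypothetical cleanup sketch: delete leftover Failed pods whose owning Job
# is already gone. Namespace name is a placeholder, not the real one.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
ns = "tool-signatures"  # assumption: substitute the tool's actual namespace

for pod in v1.list_namespaced_pod(ns, field_selector="status.phase=Failed").items:
    v1.delete_namespaced_pod(pod.metadata.name, ns)
    print("deleted", pod.metadata.name)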

Event Timeline

https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#job-termination-and-cleanup says:

When a Job completes, no more Pods are created, but the Pods are not deleted either. Keeping them around allows you to still view the logs of completed pods to check for errors, warnings, or other diagnostic output. The job object also remains after it is completed so that you can view its status. It is up to the user to delete old jobs after noting their status.

@Bstorm, this makes me wonder if we should add an admission controller that adds/enforces spec.activeDeadlineSeconds and spec.ttlSecondsAfterFinished settings for all jobs. Thoughts?
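For illustration, here is a minimal sketch of the kind of mutating webhook being discussed, not an existing Toolforge component: a small Flask service that patches default values for spec.activeDeadlineSeconds and spec.ttlSecondsAfterFinished into any Job that omits them. The endpoint path, the default values, and the choice of Flask are all assumptions, and a real deployment would also need TLS plus a MutatingWebhookConfiguration, which are omitted here.

# Sketch of a mutating admission webhook for Jobs (assumptions throughout).
import base64
import json

from flask import Flask, request, jsonify

app = Flask(__name__)

DEFAULT_ACTIVE_DEADLINE = 86400    # assumed default: fail jobs running > 1 day
DEFAULT_TTL_AFTER_FINISHED = 3600  # assumed default: GC finished jobs after 1 hour


@app.route("/mutate", methods=["POST"])
def mutate():
    review = request.get_json()
    job_spec = review["request"]["object"]["spec"]

    # Build a JSON Patch that only adds the fields the Job did not set itself.
    patch = []
    if "activeDeadlineSeconds" not in job_spec:
        patch.append({"op": "add",
                      "path": "/spec/activeDeadlineSeconds",
                      "value": DEFAULT_ACTIVE_DEADLINE})
    if "ttlSecondsAfterFinished" not in job_spec:
        patch.append({"op": "add",
                      "path": "/spec/ttlSecondsAfterFinished",
                      "value": DEFAULT_TTL_AFTER_FINISHED})

    response = {"uid": review["request"]["uid"], "allowed": True}
    if patch:
        response["patchType"] = "JSONPatch"
        response["patch"] = base64.b64encode(json.dumps(patch).encode()).decode()

    return jsonify({"apiVersion": "admission.k8s.io/v1",
                    "kind": "AdmissionReview",
                    "response": response})


if __name__ == "__main__":
    # The API server requires HTTPS for webhooks; plain HTTP here is only
    # for illustration.
    app.run(port=8443)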

The pod failure was caused by me forgetting that I had changed the default value for something and then not closely supervising the job.

The job description is generated by sigprops_start.py so that I can more easily pass parameters into the job. I thought I had a small spec.backoffLimit in there already, but I didn't. I've added that and a reasonable spec.ttlSecondsAfterFinished now.
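For context, both fields sit at the top level of the Job spec. The following is a hypothetical sketch of the kind of manifest a generator script could emit, not the actual sigprops_start.py output; the image, command, and numeric values are placeholders.

# Hypothetical generated Job manifest (illustrative values only).
job_manifest = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "signatures.sigprobs"},
    "spec": {
        # Give up after a few retries instead of spawning pod after pod.
        "backoffLimit": 2,
        # Let Kubernetes garbage-collect the finished Job and its pods.
        "ttlSecondsAfterFinished": 86400,
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "sigprobs",
                    "image": "example/python37:latest",  # placeholder image
                    "command": ["python3", "sigprobs.py", "en.wikipedia.org"],
                }],
            }
        },
    },
}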

spec.activeDeadlineSeconds wouldn't help with this problem, as it'll still fail the job and leave the pods. I'm not sure enforcing it would be ideal, as there are legitimate reasons to run a job of indefinite length that stops completely on failure (instead of being restarted like a Deployment).

A mutating and/or validating admission controller for jobs may end up being necessary, yeah, but I think we should build some tooling around jobs before/concurrent with when we try to do that.

Overall, I think jobs and cronjobs need a bit of design and abstraction. The blank check they provide (while the pods managed by them are not blank checks) is bound to make this confusing. The GC settings and quotas should prevent Total Annihilation from bad job settings, so I'm not immensely worried. I just think we need to make jobs and cronjobs more straightforward than this whole DIY setup, or they'll not be very supportable or widely adoptable.

> I just think we need to make jobs and cronjobs more straightforward than this whole DIY setup, or they'll not be very supportable or widely adoptable.

+1 for sure to some helper tooling.

There will always be some folks who walk around that guard rail and straight into the raw Kubernetes experience even with tooling though. I haven't done any experiments to see if the quotas actually apply to complete/failed jobs. I guess that's my initial concern in a nutshell: can jobs "leak" resources by not being counted in quota but still keeping metadata and/or images active on the Kubernetes workers?

> There will always be some folks who walk around that guard rail and straight into the raw Kubernetes experience even with tooling though. I haven't done any experiments to see if the quotas actually apply to complete/failed jobs. I guess that's my initial concern in a nutshell: can jobs "leak" resources by not being counted in quota but still keeping metadata and/or images active on the Kubernetes workers?

They should not be able to leak running pods (but we can test), and the garbage collector settings will clean up old job pods, which at that point are basically just memory on the control plane and disk space on the nodes, since the job isn't running anymore (we set terminated-pod-gc-threshold: 5000, down from the default of 12500, based on some napkin math that I cannot remember). The container logs are limited as well. Pods that stop, like job-controlled ones, and pods that end up in a crash loop or similar all behave this way: the garbage collector, the job definition, and other things (including the log limits) all come into play. I'm interested in testing what the pod quota does with dead pods that aren't running if we just create a pile of them.
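One quick way to run that test would be to compare what the ResourceQuota reports against the pods that actually exist in a tool namespace. A sketch assuming the standard kubernetes Python client; the namespace name is a placeholder.

# Hypothetical probe: does the "pods" quota count terminated pods?
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
ns = "tool-signatures"  # assumption: substitute a real tool namespace

pods = v1.list_namespaced_pod(ns).items
terminated = [p for p in pods if p.status.phase in ("Succeeded", "Failed")]

for quota in v1.list_namespaced_resource_quota(ns).items:
    used = (quota.status.used or {}).get("pods")
    hard = (quota.status.hard or {}).get("pods")
    print(f"quota {quota.metadata.name}: pods {used}/{hard}; "
          f"{len(pods)} pods exist, {len(terminated)} of them terminated")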

The ttlSecondsAfterFinished feature is in alpha. It will not work without the feature gate being enabled on the API server and maybe the scheduler (the docs say the scheduler doesn't need it, but they have been wrong on some of these things). We might need to enable that before making k8s jobs a major product we promote--presuming it works right.

Overall, I'm inclined to say that we should build tooling and a validating controller that enforces that the tooling is used or tightly replicated. If there are enough people using it already, we could design the controllers first and the tooling second, I guess. I'm not very worried about large resource leaks, but the mechanism is not built with a lot of stops on it in k8s, in general. That's why I think the tooling is really important--partly to shore up that issue.

JHedden claimed this task.