
"signatures" tool has failed job pods on Kubernetes cluster
Closed, Resolved · Public

Description

$ kubectl get pod -o wide | awk '!/Running|Completed/ {n=split($8, host, "."); print $4, host[1], $2, $1}'
RESTARTS NOMINATED READY NAME
0 <none> 0/1 signatures.sigprobs-5glgs
0 <none> 0/1 signatures.sigprobs-cfsnl
0 <none> 0/1 signatures.sigprobs-hdvqx
0 <none> 0/1 signatures.sigprobs-ltbnj
0 <none> 0/1 signatures.sigprobs-nggqc
$ kubectl logs signatures.sigprobs-5glgs
Traceback (most recent call last):
  File "/data/project/signatures/signatures/src/sigprobs.py", line 461, in <module>
    handle_args()
  File "/data/project/signatures/signatures/src/sigprobs.py", line 430, in handle_args
    with output_file(output, hostname, args.overwrite) as f:
  File "/data/project/signatures/signatures/src/sigprobs.py", line 453, in output_file
    return out_file.open("w") if overwrite else out_file.open("x")
  File "/usr/lib/python3.7/pathlib.py", line 1186, in open
    opener=self._opener)
  File "/usr/lib/python3.7/pathlib.py", line 1039, in _opener
    return self._accessor.open(self, flags, mode)
FileExistsError: [Errno 17] File exists: '/data/project/signatures/signatures/data/en.wikipedia.org.json'

It looks like the job description that fired off these pods has been deleted, so maybe there is nothing else to do here? I'm not sure at what point Kubernetes will clean up the failed pods however.
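If the leftover pods do need manual cleanup, something along these lines should work. This is only a sketch, assuming the standard kubernetes Python client and with the namespace name as a placeholder; a plain kubectl delete pod would do the same thing from the command line.

# Hypothetical cleanup sketch: delete leftover Failed pods whose owning Job
# is already gone. Namespace name is a placeholder, not the real one.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
ns = "tool-signatures"  # assumption: substitute the tool's actual namespace

for pod in v1.list_namespaced_pod(ns, field_selector="status.phase=Failed").items:
    v1.delete_namespaced_pod(pod.metadata.name, ns)
    print("deleted", pod.metadata.name)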

Event Timeline

https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#job-termination-and-cleanup says:

When a Job completes, no more Pods are created, but the Pods are not deleted either. Keeping them around allows you to still view the logs of completed pods to check for errors, warnings, or other diagnostic output. The job object also remains after it is completed so that you can view its status. It is up to the user to delete old jobs after noting their status.

@Bstorm, this makes me wonder if we should add an admission controller that adds/enforces spec.activeDeadlineSeconds and spec.ttlSecondsAfterFinished settings for all jobs. Thoughts?
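For illustration, here is a minimal sketch of the kind of mutating webhook being discussed, not an existing Toolforge component: a small Flask service that patches default values for spec.activeDeadlineSeconds and spec.ttlSecondsAfterFinished into any Job that omits them. The endpoint path, the default values, and the choice of Flask are all assumptions, and a real deployment would also need TLS plus a MutatingWebhookConfiguration, which are omitted here.

# Sketch of a mutating admission webhook for Jobs (assumptions throughout).
import base64
import json

from flask import Flask, request, jsonify

app = Flask(__name__)

DEFAULT_ACTIVE_DEADLINE = 86400    # assumed default: fail jobs running > 1 day
DEFAULT_TTL_AFTER_FINISHED = 3600  # assumed default: GC finished jobs after 1 hour


@app.route("/mutate", methods=["POST"])
def mutate():
    review = request.get_json()
    job_spec = review["request"]["object"]["spec"]

    # Build a JSON Patch that only adds the fields the Job did not set itself.
    patch = []
    if "activeDeadlineSeconds" not in job_spec:
        patch.append({"op": "add",
                      "path": "/spec/activeDeadlineSeconds",
                      "value": DEFAULT_ACTIVE_DEADLINE})
    if "ttlSecondsAfterFinished" not in job_spec:
        patch.append({"op": "add",
                      "path": "/spec/ttlSecondsAfterFinished",
                      "value": DEFAULT_TTL_AFTER_FINISHED})

    response = {"uid": review["request"]["uid"], "allowed": True}
    if patch:
        response["patchType"] = "JSONPatch"
        response["patch"] = base64.b64encode(json.dumps(patch).encode()).decode()

    return jsonify({"apiVersion": "admission.k8s.io/v1",
                    "kind": "AdmissionReview",
                    "response": response})


if __name__ == "__main__":
    # The API server requires HTTPS for webhooks; plain HTTP here is only
    # for illustration.
    app.run(port=8443)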

The pod failure was caused by me forgetting that I had changed the default value for something and then not closely supervising the job.

The job description is generated by sigprops_start.py so that I can more easily pass parameters into the job. I thought I had a small spec.backoffLimit in there already, but I didn't. I've added that and a reasonable spec.ttlSecondsAfterFinished now.
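For context, both fields sit at the top level of the Job spec. The following is a hypothetical sketch of the kind of manifest a generator script could emit, not the actual sigprops_start.py output; the image, command, and numeric values are placeholders.

# Hypothetical generated Job manifest (illustrative values only).
job_manifest = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "signatures.sigprobs"},
    "spec": {
        # Give up after a few retries instead of spawning pod after pod.
        "backoffLimit": 2,
        # Let Kubernetes garbage-collect the finished Job and its pods.
        "ttlSecondsAfterFinished": 86400,
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "sigprobs",
                    "image": "example/python37:latest",  # placeholder image
                    "command": ["python3", "sigprobs.py", "en.wikipedia.org"],
                }],
            }
        },
    },
}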

spec.activeDeadlineSeconds wouldn't help with this problem, as it'll still fail the job and leave the pods. I'm not sure enforcing it would be ideal, as there are legitimate reasons to run a job of indefinite length that stops completely on failure (instead of being restarted like a Deployment).

A mutating and/or validating admission controller for jobs may end up being necessary, yeah, but I think we should build some tooling around jobs before/concurrent with when we try to do that.

Overall, I think jobs and cronjobs need a bit of design and abstraction. The blank check they provide (while the pods managed by them are not blank checks) is bound to make this confusing. The GC settings and quotas should prevent Total Annihilation from bad job settings, so I'm not immensely worried. I just think we need to make jobs and cronjobs more straightforward than this whole DIY setup, or they'll not be very supportable or widely adoptable.

> I just think we need to make jobs and cronjobs more straightforward than this whole DIY setup, or they'll not be very supportable or widely adoptable.

+1 for sure to some helper tooling.

There will always be some folks who walk around that guard rail and straight into the raw Kubernetes experience even with tooling though. I haven't done any experiments to see if the quotas actually apply to complete/failed jobs. I guess that's my initial concern in a nutshell: can jobs "leak" resources by not being counted in quota but still keeping metadata and/or images active on the Kubernetes workers?

> There will always be some folks who walk around that guard rail and straight into the raw Kubernetes experience even with tooling though. I haven't done any experiments to see if the quotas actually apply to complete/failed jobs. I guess that's my initial concern in a nutshell: can jobs "leak" resources by not being counted in quota but still keeping metadata and/or images active on the Kubernetes workers?

They should not be able to leak running pods (but we can test), and the garbage collector settings will clean up old job pods, which at that point are basically just memory on the control plane and disk space on the nodes, since the job isn't running anymore (we set terminated-pod-gc-threshold: 5000, down from the default of 12500, based on some napkin math that I cannot remember). The container logs are limited as well. Pods that stop, like job-controlled ones, and pods that end up in a crash loop or similar all behave this way: the garbage collector, the job definition, and other things (including the log limits) all come into play. I'm interested in testing what the pod quota does with dead pods that aren't running if we just create a pile of them.
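One quick way to run that test would be to compare what the ResourceQuota reports against the pods that actually exist in a tool namespace. A sketch assuming the standard kubernetes Python client; the namespace name is a placeholder.

# Hypothetical probe: does the "pods" quota count terminated pods?
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
ns = "tool-signatures"  # assumption: substitute a real tool namespace

pods = v1.list_namespaced_pod(ns).items
terminated = [p for p in pods if p.status.phase in ("Succeeded", "Failed")]

for quota in v1.list_namespaced_resource_quota(ns).items:
    used = (quota.status.used or {}).get("pods")
    hard = (quota.status.hard or {}).get("pods")
    print(f"quota {quota.metadata.name}: pods {used}/{hard}; "
          f"{len(pods)} pods exist, {len(terminated)} of them terminated")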

The ttlSecondsAfterFinished feature is in alpha. It will not work without the feature gate being enabled on the API server and maybe the scheduler (the docs say the scheduler doesn't need it, but they have been wrong on some of these things). We might need to enable that before making k8s jobs a major product we promote--presuming it works right.

Overall, I'm inclined to say that we should build tooling and a validating controller that enforces that the tooling is used or tightly replicated. If there are enough people using it already, we could design the controllers first and the tooling second, I guess. I'm not very worried about large resource leaks, but the mechanism is not built with a lot of stops on it in k8s, in general. That's why I think the tooling is really important--partly to shore up that issue.

JHedden claimed this task.