
Job getting killed on k8s
Closed, Resolved · Public · BUG REPORT

Assigned To: bd808
Authored By: So9q, Jan 13 2022, 9:52 AM
Referenced Files
F34923372: bild.png (Jan 19 2022, 4:21 PM)
F34919820: bild.png (Jan 16 2022, 12:51 PM)
F34919787: bild.png (Jan 16 2022, 12:02 PM)
F34919033: bild.png (Jan 15 2022, 11:58 AM)
F34918589: bild.png (Jan 14 2022, 9:46 PM)
F34916976: bild.png (Jan 13 2022, 12:44 PM)
F34916827: bild.png (Jan 13 2022, 9:52 AM)

Description

Jobs with large output from pandas seem to get killed consistently on k8s

bild.png (489×785 px, 40 KB)

List of steps to reproduce (step by step, including full links if applicable):

  • git clone https://github.com/dpriskorn/WikidataMLSuggester/
  • edit the export_every_x_linenumber variable in extract-articles-from-swepub.py and set it to a very large number such as 400000
  • run ./create_kubernettes_article_job_and_watch_the_log.sh 1
  • wait for it to get killed during the pd.to_pickle() call (see the sketch after this list)
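
For context, a minimal sketch of the pattern this reproduces, assuming the script roughly accumulates rows into one pandas DataFrame and pickles them in a single call. This is not the actual extract-articles-from-swepub.py code; the input file and record layout are placeholders:

```python
import pandas as pd

# Illustration only: with export_every_x_linenumber set very high, the whole
# batch is held in memory and serialized in one to_pickle() call, so peak
# memory is roughly the full DataFrame plus the pickle buffer.
rows = []
with open("swepub.jsonl") as source:                # placeholder input file
    for line_number, line in enumerate(source, start=1):
        rows.append({"raw": line})                  # placeholder per-article record
        if line_number % 400_000 == 0:              # the "huge number" from the step above
            df = pd.DataFrame(rows)                 # the whole batch lives in memory here...
            df.to_pickle("articles.pkl.gz")         # ...plus the serialization buffer
            rows = []
```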

What happens?:
The job gets killed and shows "Killed" in the error log; see the screenshot above.

What should have happened instead?:
An error message that is easy for the user to understand, and/or better documentation of the limits of the pods / k8s cluster.

Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc.:

Event Timeline

The process was killed while writing its output.
Evidence of the partial write of the output file:

bild.png (653×1 px, 131 KB)

In T299121#7619402, @Majavah wrote:

No, but maybe related? This time the output was interrupted, but I could immediately start new jobs as usual.

Note this is low priority for me because I found a workaround: I simply output a pickle every x lines. Afterwards I can easily join them all into one big dataframe.
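
For reference, the rejoin half of that workaround can look like the sketch below; the chunk file naming is an assumption, not taken from the repo:

```python
from glob import glob

import pandas as pd

# Load the per-chunk pickles written every x lines and concatenate them back
# into one big DataFrame; "chunk-*.pkl" is a made-up naming scheme.
chunk_files = sorted(glob("chunk-*.pkl"))
big_df = pd.concat((pd.read_pickle(path) for path in chunk_files), ignore_index=True)
big_df.to_pickle("all-articles.pkl")
```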

A job just got killed again.
This time I was extracting using this exact code: https://github.com/dpriskorn/WikidataMLSuggester/commit/8f411459cbf685852aea9a238719988e5ba0611e

I started the job with ./create_kubernettes_article_job_and_watch_the_log.sh 1

It failed like this:

bild.png (673×950 px, 91 KB)

I am guessing that there is some kind of limit on how big a dataframe it can hold in memory / pickle to disk.
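
One generic way to test that guess (not code from the repo) is to measure how much RAM a chunk actually occupies before it is pickled:

```python
import pandas as pd

# Placeholder file name: any chunk that was written successfully before the kill.
df = pd.read_pickle("chunk-1.pkl")
# deep=True also counts the Python objects behind object-dtype columns (strings
# etc.), which is usually where most of the memory goes.
size_mib = df.memory_usage(deep=True).sum() / 1024**2
print(f"{len(df)} rows occupy roughly {size_mib:.0f} MiB in memory")
```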

10000 seems to work, but 25000, as in this case, did not:

bild.png (678×906 px, 91 KB)

After lowering to saving every 15000 lines it still gets killed.
10000 was OK earlier, so lowering to that.

Still getting killed with 10000 as the limit. Hm.

bild.png (670×926 px, 91 KB)

Lowered to saving every 1000 lines now. I hope that will solve the issue.

It did not. It still got killed and I more or less gave up on running this on k8s. I just tried running it on the bastion instead and got this:

bild.png (152×1 px, 24 KB)

I'm curious how this killing of processes is governed. On a shared webhost at Dreamhost I once had a job killed, but there I got a clear message, so I could see why it was killed (long-running wget jobs were not permitted).

aborrero moved this task from Backlog to Waiting for information on the Toolforge board.
bd808 claimed this task.
bd808 subscribed.

I'm curious how this killing of processes is governed.

The only thing that will make the system kill a job is running out of allocated memory. So the answer is that you are trying to stick more things in the container's allowed memory than will fit.

Perhaps try requesting more resources for the job; see https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework#Job_quotas

This help page section describes the default per-job quota for CPU and RAM, as well as how to increase those values up to the absolute limit currently allowed for the tool.
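
For example, a request for a bigger memory limit when starting the job might look like the sketch below; the job name, command, and image are placeholders, and the exact flag spellings, units, and allowed values should be checked against the help page above or `toolforge-jobs run --help`:

```
# Hypothetical example: ask for more memory and CPU than the per-job defaults.
# Verify flag names, units, and the image name against the linked help page.
toolforge-jobs run extract-articles \
  --command "python3 extract-articles-from-swepub.py" \
  --image tf-python39 \
  --mem 3Gi --cpu 1
```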