
Job getting killed on k8s
Closed, Resolved · Public · BUG REPORT

Assigned To: bd808
Authored By: So9q, Jan 13 2022, 9:52 AM
Referenced Files
F34923372: bild.png (Jan 19 2022, 4:21 PM)
F34919820: bild.png (Jan 16 2022, 12:51 PM)
F34919787: bild.png (Jan 16 2022, 12:02 PM)
F34919033: bild.png (Jan 15 2022, 11:58 AM)
F34918589: bild.png (Jan 14 2022, 9:46 PM)
F34916976: bild.png (Jan 13 2022, 12:44 PM)
F34916827: bild.png (Jan 13 2022, 9:52 AM)

Description

Jobs with large output from pandas seem to get killed consistently on k8s

bild.png (489×785 px, 40 KB)

List of steps to reproduce (step by step, including full links if applicable):

  • git clone https://github.com/dpriskorn/WikidataMLSuggester/
  • edit the export_every_x_linenumber variable in extract-articles-from-swepub.py and set it to a very large number such as 400000
  • run ./create_kubernettes_article_job_and_watch_the_log.sh 1
  • wait for it to get killed during the pd.to_pickle() call (see the sketch after this list)
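
For context, a minimal sketch of the pattern this reproduces, assuming the script roughly accumulates rows into one pandas DataFrame and pickles them in a single call. This is not the actual extract-articles-from-swepub.py code; the input file and record layout are placeholders:

```python
import pandas as pd

# Illustration only: with export_every_x_linenumber set very high, the whole
# batch is held in memory and serialized in one to_pickle() call, so peak
# memory is roughly the full DataFrame plus the pickle buffer.
rows = []
with open("swepub.jsonl") as source:                # placeholder input file
    for line_number, line in enumerate(source, start=1):
        rows.append({"raw": line})                  # placeholder per-article record
        if line_number % 400_000 == 0:              # the "huge number" from the step above
            df = pd.DataFrame(rows)                 # the whole batch lives in memory here...
            df.to_pickle("articles.pkl.gz")         # ...plus the serialization buffer
            rows = []
```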

What happens?:
The job gets killed and shows "Killed" in the error log; see the screenshot above.

What should have happened instead?:
An error message that is easy for the user to understand, and/or better documentation of the limits of the pods / k8s cluster.

Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc.:

Event Timeline

The process was killed while writing its output.
Evidence of the partial write of the output file:

bild.png (653×1 px, 131 KB)

In T299121#7619402, @Majavah wrote:

No, but maybe related? This time the output was interrupted, but I could immediately start new jobs as usual.

Note this is low priority for me because I found a workaround: I simply output a pickle every x lines. Afterwards I can easily join them all into one big dataframe.
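
For reference, the rejoin half of that workaround can look like the sketch below; the chunk file naming is an assumption, not taken from the repo:

```python
from glob import glob

import pandas as pd

# Load the per-chunk pickles written every x lines and concatenate them back
# into one big DataFrame; "chunk-*.pkl" is a made-up naming scheme.
chunk_files = sorted(glob("chunk-*.pkl"))
big_df = pd.concat((pd.read_pickle(path) for path in chunk_files), ignore_index=True)
big_df.to_pickle("all-articles.pkl")
```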

A job just got killed again.
This time I was extracting using this exact code: https://github.com/dpriskorn/WikidataMLSuggester/commit/8f411459cbf685852aea9a238719988e5ba0611e

I started the job with ./create_kubernettes_article_job_and_watch_the_log.sh 1

It failed like this:

bild.png (673×950 px, 91 KB)

I am guessing that there is some kind of limit on how big a dataframe it can hold in memory / pickle to disk.
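
One generic way to test that guess (not code from the repo) is to measure how much RAM a chunk actually occupies before it is pickled:

```python
import pandas as pd

# Placeholder file name: any chunk that was written successfully before the kill.
df = pd.read_pickle("chunk-1.pkl")
# deep=True also counts the Python objects behind object-dtype columns (strings
# etc.), which is usually where most of the memory goes.
size_mib = df.memory_usage(deep=True).sum() / 1024**2
print(f"{len(df)} rows occupy roughly {size_mib:.0f} MiB in memory")
```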

10000 seems to work, but 25000, as in this case, did not:

bild.png (678×906 px, 91 KB)

After lowering to saving every 15000 lines it still gets killed.
10000 was OK earlier, so lowering to that.

Still getting killed with 10000 as the limit. Hm.

bild.png (670×926 px, 91 KB)

Lowered to saving every 1000 lines now. I hope that will solve the issue.

It did not. It still got killed and I more or less gave up on running this on k8s. I just tried running it on the bastion instead and got this:

bild.png (152×1 px, 24 KB)

I'm curious how this killing of processes is governed. On a shared webhost at Dreamhost I once had a job killed, but there I got a clear message, so I could see why it was killed (long-running wget jobs were not permitted).

aborrero moved this task from Backlog to Waiting for information on the Toolforge board.
bd808 claimed this task.
bd808 subscribed.

I'm curious how this killing of processes is governed.

The only thing that will make the system kill a job is running out of allocated memory. So the answer is that you are trying to stick more things in the container's allowed memory than will fit.

Perhaps try requesting more resources for the job; see https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework#Job_quotas

This help page section describes the default per-job quota for CPU and RAM, as well as how to increase those values up to the absolute limit currently allowed for the tool.
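
For example, a request for a bigger memory limit when starting the job might look like the sketch below; the job name, command, and image are placeholders, and the exact flag spellings, units, and allowed values should be checked against the help page above or `toolforge-jobs run --help`:

```
# Hypothetical example: ask for more memory and CPU than the per-job defaults.
# Verify flag names, units, and the image name against the linked help page.
toolforge-jobs run extract-articles \
  --command "python3 extract-articles-from-swepub.py" \
  --image tf-python39 \
  --mem 3Gi --cpu 1
```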