
Make SGE be more informative about OOM kills
Closed, Declined, Public

Description

There are recurring inquiries about why jobs "stopped working all of a sudden". In most cases, SGE killed them because they exceeded their requested memory limit.

It would probably be nice to send a mail à la:

Your job #4711 was aborted because it exceeded its requested memory limit of x MByte.

If you want to increase the memory requested, please see https://wikitech.wikimedia.org/wiki/Help:Tool_Labs#something_helpful.

If you have questions, please see https://wikitech.wikimedia.org/wiki/Help:Tool_Labs#something_about_irc_channels_and_stuff.

to the job owner when a job is aborted due to memory exhaustion.
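A minimal sketch of how such a notification could be composed, assuming Python's standard email library. The function name build_oom_mail, its parameters, and the subject line are hypothetical; the URLs are the placeholder links from the description above, and actually sending the message (for example via a local MTA) is left out.

```python
from email.message import EmailMessage

# Placeholder links copied from the task description; the anchors are not real.
HELP_MEMORY = "https://wikitech.wikimedia.org/wiki/Help:Tool_Labs#something_helpful"
HELP_CONTACT = (
    "https://wikitech.wikimedia.org/wiki/Help:Tool_Labs"
    "#something_about_irc_channels_and_stuff"
)


def build_oom_mail(job_id: int, limit_mbyte: int, owner: str) -> EmailMessage:
    """Compose the proposed notice for a job that exceeded its memory limit.

    All argument names are illustrative; how the job id, limit, and owner
    address are obtained from SGE is outside the scope of this sketch.
    """
    msg = EmailMessage()
    msg["To"] = owner
    msg["Subject"] = f"Job #{job_id} aborted: memory limit exceeded"
    msg.set_content(
        f"Your job #{job_id} was aborted because it exceeded its requested "
        f"memory limit of {limit_mbyte} MByte.\n\n"
        f"If you want to increase the memory requested, please see {HELP_MEMORY}.\n\n"
        f"If you have questions, please see {HELP_CONTACT}.\n"
    )
    return msg
```

The returned message could then be handed to whatever mail transport the grid host already uses.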

Event Timeline

scfc set the priority of this task to Low.
scfc updated the task description.
scfc added a project: Toolforge.
scfc subscribed.

If we pass qsub -m a (mail on abort/reschedule) by default, and perhaps also -m e (mail on end of job, i.e. the continuous job crashed) for continuous jobs, the user will at least get an e-mail, although it's not overly informative:

Job 3383350 (dplinks-erwin85) Aborted
 Exit Status      = 137
 Signal           = KILL
 User             = nlwikibots
 Queue            = short-sol@willow.toolserver.org
 Host             = willow
 Start Time       = 12/23/2013 05:00:11
 End Time         = 12/23/2013 05:30:15
 CPU              = 00:00:01
 Max vmem         = 25.285M
failed assumedly after job because:
job 3383350.1 died through signal KILL (9)

(from my email archive)

However, it's an e-mail /and/ it shows max vmem, so it's at least somewhat clearer what happened. Maybe the exit status also tells something about the cause?
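On the exit status question: by the usual shell convention, a status above 128 means the process was terminated by signal number (status - 128), so 137 corresponds to signal 9, matching the "Signal = KILL" line in the report. A small sketch of that decoding, assuming a POSIX system; describe_exit_status is a hypothetical helper, not part of SGE.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Explain a shell-style exit status such as the 137 in the report above."""
    if status > 128:
        signum = status - 128
        try:
            # Look up the symbolic name, e.g. 9 -> SIGKILL on POSIX systems.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name} ({signum})"
    return f"exited with code {status}"
```

The status only identifies the signal, not who sent it or why, so it cannot by itself distinguish a memory-limit kill from any other SIGKILL.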

See http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html for qsub details.
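As a rough illustration of the defaults suggested above, a submission wrapper could add the mail options before calling qsub. This is only a sketch: qsub_mail_args is a hypothetical helper, and writing the two events together as "ae" is an assumption based on common usage that should be checked against the qsub man page.

```python
from typing import List


def qsub_mail_args(continuous: bool) -> List[str]:
    """Return the qsub mail options suggested in the comment above.

    'a' requests mail on abort/reschedule; 'e' requests mail on end of job.
    Continuous jobs get both, since their ending usually means a crash.
    """
    events = "ae" if continuous else "a"
    return ["-m", events]
```

A wrapper would splice the result into the argument list it passes to qsub, for example ["qsub"] + qsub_mail_args(True) + [script_path], where script_path is whatever job script the user submitted.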

SGE indeed does offer something like that, and I proposed enabling it in T52053 (you could even run continuous jobs without wrapper scripts by using epilog scripts that return a special exit code signalling that the job needs to be restarted :-)).

But would users who are surprised by the OOM kill read the mail with due diligence or just complain? :-)

I think adding a mail function to /usr/local/bin/jobkill is not much effort and promises some gain.

dcaro subscribed.

The grid has been deprecated upstream and will not receive any more developer time. Consider moving to the Kubernetes backend instead.
