
Open Grid Engine Job dumps core (node)
Closed, ResolvedPublic

Description

I've been trying to use Open Grid Engine to manage a job for a bot in the anon Tools-Labs project. I could see in anon.err that the job was dumping core, and YuviPanda in Cloud-Services advised me to start the job on trusty.tools.wmflabs.org like this:

jstart -l release=trusty -N anon /usr/bin/node anon/anon.js --config /data/project/anon/config.json --verbose

Unfortunately, I'm still seeing errors in /data/project/anon/anon.err:

FATAL ERROR: v8::Context::New() V8 is no longer usable
/var/spool/gridengine/execd/tools-exec-12/job_scripts/7335380: line 4:  7662 Aborted                 (core dumped) /usr/bin/nodejs /data/project/anon/anon/anon.js --config /data/project/anon/config.json --verbose

What's weird is that running the job manually on trusty.tools.wmflabs.org with the command from the error message seems to work fine:

/usr/bin/nodejs anon/anon.js --config /data/project/anon/config.json --verbose

Please advise. I currently have this bot running manually outside of the grid, but it would be nice to have it run like other bots.

Event Timeline

edsu raised the priority of this task to Needs Triage.
edsu updated the task description. (Show Details)
edsu added a project: Toolforge.
edsu subscribed.
edsu triaged this task as Lowest priority. Jan 16 2015, 1:31 PM
edsu set Security to None.

Usually, when something works interactively but not as a grid job, the job is running out of memory: the default limit (256 MByte) is often insufficient, and some programs report this resource deficit in a very confusing way. Could you please try starting the job with -mem 512M, -mem 1G, etc. to see if that solves the issue?
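
For example, a possible retry based on the jstart invocation from the description, with the suggested -mem flag added (512M is just a starting point; increase it if the job still aborts):

jstart -l release=trusty -mem 512M -N anon /usr/bin/node anon/anon.js --config /data/project/anon/config.json --verbose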

Yes, this seemed to work:

jstart -N anon -mem 1g /usr/bin/node anon/anon.js --config /data/project/anon/config.json --verbose

Honestly, I had no idea it was using that much memory. Is it OK to run bot jobs that use that much memory?

In general, that is not a problem. If your jobs were continuously using hundreds of gigabytes, there would need to be a discussion about whether your tool/bot could be optimized or new hardware resources should be added (or, in the absolute worst case, whether the tool/bot should be disabled if the other options are not feasible), but 1 GByte is usually far below the radar.
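
If you want to check how much memory the job actually peaks at, a minimal sketch (assuming the standard Grid Engine accounting tools are available on the submit host, and that "anon" is the job name) would be:

qstat -j anon | grep usage
qacct -j anon | grep maxvmem

The first shows the resource usage line (including maxvmem) of the running job; the second reports the peak virtual memory of completed runs.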

scfc claimed this task.