
log files not written
Closed, InvalidPublic

Description

Some of my tasks started with e.g. jstart -N ca-feedcheck -mem 1800m java (...) don't write log files, although they are running: I'd expect ca-feedcheck.out and ca-feedcheck.err to be in my home directory. For some tasks this works; for others it sometimes doesn't. I don't know how to reproduce it, but it has happened more than once. It *might* be related to the fact that Tool Labs sometimes has problems starting my tasks: the JVM crashes due to memory problems but is then restarted.

Event Timeline

dnaber raised the priority of this task from to Needs Triage.
dnaber updated the task description.
dnaber added a project: Toolforge.
dnaber added a subscriber: dnaber.

If I look at the logs for that job (grep /var/lib/gridengine/default/common/accounting for the job numbers and feed them to qacct -j), they show that the jobs exited with status 143, i.e. 128 + 15, where 15 is SIGTERM. This is consistent with the job running out of memory and then being killed by /usr/local/bin/jobkill (SIGINT -> SIGTERM -> SIGKILL).
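The exit-status arithmetic can be checked directly in the shell: any exit status above 128 encodes "terminated by signal (status - 128)". A minimal sketch:

```shell
# Decode a grid engine exit status: values above 128 mean
# the process was terminated by signal (status - 128).
status=143
if [ "$status" -gt 128 ]; then
    echo "killed by signal $((status - 128))"   # 143 - 128 = 15 = SIGTERM
fi
```

Running `kill -l 15` confirms that signal 15 is TERM.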

But AFAIK JVMs request all their memory at start and then don't request more. Are you calling an external program from within your Java program?

JVMs can also request memory after start, depending on how you start them. I've now added -Xms100M to make sure they allocate that amount up front, i.e. no further memory requests.
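For reference, an invocation combining the pieces mentioned above might look like the sketch below. The jar name is purely illustrative; -Xms sets the initial heap size and -Xmx the maximum, so setting both to similar values keeps the JVM from growing its heap later:

```shell
# Illustrative sketch, not the actual command line from this task:
# -mem 1800m is the grid engine memory limit, -Xms/-Xmx the JVM heap bounds.
jstart -N ca-feedcheck -mem 1800m \
    java -Xms100M -Xmx512M -jar feedcheck.jar
```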

Another thing that may help with memory is to use -jamvm, a different JVM implementation that is considerably more frugal with its memory allocation.

I forgot to mention that I restart the jobs every 24 hours with e.g. qmod -rj ca-feedcheck. Could this be a problem? Will the memory settings be kept?
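A restart like that would typically be driven from a crontab entry; a hedged sketch (the time of day is arbitrary, and it assumes qmod is on cron's PATH):

```shell
# crontab fragment: soft-restart the ca-feedcheck job once a day at 03:00
# m h  dom mon dow  command
0 3 * * * qmod -rj ca-feedcheck
```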

I think qmod -rj will keep the memory settings, but it will nevertheless kill the "previous" job with -TERM if the process does not react to the "milder" signals.

However, it also kills my desire to debug this further :-). The more complex your setup, the more likely it is that some part of it inadvertently moved files around or killed jobs, rather than the grid engine running amok. Without steps to reproduce, it is impossible to determine where the error (if any) lies.

scfc claimed this task.

It's unclear to me whether there is an issue at all and, if so, whether it is caused by the Tools infrastructure. Please reopen if you encounter new problems.