
Java jobs stop working
Closed, Invalid · Public

Description

Jobs I start with jstart often just stop working after a few days. Although there have been cases of memory problems, often the jobs stop without leaving anything useful in the logs (which might be related to https://phabricator.wikimedia.org/T85775). Here's how I start the jobs, simplified a bit:

jstart -N fr-feedcheck -mem 4500m java -Xmx180M -cp /path/to/wikipedia/languagetool-wikipedia.jar org.languagetool.dev.wikipedia.atom.AtomFeedCheckerCmd (+more options...)

I know that the jobs stop because they stop writing data to the DB and then the web app that uses this DB stops working. The same jobs used to be *much* more stable until around December 2014, I had jobs running for months without problems.

Related Objects

Event Timeline

dnaber raised the priority of this task to Needs Triage.
dnaber updated the task description.
dnaber added a project: Toolforge.
dnaber added a subscriber: dnaber.
dnaber set Security to None.
dnaber triaged this task as High priority. Feb 17 2015, 7:53 AM

I've set this to 'High' priority because the jobs need to run 24 hours a day to be really useful, and I currently have to log in about five times per week to restart them. If we cannot solve this issue, I'll need to leave Tool Labs and host the jobs on my own machine again. I changed the memory settings a bit (e.g. to jstart -N fr-feedcheck -mem 7500m java -Xms250M -Xmx250M ...), but that doesn't help; jobs still stop from time to time.

Please check the output of

qacct -j fr-feedcheck

to see what 'max vmem' is. Does it get up to 7500m?

In addition, is the job still running or not? If it's still running, it's not a memory issue.

Last but not least, have you tested what your code does when the SQL server connection breaks down? If the job is still running but entries don't show up in the DB, this might suggest that reconnections don't work.

Are you still not seeing *any* logs at all? In that case, try specifying the log files directly, using -o /mypath/output.txt -e /mypath/error.txt.

In about 90% of cases, the jobs are no longer running. What is the correct behavior when the DB cannot be reached? Just crash and assume the process will be restarted automatically, or retry the connection after a wait period?

I do see some logs, but I didn't find useful information about why the jobs stop. I'll try the memory command later.
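For the retry-after-a-wait-period option mentioned above, a minimal sketch of a JDBC reconnect helper might look as follows. This is not code from the tool itself; the class name, URL, and retry parameters are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical helper: retry the DB connection with a wait period instead of
// letting a transient outage kill the long-running job.
public class RetryConnect {

    static Connection connectWithRetry(String url, String user, String pass,
                                       int maxAttempts, long waitMillis)
            throws SQLException, InterruptedException {
        SQLException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return DriverManager.getConnection(url, user, pass);
            } catch (SQLException e) {
                last = e;  // remember the failure, wait, then try again
                System.err.println("Connect attempt " + attempt
                        + " failed: " + e.getMessage());
                Thread.sleep(waitMillis);
            }
        }
        throw last;  // all attempts exhausted
    }
}
```

Whether this is preferable to crashing and relying on an external restart depends on whether jstart reliably restarts the job; if it does, fail-fast is the simpler design.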

qacct -j fr-feedcheck | grep maxvmem | grep -v "0.000" gives this:

maxvmem      1.440G
maxvmem      1.706G
maxvmem      1.440G
maxvmem      1.441G
maxvmem      1.258G
maxvmem      1.453G
maxvmem      1.721G
maxvmem      1.386G
maxvmem      1.386G
maxvmem      1.437G
maxvmem      1.741G
maxvmem      1.692G
maxvmem      1.692G
maxvmem      1.488G
maxvmem      1.821G
maxvmem      1.467G
maxvmem      1.788G
maxvmem      1.848G
maxvmem      1.518G
maxvmem      1.736G
maxvmem      1.848G
maxvmem      1.522G
maxvmem      1.813G
maxvmem      1.736G
maxvmem      1.786G
maxvmem      1.518G
maxvmem      1.787G
maxvmem      1.784G
maxvmem      1.789G
maxvmem      1.850G
maxvmem      1.784G
maxvmem      1.951G
maxvmem      1.858G
maxvmem      1.682G

I'm not sure what to make of that. The job was started with jstart -N fr-feedcheck -mem 7500m java -Xms250M -Xmx250M (...), which means the Java VM should not take more than 250M of heap, and if that's not enough it should cleanly exit with an OutOfMemoryError.

AFAIUI, -Xmx250M limits only the Java application's heap to 250 MB; the Java VM adds its own footprint on top of that. A test program:

public class HelloWorld {

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Tired.");
        Thread.sleep(180 * 1000);  // sleep for three minutes
        System.out.println("Awake.");
    }
}

Run with jsub -mem 2000m java -Xmx500m HelloWorld, it consumed:

[…]
maxvmem      1.902G
[…]
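One way to see the heap cap the JVM actually enforces, as opposed to the virtual memory that qacct's maxvmem reports, is to query the Runtime directly. A small check along these lines (the class name is an illustrative assumption) shows that with -Xmx250M the reported heap is roughly 250 MB even though the process's vmem is much larger:

```java
// Prints the maximum heap the JVM will use, e.g. run as:
//   java -Xmx250M HeapInfo
// The value reflects -Xmx, not the total process footprint that
// Grid Engine's maxvmem accounting measures.
public class HeapInfo {
    public static void main(String[] args) {
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap (MB): " + maxHeapMb);
    }
}
```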
scfc claimed this task.

We're not running on Tool Labs anymore, so I cannot help in reproducing this. Feel free to close.