
Grid jobs from Toolforge killed, unknown cause?
Closed, Declined · Public

Description

The hourly jobs that update https://tools.wmflabs.org/snapshots/ quite often get killed before they finish.

Roughly once a month, a run gets killed, and from that point on all subsequent runs also die before finishing, which means the snapshots stop updating.

If I then run the script manually (once) from tools-login, the hourly updates work again afterwards.

crontab
0 * * * * /usr/bin/jsub -N snapshots-updateSnaphots -once -quiet -release trusty -mem 2048m ~/update.sh
update.sh
php /data/project/snapshots/src/mwSnapshots/scripts/updateSnaphots.php > /data/project/snapshots/src/mwSnapshots/logs/updateSnaphots.log 2>&1

Whenever I find that the snapshots are stale, the log file for the last run is incomplete and there are no running or queued jobs in qstat. When the crontab next triggers (or when I run jsub manually) I can reproduce the same thing: the job starts, the log file starts, and after a few minutes the job stops.
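
For reference, a minimal sketch of what update.sh could look like with start/end markers, so that a killed run (no end marker written) is easy to tell apart in the log from a script that failed on its own; the marker format is purely illustrative:

#!/bin/bash
# Sketch: same command as the current update.sh, plus start/end markers.
LOG=/data/project/snapshots/src/mwSnapshots/logs/updateSnaphots.log
echo "=== start $(date -u +%FT%TZ) ===" > "$LOG"
php /data/project/snapshots/src/mwSnapshots/scripts/updateSnaphots.php >> "$LOG" 2>&1
rc=$?
# If the grid kills the job (e.g. for exceeding -mem 2048m), this end marker never appears.
echo "=== end $(date -u +%FT%TZ) exit=$rc ===" >> "$LOG"
exit $rc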

I'm aware of qacct, but I have been unable to get a response to any query I send it. I can't get a result for recent jobs by name, and after submitting a new job and querying it by job ID, I still get no response; the command just hangs indefinitely.

Tried:

$ qacct -d 1 -j snapshots-updateSnaphots
$ qacct -j 2495263
$ qacct -j 2495263 -o tools.snapshots

Using https://tools.wmflabs.org/?status to track the job manually, I see the job appear on that page and then disappear.
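
The same tracking can be done from the command line; a minimal sketch, assuming $JOBID is the ID that jsub prints at submission time (2495263 is just the example job ID from above):

JOBID=2495263
# Poll the grid every 30 seconds until the job is no longer known to qstat.
while qstat -j "$JOBID" > /dev/null 2>&1; do
    date -u +%FT%TZ
    sleep 30
done
echo "job $JOBID is no longer in qstat"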

Event Timeline

I'm aware of qacct but I have been unable to get a response to any of the queries I send it. [...] The command just hangs indefinitely.

qacct ate the NFS; it runs for a long but finite time :( If you really insist on reading the logs, either tail -c the accounting file somehow (that might need a custom parser, like the one used by the grid-jobs tool), or run it from an idle host such as tools-bastion-05 so as not to affect other users.
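
Something along these lines, run from an idle host; the accounting file path below is an assumption (it is whatever file the grid-jobs tool parses), and the field positions follow the classic SGE accounting(5) format:

# Read only the tail of the (large, NFS-hosted) accounting log instead of letting
# qacct scan all of it, then pick out records for this job name.
ACCT=/data/project/.system/accounting   # path is an assumption
tail -c 50000000 "$ACCT" | awk -F: '$5 == "snapshots-updateSnaphots" { print "job="$6, "failed="$12, "exit_status="$13 }'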

We talked about this problem a bit on IRC. The likely cause of death is the job exceeding its memory limit. It may be that git gc is what pushes it over, which would explain why the failure shows up somewhat randomly and then goes away after a manual run compresses the index(es).
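
If git gc does turn out to be the culprit, one possible mitigation (a sketch only, not something that has been tried here) is to cap git's memory use in the tool's clone, or to disable automatic gc in the hourly job and run it separately with more headroom; the limit values below are illustrative:

# Run once inside the repository that updateSnaphots.php keeps up to date.
git config gc.auto 0               # stop the hourly job from triggering automatic gc
git config pack.threads 1          # a single packing thread means a single memory arena
git config pack.windowMemory 256m  # cap memory used for the delta window during repack
git config pack.deltaCacheSize 128m

# Then garbage-collect explicitly from an interactive session, or from a
# separate job submitted with a higher -mem limit:
git gc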

The problem of qacct being slow to the point of uselessness and also a trigger for overloading the NFS servers is separate, but real. The Cloud Services team has talked about changes to how the audit logs are stored and rotated that might make this better.

@bd808 Thanks. Is there a task for the general problem with (or solution for) qacct? If so, feel free to close this in favour of that.