Page MenuHomePhabricator

tools.taxonbot and tools.giftbot cronjobs not firing
Closed, ResolvedPublic

Description

I am still missing the execution of some diverse cronjobs last days, and I am not very lucky about it, because fellow Wikipedians rely on me and the working of my bots to set time as in the crontab stated.
(No error outputs in .out files at all.)

But there is nothing more I can say about it.
/data/project/taxonbot

  • The cronjob beta-newday.tcl should start 0:00 UTC today, but it did not without any error report in beta-newday.out
  • The cronjob lkday.tcl should start 23:00 UTC January 6th and 7th, but it did not without any error report in lkday.out
    • but same time, 23:00 UTC, two other cronjobs did start: autoarchiv0.tcl and autoarchiv-voy.tcl
  • Running jobs didn't give back an error report this time in their .out log.

I hope, that is better, but I have very less informations to help you. I really know, that is very difficult, sorry for that!
Maybe it's not SGE but instable crontab?

Yes, this is already much more useful: it tells us which project we're talking about, which cronjob (although it would have been even better if you had added the actual command from the cronjob), and which files should contain information about these jobs. Finally, it would be useful to know whether you can submit them to the queue outside of cron successfully, and whether these commands can be run safely at other times of the day.
For what it's worth: I tried running beta-newday with the same command as in the crontab, which started without any issues. That suggests either something transient with the grid (but then you should have received an e-mail with an error, I think), or something with cron. From the cron logs, it indeed seems the job has not fired:

Jan  9 23:58:01 tools-submit CRON[23166]: (tools.taxonbot) CMD (jsub -once -j y -quiet -v LC_ALL=en_US.UTF-8 -mem 1g -N aan ^I^I^I^Iautoarchiv0.tcl)
Jan 10 00:00:31 tools-submit CRON[25797]: (tools.taxonbot) CMD (jsub -once -j y -quiet -v LC_ALL=en_US.UTF-8 -mem 1g^I^I^I^I^I^I^Ilog-log.tcl)
Jan 10 00:00:32 tools-submit CRON[25846]: (tools.taxonbot) CMD (jsub -once -j y -quiet -v LC_ALL=en_US.UTF-8 -mem 1g ^I^I^I^I^I^I^Iaa-vm.tcl)
Jan 10 00:00:32 tools-submit CRON[25856]: (tools.taxonbot) CMD (jsub -once -j y -quiet -v LC_ALL=en_US.UTF-8 -mem 1g -N aan ^I^I^I^Iautoarchiv0.tcl)
Jan 10 00:00:32 tools-submit CRON[25860]: (tools.taxonbot) CMD (jsub -once -j y -quiet -v LC_ALL=en_US.UTF-8 -mem 1g ^I^I^I^I^I^I^Ic-uncat.tcl)
Jan 10 00:02:01 tools-submit CRON[26393]: (tools.taxonbot) CMD (jsub -once -j y -quiet -v LC_ALL=en_US.UTF-8 -mem 1g^I^I^I^I^I^I^Ilog-log.tcl)

So: not an SGE issue, but another issue altogether.

Event Timeline

valhallasw raised the priority of this task from to Needs Triage.
valhallasw updated the task description. (Show Details)
valhallasw added a project: Toolforge.
valhallasw added subscribers: valhallasw, doctaxon.
Restricted Application added a project: Cloud-Services. · View Herald TranscriptJan 10 2016, 12:58 PM
Restricted Application added subscribers: StudiesWorld, Aklapper. · View Herald Transcript

I'm wondering if this could just be a load issue. It's a 1-cpu host, and this is the load average in the last 24 hours:

i.e. 'always above one, with peaks to 9 as midnight'.

Giftpflanze added a subscriber: Giftpflanze.EditedJan 10 2016, 1:15 PM

I have had the same problem (tools.giftbot):

TZ=Europe/Berlin

2016-01-03 23:00 UTC: 0 22,23 * * 0 [ $(date +\%-H) = 0 ] && jsub -once -j y -quiet -v LC_ALL=$LANG -mem 1g ausrufer.tcl

2016-01-03 23:00 UTC: 0 22,23 * * * [ $(date +\%-H) = 0 ] && jsub -once -j y -quiet -v LC_ALL=$LANG -mem 1g picdwba.tcl

Combined out and err files: ausrufer.out, picdwba.out
The jobs can be successfully submitted outside of cron.

doctaxon added a comment.EditedJan 12 2016, 1:53 PM

Here the next problem (tools.giftbot):

TZ=Europe/Berlin

0 22,23 * * 0 [ $(date +\%-H) = 0 ] && jsub -once -j y -quiet -v LC_ALL=$LANG -mem 1g ausrufer.tcl

Combined out and err file without any error line: ausrufer.out

The jobs can be successfully submitted outside of cron but was not submitted Jan. 10th, 23:00 UTC as stated in the crontab.

doctaxon renamed this task from tools.taxonbot cronjob not firing to tools.taxonbot and tools.giftbot cronjobs not firing.Jan 12 2016, 1:54 PM
doctaxon set Security to None.

2016-01-01 00:00 UTC and 2016-01-15 00:00 UTC: 0 0 1,15 * * jlocal ./dwla.sh

Here the next problem (tools.taxonbot):

JSUB_OPTIONS=-once -j y -quiet -v LC_ALL=en_US.UTF-8 -mem 1g 
0     0     *     * *   jsub -once -j y -quiet -v LC_ALL=en_US.UTF-8 -mem 1g          beta-newday.tcl

Combined out and err file without any error line: beta-newday.out

The jobs can be successfully submitted outside of cron but was not submitted Jan. 16th, 00:00 UTC as stated in the crontab.

Number of cronjobs for each minute of the day:

There's a huge peak at midnight, but no clear peak at 2300 UTC which suggests that might be a different issue.

In general, load is around the whole hours, then at every half hour and every 15 minute mark.

There's also a baseload of 20 cronjobs that fire every minute, which is probably not necessary, but also a different issue.

And only about 200 of them actually ran:

valhallasw@tools-submit:~$ sudo grep /var/log/syslog.1 -e "Jan 16 00:00" | grep -e "CMD" | wc -l
198

Syslog is full of

Jan 16 00:00:03 tools-submit nslcd[1033]: [2a3853] error reading from client: Timer expired
Jan 16 00:00:04 tools-submit CRON[13416]: Authentication failure
Jan 16 00:00:05 tools-submit nslcd[1033]: [3a5410] error reading from client: Timer expired
Jan 16 00:00:05 tools-submit nslcd[1033]: [2ed9c9] error reading from client: Timer expired
Jan 16 00:00:05 tools-submit nslcd[1033]: [82587d] error reading from client: Timer expired
Jan 16 00:00:05 tools-submit CRON[13414]: Authentication failure

etc. That happens 89 times (Auth failure) / 95 times (Timer expired), which brings the total back to 287 / 293 -- the actual number of cronjobs starting at 00:00 should be 285, so it doesn't completely add up.

The second potential issue (but I think that might be a longer term issue) is that the cron run for 00:00 takes almost 50 seconds. I'm not sure how well cron handles the situation where it takes longer than a minute...

scfc added a subscriber: scfc.Jan 16 2016, 10:27 PM

cron keeps a "virtual" time and should not drop jobs unless the difference becomes more than five minutes.

The thread at http://lists.arthurdejong.org/nss-pam-ldapd-users/2013/msg00001.html deals with "error reading from client: Timer expired" and mentions a large number of open file descriptors. This reminds me of the issues we ran into that with the proxy; e5be046a635f0772f0c0357d63fd9aca4b780acd then increased the limit. I'm wondering if cron or nslcd may have the same problem when they need to execute ~ 300 jobs.

Here the next problem again (tools.taxonbot):

JSUB_OPTIONS=-once -j y -quiet -v LC_ALL=en_US.UTF-8 -mem 1g 
0     0     *     * *   jsub -once -j y -quiet -v LC_ALL=en_US.UTF-8 -mem 1g          beta-newday.tcl

Combined out and err file without any error line: beta-newday.out

The jobs can be successfully submitted outside of cron but was not submitted Jan. 17th, 00:00 UTC as stated in the crontab.

Any possible opinions for solution of the problem (not sure about practicabillity):

  • Install a second cron and run cron1 and cron2, maybe run 2 cpu hosts independently, selection by crontab file.
  • Look for a better software handling more of such cronjobs per minute.

@doctaxon What if you tell cron to start 5 minutes later?

Sorry, the job has to start at midnight, 00:00 UTC

Start the job 5 minutes earlier and add a five minutes sleep in your language might help?

valhallasw added a comment.EditedJan 17 2016, 4:18 PM

The thread at http://lists.arthurdejong.org/nss-pam-ldapd-users/2013/msg00001.html deals with "error reading from client: Timer expired" (...)

I tried the suggested option of increasing the logging level of nslcd, but our installed version doesn't seem to allow the log syslog debug config option (http://arthurdejong.org/nss-pam-ldapd/nslcd.conf.5). That means the only debug option available is to run nslcd manually with -d.

nscd (the caching daemon) does allow setting debug level in nscd.conf. It does generate a lot of data (~2MB in 6 minutes), but it should be OK to leave it running until tomorrow. Hopefully that will give us some ideas on what's going wrong.

(config had

debug-level           0

and is now

debug-level           2
logfile               /var/log/nscd.log

)

As the errors also happen during the day:

Jan 17 16:30:10 tools-submit nslcd[1033]: [bc73ee] error reading from client: Timer expired
Jan 17 16:30:10 tools-submit nslcd[1033]: [fde2b5] error reading from client: Timer expired

It stopped recording and tried for see if there was anything obvious in nscd.log, but I couldn't find anything clearly indicating an issue.

One thing suggesting nscld is overloaded is given by top:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  851 ganglia   20   0 89336  19m 1336 S  2.0  1.0 607:58.37 gmond
 1033 nslcd     20   0  437m 6244 1236 S  0.0  0.3 388:01.82 nslcd
13211 root      20   0 64732  24m 2716 S 22.3  1.2 185:32.22 bigbrother
26550 tools.ad  20   0  131m  10m 3968 S  0.0  0.5 115:50.94 updatetools
  716 diamond   20   0  509m  22m 2964 S  0.0  1.1  26:01.88 diamond

(388 minutes) in 18 days. tools-login has 128 minutes logged in 32 days, to give a bit of a comparison.

scfc added a comment.Jan 17 2016, 7:38 PM

I tried to trigger the warnings with:

#!/usr/bin/python3

import grp
import multiprocessing
import pwd
import time

def f(x):
    return pwd.getpwnam(x)

if __name__ == '__main__':
    project_members = grp.getgrnam('project-tools')[3]
    p = multiprocessing.Pool(300)
    print(p.map(f, project_members))

on toolsbeta-bastion and tools-submit, to no avail, so I assume that nslcd can cope with the load?

Here the next problem again (tools.taxonbot):

JSUB_OPTIONS=-once -j y -quiet -v LC_ALL=en_US.UTF-8 -mem 1g 
0       */2     *   * *     jsub -once -j y -quiet -v LC_ALL=en_US.UTF-8 -mem 1g          aa-vm.tcl

Combined out and err file without any error line: aa-vm.out

The job can be successfully submitted outside of cron but was not submitted Jan. 19th, 12:00 UTC as stated in the crontab as */2.

Giftpflanze closed this task as Resolved.Feb 24 2016, 5:19 PM

Resolved for both of us.