
tools.suggestbot web requests fail after a period of time
Closed, Resolved · Public

Description

As reported on labs-l[1], SuggestBot's (tools.suggestbot) web server becomes unreachable after a period of time. The web service is reported as up and running and is reachable locally[2], but not from the outside (e.g. http://tools.wmflabs.org/suggestbot/hw.html fails with a 503 Service Unavailable error).

Restarting the web service fixes the problem but only for 20-or-so minutes, after which 503s are reported again.

References:
1: https://lists.wikimedia.org/pipermail/labs-l/2016-April/004427.html
2: https://lists.wikimedia.org/pipermail/labs-l/2016-April/004430.html

BTW, I'll post to labs-l to reflect that this ticket has been created.

Event Timeline

Nettrom created this task. Apr 19 2016, 7:23 PM
Restricted Application added a project: Cloud-Services. Apr 19 2016, 7:23 PM
Restricted Application added subscribers: TerraCodes, Aklapper.

Thanks for creating this ticket, Morten. The 20 minute period is really weird to me -- we've seen issues with a race condition where, when a webservice restarts, the new webservice first registers with the proxy, and the old webservice then deregisters the new registration -- but that was an issue of seconds, not minutes.

T122515: Possible race condition in webservice HSET/HDEL

I've dug a bit into the proxy; we indeed see registration and deregistration happening:

HSET
$17
prefix:suggestbot
$2
.*
$25
http://10.68.23.222:32971
*4
--
HDEL
$17
prefix:suggestbot
$2
.*
*3
$4

What I've done now is kill the job (qdel ...) and restart it with webservice restart. That spaces the stop and start further apart in time, and that seems to have resolved the issue... for now, at least.
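The registration and deregistration entries above are Redis protocol (RESP) frames: an *N array header followed by $len-prefixed bulk strings. As an illustration (this is not the proxy's actual code), a minimal decoder for such a stream; the stream value below reconstructs the HSET/HDEL pair from the excerpt, with the array/length headers that the excerpt truncates:

```python
# Minimal decoder for a stream of Redis RESP commands, like the
# HSET/HDEL entries the dynamic proxy writes (sketch only).

def parse_resp_commands(raw: bytes):
    """Return each command in a RESP byte stream as a list of strings."""
    lines = raw.split(b"\r\n")
    i = 0
    commands = []
    while i < len(lines) and lines[i]:
        assert lines[i].startswith(b"*"), "expected array header"
        nargs = int(lines[i][1:])  # *N: number of arguments
        i += 1
        args = []
        for _ in range(nargs):
            assert lines[i].startswith(b"$"), "expected bulk-string header"
            length = int(lines[i][1:])  # $len: byte length of next value
            i += 1
            assert len(lines[i]) == length
            args.append(lines[i].decode())
            i += 1
        commands.append(args)
    return commands

# A registration followed by a deregistration, as seen in the proxy log:
stream = (b"*4\r\n$4\r\nHSET\r\n$17\r\nprefix:suggestbot\r\n$2\r\n.*\r\n"
          b"$25\r\nhttp://10.68.23.222:32971\r\n"
          b"*3\r\n$4\r\nHDEL\r\n$17\r\nprefix:suggestbot\r\n$2\r\n.*\r\n")
print(parse_resp_commands(stream))
```

Decoded, the pair reads as HSET prefix:suggestbot .* http://10.68.23.222:32971 followed by HDEL prefix:suggestbot .* — i.e. a registration that is then removed.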

Nettrom added a comment (edited). Apr 20 2016, 12:31 AM

I unfortunately have to report that the problem wasn't resolved; when I checked a couple of hours after @valhallasw's update, 503s were again being reported.

I tried a similar approach, killing the job with qdel, but chose to start it with webservice start (instead of "restart"). So far I haven't seen a 503 yet, but I have also noticed that the service is restarted quite often. I'm not sure if @valhallasw has been fiddling with it, or if there's another reason why it's happening. I've added some info about the jobs below, just in case it's useful to look them up with qacct. If there's a race condition upon restarts, then these frequent restarts of the service might be triggering it more often than we suspect.

Original service that was failing when I checked, the deletion, and startup:

tools.suggestbot@tools-bastion-02:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
5497712 0.30023 lighttpd-s tools.sugges r     04/19/2016 19:43:43 webgrid-lighttpd@tools-webgrid     1        
tools.suggestbot@tools-bastion-02:~$ qdel 5497712
tools.suggestbot has registered the job 5497712 for deletion
tools.suggestbot@tools-bastion-02:~$ qstat
tools.suggestbot@tools-bastion-02:~$ webservice --release trusty start
Starting web service.
tools.suggestbot@tools-bastion-02:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
5500996 0.30000 lighttpd-s tools.sugges r     04/19/2016 22:21:15 webgrid-lighttpd@tools-webgrid     1

I've since seen the following jobs when checking with qstat over the past couple of hours or so:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
5501440 0.30001 lighttpd-s tools.sugges r     04/19/2016 22:46:08 webgrid-lighttpd@tools-webgrid     1        
5502027 0.30000 lighttpd-s tools.sugges r     04/19/2016 23:16:35 webgrid-lighttpd@tools-webgrid     1        
5502302 0.30000 lighttpd-s tools.sugges r     04/19/2016 23:32:18 webgrid-lighttpd@tools-webgrid     1        
5503247 0.30000 lighttpd-s tools.sugges r     04/20/2016 00:17:25 webgrid-lighttpd@tools-webgrid     1
bd808 added a subscriber: bd808. Apr 20 2016, 1:17 AM

tools.suggestbot's ~/service.log shows a lot of restarts as well:

2016-04-19T22:32:14.457170 No running webservice job found, attempting to start it
2016-04-19T22:45:58.119421 No running webservice job found, attempting to start it
2016-04-19T23:16:25.405171 No running webservice job found, attempting to start it
2016-04-19T23:32:08.378894 No running webservice job found, attempting to start it
2016-04-20T00:17:15.972044 No running webservice job found, attempting to start it
2016-04-20T00:31:59.576238 No running webservice job found, attempting to start it
2016-04-20T00:46:41.672873 No running webservice job found, attempting to start it

There are roughly matching entries in ~/error.log, as seen with grep 'server stopped' error.log:

2016-04-19 22:32:20: (server.c.1558) server stopped by UID = 0 PID = 8232
2016-04-19 22:46:06: (server.c.1558) server stopped by UID = 0 PID = 8668
2016-04-19 23:16:32: (server.c.1558) server stopped by UID = 0 PID = 22819
2016-04-19 23:32:16: (server.c.1558) server stopped by UID = 0 PID = 503
2016-04-20 00:17:23: (server.c.1558) server stopped by UID = 0 PID = 15808
2016-04-20 00:32:06: (server.c.1558) server stopped by UID = 0 PID = 11068
2016-04-20 00:46:48: (server.c.1558) server stopped by UID = 0 PID = 22831

The time offsets here (the "stopped" messages appearing shortly after the service was thought to be not running) make me wonder if the service watcher is actually causing the problem.
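Those offsets can be computed directly from the two logs quoted above. A small Python sketch (not part of any tool here), pairing each watchdog entry with the matching lighttpd shutdown:

```python
# Compute how long after each "No running webservice" watchdog entry the
# lighttpd "server stopped" message appears, using the log lines above.
from datetime import datetime

# Timestamps from ~/service.log ("No running webservice job found ...")
service = ["2016-04-19T22:32:14.457170", "2016-04-19T22:45:58.119421",
           "2016-04-19T23:16:25.405171", "2016-04-19T23:32:08.378894",
           "2016-04-20T00:17:15.972044", "2016-04-20T00:31:59.576238",
           "2016-04-20T00:46:41.672873"]
# Matching timestamps from ~/error.log ("server stopped by UID = 0 ...")
stopped = ["2016-04-19 22:32:20", "2016-04-19 22:46:06",
           "2016-04-19 23:16:32", "2016-04-19 23:32:16",
           "2016-04-20 00:17:23", "2016-04-20 00:32:06",
           "2016-04-20 00:46:48"]

offsets = [
    (datetime.strptime(b, "%Y-%m-%d %H:%M:%S")
     - datetime.strptime(a, "%Y-%m-%dT%H:%M:%S.%f")).total_seconds()
    for a, b in zip(service, stopped)
]
print([round(o, 1) for o in offsets])
```

For the entries quoted above, every gap falls in the 5–8 second range, consistent with the stop following the watchdog's restart attempt rather than preceding it.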

I haven't seen many 503s over the past couple of days, so I regard this particular task as resolved. At the same time, I logged 71 restarts of the web service yesterday (see below), so that appears to be something that needs looking at. Should I open a separate ticket for that?

tools.suggestbot@tools-bastion-02:~$ grep '2016-04-20' service.log | grep "No running webservice" | wc -l
71
bd808 added a comment. Apr 21 2016, 4:08 PM

> I haven't seen many 503s over the past couple of days, so I regard this particular task as resolved. At the same time, I logged 71 restarts of the web service yesterday (see below), so that appears to be something that needs looking at. Should I open a separate ticket for that?
>
> tools.suggestbot@tools-bastion-02:~$ grep '2016-04-20' service.log | grep "No running webservice" | wc -l
> 71

That certainly seems abnormal to me. The question, I suppose, is what is causing it to die so often. What does your ~/error.log look like for the same time period? Does it have the same "server stopped" pattern that I saw in T133090#2221574?

There are 71 "server stopped" entries in ~/error.log as well. All of them have timestamps within 6–12s (mean 7.89s) after the corresponding timestamps in the service log. So yes, it's the same pattern as mentioned in T133090#2221574.

I'm not sure if this ticket should be closed now. While I rarely experience issues with accessing my web services, the problem with the server being restarted appears to persist. I wasn't sure if the deployment of T98440 would change anything; for May 1–4 I see no significant change, with the web service being restarted 80, 74, 70, and 73 times, respectively. All of these match the pattern mentioned in T133090#2221574, although the average time before restart is down to 5s.

scfc added a subscriber: scfc. May 6 2016, 3:43 AM

I just looked at error.log; the last relevant line was:

2016-05-06 03:16:39: (server.c.1558) server stopped by UID = 0 PID = 11824

Then I looked up the corresponding grid job (6023951), logged onto that machine (tools-webgrid-lighttpd-1401), and found that 11824 (in the last entry) referred to an sge_shepherd process:

scfc@tools-webgrid-lighttpd-1401:~$ sudo dump-acct /var/log/account/pacct | fgrep 11824
mandb           |v3|     0.00|     0.00|     0.00|     6|    12| 18808.00|     0.00|   11824    11697|Thu May  5 09:49:30 2016
dpkg            |v3|     0.00|     0.00|     0.00|     0|     0| 12224.00|     0.00|   11827    11824|Thu May  5 14:18:54 2016
dpkg            |v3|     0.00|     0.00|     0.00|     0|     0| 12224.00|     0.00|   11828    11824|Thu May  5 14:18:54 2016
dpkg            |v3|     0.00|     0.00|     0.00|     0|     0| 12224.00|     0.00|   11829    11824|Thu May  5 14:18:54 2016
apt-cache       |v3|    13.00|     7.00|    21.00|     0|     0| 28984.00|     0.00|   11824    10529|Thu May  5 14:18:54 2016
hostname        |v3|     0.00|     0.00|     0.00|     0|     0|  7472.00|     0.00|   11824    11822|Thu May  5 18:47:01 2016
locale          |v3|     0.00|     0.00|     0.00| 52456| 52456|  4352.00|     0.00|   11825    11824|Thu May  5 22:30:59 2016
bash            |v3|     0.00|     0.00|     0.00| 52456| 52456| 12648.00|     0.00|   11826    11824|Thu May  5 22:30:59 2016
create-mime.ass |v3|     0.00|     0.00|     1.00| 52456| 52456| 17880.00|     0.00|   11828    11824|Thu May  5 22:30:59 2016
lighttpd        |v3|     4.00|     4.00|    16.00| 52456| 52456| 19536.00|     0.00|   11824    11822|Thu May  5 22:30:59 2016
basename        |v3|     0.00|     0.00|     0.00|     0|     0|  4340.00|     0.00|   11825    11824|Fri May  6 02:46:15 2016
lighttpd        |v3|     9.00|    30.00|182386.00| 51172| 51172| 22816.00|     0.00|   11826    11824|Fri May  6 02:46:16 2016
portreleaser    |v3|     3.00|     0.00|    11.00| 51172| 51172| 31224.00|     0.00|   15791    11824|Fri May  6 03:16:39 2016
sge_shepherd    |v3|     0.00|     0.00|182399.00|     0|     0| 20488.00|     0.00|   11824     1280|Fri May  6 02:46:16 2016
scfc@tools-webgrid-lighttpd-1401:~$

maxvmem for the job was 2.687G (qacct -j 6023951), which is below the standard memory limit of 4G, but my gut still says that the webservice OOMed.

I have created /data/project/.system/config/suggestbot.web-memlimit with the content 6G and restarted the web service with webservice restart. This will either decrease the likelihood that the error occurs or it will do nothing.

Meanwhile, I'll take a look at where SGE logs who ordered a job to be killed.

Nettrom added a comment (edited). May 6 2016, 5:12 PM

Memory usage has been one of my concerns, as SuggestBot's link recommender is an iterative breadth-first search from a set of articles, and that is the part that gets used the most. However, the script (~tools.suggestbot/link-rec/link-recommender.py) is reasonably memory-optimized, as it used to run on the Toolserver, which had a 1GB memory limit.

I did some testing with the Python memory_profiler module, but it seems to be ill-suited for this kind of process. When I checked the VMEM usage of the process while it was running, though, it didn't show surprisingly high memory usage (this is towards the end of the test, keeping scores for about 100k pages, which I suspect is fairly typical):

tools.suggestbot@tools-bastion-02:~$ cat /proc/6317/status
Name:	python
State:	R (running)
Tgid:	6317
Ngid:	0
Pid:	6317
PPid:	25786
TracerPid:	0
Uid:	51172	51172	51172	51172
Gid:	51172	51172	51172	51172
FDSize:	256
Groups:	51172 
VmPeak:	  245060 kB
VmSize:	  212124 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	   33648 kB
VmRSS:	   31104 kB
VmData:	   99500 kB
VmStk:	     136 kB
VmExe:	    2800 kB
VmLib:	    9380 kB
VmPTE:	     228 kB
VmSwap:	       0 kB
Threads:	1
SigQ:	0/63706
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000000000000000
SigIgn:	0000000001001000
SigCgt:	0000000180000002
CapInh:	0000000000000000
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	0000001fffffffff
Seccomp:	0
Cpus_allowed:	f
Cpus_allowed_list:	0-3
Mems_allowed:	00000000,00000001
Mems_allowed_list:	0
voluntary_ctxt_switches:	28437
nonvoluntary_ctxt_switches:	136740

In other words: I have no reason to believe that the underlying script uses a surprising amount of memory while running.
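The Vm* lines in such a /proc/&lt;pid&gt;/status dump are easy to pull out programmatically; a hypothetical helper (vm_fields is not part of SuggestBot), shown against a few lines from the dump above:

```python
# Extract the Vm* fields (values in kB) from a /proc/<pid>/status dump.
def vm_fields(status_text: str) -> dict:
    fields = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key.startswith("Vm"):
            # value looks like "\t  245060 kB"; keep the number only
            fields[key] = int(value.strip().split()[0])
    return fields

sample = ("VmPeak:\t  245060 kB\n"
          "VmSize:\t  212124 kB\n"
          "VmRSS:\t   31104 kB\n")
print(vm_fields(sample))
```

On the dump above this gives a peak of ~245 MB virtual and ~31 MB resident, i.e. nowhere near the 4G limit.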

Also, comparing usage in access.log with the restarts listed in service.log and error.log, the web service is restarted many times even when there is no usage. The last access currently is "06/May/2016:12:11:57", but the web service has been restarted 13 times since then, about three times every hour. I noticed that qacct reports about 2.7G of max VMEM usage, but actual memory usage is low; here's an example:

tools.suggestbot@tools-bastion-02:~$ qacct -b 05061212 -o tools.suggestbot -j "lighttpd-suggestbot"
==============================================================
qname        webgrid-lighttpd    
hostname     tools-webgrid-lighttpd-1401.eqiad.wmflabs
group        tools.suggestbot    
owner        tools.suggestbot    
project      NONE                
department   defaultdepartment   
jobname      lighttpd-suggestbot 
jobnumber    6037880             
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Fri May  6 12:15:29 2016
start_time   Fri May  6 12:15:29 2016
end_time     Fri May  6 12:32:10 2016
granted_pe   NONE                
slots        1                   
failed       0    
exit_status  0                   
ru_wallclock 1001         
ru_utime     0.112        
ru_stime     0.251        
ru_maxrss    10668               
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    5655                
ru_majflt    0                   
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   24                  
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     3859                
ru_nivcsw    149                 
cpu          1.230        
mem          0.147             
io           0.003             
iow          0.000             
maxvmem      2.673G
arid         undefined
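qacct's key/value output can be turned into a dict for easier comparison; a sketch (parse_qacct is a hypothetical helper) that also highlights the gap between maxvmem and the much smaller ru_maxrss noted above:

```python
# Parse qacct's "key   value" output into a dict (sketch).
def parse_qacct(text: str) -> dict:
    info = {}
    for line in text.splitlines():
        # Skip the "=====" separator and blank lines
        if line.startswith("=") or not line.strip():
            continue
        key, _, value = line.partition(" ")
        info[key] = value.strip()
    return info

sample = """\
==============================================================
qname        webgrid-lighttpd
jobnumber    6037880
ru_maxrss    10668
maxvmem      2.673G
"""
info = parse_qacct(sample)
print(info["maxvmem"], info["ru_maxrss"])
```

For the job above, maxvmem is 2.673G while ru_maxrss is only ~10 MB, matching the observation that reported virtual memory far exceeds actual resident usage.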

Hopefully some of this can give you more of an idea about what's going on.

bd808 renamed this task from Web requests fail after a period of time to tools.suggestbot web requests fail after a period of time. May 25 2016, 4:19 PM
chasemp triaged this task as Medium priority. May 31 2016, 3:28 PM
chasemp added a subscriber: chasemp.

Still happening?

With regards to the web service being restarted: I count 80 "No running webservice" notifications in ~suggestbot/service.log for 2016-05-30. There are a handful of requests in the access.log for the same day, so it seems that the service is still restarted for no apparent reason.

The last known issue this has caused for SuggestBot was 4 days ago, but I have not added code to specifically track when the web service isn't available.

scfc moved this task from Triage to Backlog on the Toolforge board. Dec 5 2016, 4:14 AM
scfc added a comment. Dec 5 2016, 4:37 PM

I just looked at the latest job that failed (tail -10000 /var/lib/gridengine/default/common/accounting | fgrep tools.suggestbot):

webgrid-lighttpd:tools-webgrid-lighttpd-1405.eqiad.wmflabs:tools.suggestbot:tools.suggestbot:lighttpd-suggestbot:741530:sge:0:1480952801:1480952802:1480954658:0:0:1856:0.392033:0.334405:10400.000000:0:0:0:0:5590:0:0:0.000000:24:0:0:0:1918:41:NONE:defaultdepartment:NONE:1:0:1.960000:0.471716:0.001963:-u tools.suggestbot -q webgrid-lighttpd -l h_vmem=6G,release=trusty:0.000000:NONE:2497482752.000000:0:0
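The SGE accounting file is one colon-separated record per line; the leading fields (qname, hostname, group, owner, job name, job number) can be read off positionally. A sketch that assumes only those first six positions:

```python
# Split an SGE accounting record; only the first six field positions
# (qname, hostname, group, owner, job_name, job_number) are assumed here.
def job_summary(record: str) -> dict:
    keys = ["qname", "hostname", "group", "owner", "job_name", "job_number"]
    return dict(zip(keys, record.split(":")))

# Start of the accounting record quoted above (truncated for brevity):
record = ("webgrid-lighttpd:tools-webgrid-lighttpd-1405.eqiad.wmflabs:"
          "tools.suggestbot:tools.suggestbot:lighttpd-suggestbot:741530:sge:0")
print(job_summary(record))
```

Applied to the record above, this identifies job 741530 (lighttpd-suggestbot) as having run in the webgrid-lighttpd queue on tools-webgrid-lighttpd-1405.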

I then logged into tools-webgrid-lighttpd-1405 and looked at /var/spool/gridengine/execd/tools-webgrid-lighttpd-1405/messages; last entries were:

12/05/2016 10:44:44|  main|tools-webgrid-lighttpd-1405|W|job 731334 exceeds job hard limit "h_vmem" of queue "webgrid-lighttpd@tools-webgrid-lighttpd-1405.eqiad.wmflabs" (7002099712.00000 > limit:7000000000.00000) - sending SIGKILL
12/05/2016 11:16:37|  main|tools-webgrid-lighttpd-1405|W|job 733037 exceeds job hard limit "h_vmem" of queue "webgrid-lighttpd@tools-webgrid-lighttpd-1405.eqiad.wmflabs" (7053651968.00000 > limit:7000000000.00000) - sending SIGKILL
12/05/2016 12:26:21|  main|tools-webgrid-lighttpd-1405|W|job 734428 exceeds job hard limit "h_vmem" of queue "webgrid-lighttpd@tools-webgrid-lighttpd-1405.eqiad.wmflabs" (7001972736.00000 > limit:7000000000.00000) - sending SIGKILL

741530 stopped at ~16:17Z, so the issue is not OOM-related. Next assumption: webservicemonitor does something unexpected. I'll add some logging and test.

scfc added a comment. Dec 5 2016, 5:23 PM

It is indeed a problem with webservicemonitor. When it decides to restart the webservice for tools.suggestbot, qstat's output actually includes (inter alia):

<job_list state="running">
  <JB_job_number>744181</JB_job_number>
  <JAT_prio>0.30000</JAT_prio>
  <JB_name>lighttpd-suggestbot</JB_name>
  <JB_owner>tools.suggestbot</JB_owner>
  <state>r</state>
  <JAT_start_time>2016-12-05T17:14:44</JAT_start_time>
  <queue_name>webgrid-lighttpd@tools-webgrid-lighttpd-1405.eqiad.wmflabs</queue_name>
  <slots>1</slots>
</job_list>

webservicemonitor should treat that as "running".
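The check webservicemonitor has to make boils down to parsing this XML. A minimal sketch of that decision (this is not the actual webservicemonitor code), using a trimmed version of the fragment above:

```python
import xml.etree.ElementTree as ET

def is_running(qstat_xml: str, job_name: str) -> bool:
    """True if qstat's XML lists a job with this name in state 'r'."""
    root = ET.fromstring(qstat_xml)
    for job in root.iter("job_list"):
        if (job.findtext("JB_name") == job_name
                and job.findtext("state") == "r"):
            return True
    return False

# Trimmed version of the qstat -xml fragment above, wrapped in the
# usual <job_info>/<queue_info> envelope:
xml_fragment = """<job_info><queue_info>
<job_list state="running">
  <JB_job_number>744181</JB_job_number>
  <JB_name>lighttpd-suggestbot</JB_name>
  <state>r</state>
</job_list>
</queue_info></job_info>"""
print(is_running(xml_fragment, "lighttpd-suggestbot"))
```

Given the fragment above, this returns True, so the job should indeed be treated as running.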

scfc added a comment. Dec 5 2016, 7:01 PM

… and it does treat it as "running", but registered_webservices does not contain suggestbot at those times. Hmmm.

scfc added a comment. Dec 5 2016, 7:59 PM

I disabled webservicemonitor for suggestbot at 19:39Z, and at 19:46Z something deleted the proxy entry on tools-proxy-01 (/var/lib/redis/tools-proxy-01-6379.aof):

Mon Dec  5 19:45:06 UTC 2016: $4
Mon Dec  5 19:45:06 UTC 2016: HSET
Mon Dec  5 19:45:06 UTC 2016: $15
Mon Dec  5 19:45:06 UTC 2016: prefix:zppixbot
Mon Dec  5 19:45:06 UTC 2016: $2
Mon Dec  5 19:45:06 UTC 2016: .*
Mon Dec  5 19:45:06 UTC 2016: $25
Mon Dec  5 19:45:06 UTC 2016: http://192.168.0.105:8000
Mon Dec  5 19:46:53 UTC 2016: *3
Mon Dec  5 19:46:53 UTC 2016: $4
Mon Dec  5 19:46:53 UTC 2016: HDEL
Mon Dec  5 19:46:53 UTC 2016: $17
Mon Dec  5 19:46:53 UTC 2016: prefix:suggestbot
Mon Dec  5 19:46:53 UTC 2016: $2
Mon Dec  5 19:46:53 UTC 2016: .*
Mon Dec  5 19:50:19 UTC 2016: *3
Mon Dec  5 19:50:19 UTC 2016: $4
Mon Dec  5 19:50:19 UTC 2016: HDEL
Mon Dec  5 19:50:19 UTC 2016: $13
Mon Dec  5 19:50:19 UTC 2016: prefix:jembot
Mon Dec  5 19:50:19 UTC 2016: $2
scfc added a comment. Dec 5 2016, 8:21 PM

webservice start → proxy entry created and web server running on tools-webgrid-lighttpd-1416

Mon Dec  5 20:17:29 UTC 2016: *3
Mon Dec  5 20:17:29 UTC 2016: $4
Mon Dec  5 20:17:29 UTC 2016: HDEL
Mon Dec  5 20:17:29 UTC 2016: $17
Mon Dec  5 20:17:29 UTC 2016: prefix:suggestbot
Mon Dec  5 20:17:29 UTC 2016: $2
Mon Dec  5 20:17:29 UTC 2016: .*

qstat shows the job (749477) still running, and indeed on tools-webgrid-lighttpd-1416 nothing has changed. So who's deleting the proxy entry?!

scfc claimed this task. Dec 5 2016, 9:17 PM

Ouch! It's so "obvious" (probably): suggestbot runs a couple of tasks via crontab by directly calling qsub. These get executed on tools-webgrid-lighttpd-* hosts:

scfc@tools-bastion-03:~$ tail -1000 /var/lib/gridengine/default/common/accounting | fgrep tools.suggestbot 
webgrid-lighttpd:tools-webgrid-lighttpd-1405.eqiad.wmflabs:tools.suggestbot:tools.suggestbot:lighttpd-suggestbot:748325:sge:0:1480966323:1480966324:1480968682:0:0:2358:0.817020:0.627196:10404.000000:0:0:0:0:5591:0:0:0.000000:24:0:0:0:2416:59:NONE:defaultdepartment:NONE:1:0:3.680000:0.863973:0.001963:-u tools.suggestbot -q webgrid-lighttpd -l h_vmem=6G,release=trusty:0.000000:NONE:2497482752.000000:0:0
webgrid-lighttpd:tools-webgrid-lighttpd-1406.eqiad.wmflabs:tools.suggestbot:tools.suggestbot:opentask:749552:sge:0:1480968909:1480968910:1480969049:0:0:139:5.895742:6.989252:34944.000000:0:0:0:0:28903:0:0:0.000000:32:0:0:0:62758:1271:NONE:defaultdepartment:NONE:1:0:12.884994:2.074287:0.007350:-u tools.suggestbot -l h_rt=900,h_vmem=1024M,release=trusty:0.000000:NONE:226693120.000000:0:0
webgrid-lighttpd:tools-webgrid-lighttpd-1416.tools.eqiad.wmflabs:tools.suggestbot:tools.suggestbot:lighttpd-suggestbot:749477:sge:0:1480968693:1480968693:1480969395:0:0:702:0.151359:0.153474:10404.000000:0:0:0:0:5601:0:0:0.000000:24:0:0:0:766:41:NONE:defaultdepartment:NONE:1:0:1.070000:0.187252:0.001953:-u tools.suggestbot -q webgrid-lighttpd -l h_vmem=6G,release=trusty:0.000000:NONE:2485153792.000000:0:0
webgrid-generic:tools-webgrid-generic-1401.eqiad.wmflabs:tools.suggestbot:tools.suggestbot:opentask:750078:sge:0:1480969861:1480969862:1480969903:0:0:41:3.281221:3.705838:36472.000000:0:0:0:0:30306:0:0:0.000000:40:0:0:0:66226:573:NONE:defaultdepartment:NONE:1:0:6.987059:1.119782:0.007612:-u tools.suggestbot -l h_rt=900,h_vmem=1024M,release=trusty:0.000000:NONE:267497472.000000:0:0
scfc@tools-bastion-03:~$

These hosts are listed in qconf -sq webgrid-lighttpd, which also specifies /usr/local/bin/portreleaser as the epilog:

scfc@tools-grid-master:/var/lib/gridengine$ qconf -sq webgrid-lighttpd
qname                 webgrid-lighttpd
hostlist              @webgrid
seq_no                2
load_thresholds       np_load_avg=2.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH
ckpt_list             NONE
pe_list               NONE
rerun                 TRUE
slots                 256
tmpdir                /tmp
shell                 /bin/bash
prolog                NONE
epilog                /usr/local/bin/portreleaser
shell_start_mode      unix_behavior
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      SIGTERM
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY
scfc@tools-grid-master:/var/lib/gridengine$

So my assumption is that SGE executes the epilog not only for tasks that have been assigned to the queue webgrid-lighttpd, but for all tasks that are executed on a host associated with that queue. Once any such task finishes, the epilog removes the proxy entry.

I have added -q task arguments to the active crontab at 20:47Z, and there have been no restarts since. I will undo my debugging changes, treating suggestbot in webservicemonitor like any other tool, and then determine tomorrow whether the change was successful. If it was, /data/project/suggestbot/crontab would need to be amended as well, as I think this is a local copy of the active crontab.

scfc moved this task from Backlog to In Progress on the Toolforge board. Dec 6 2016, 12:02 AM
scfc closed this task as Resolved. Dec 6 2016, 9:55 PM

/data/project/suggestbot/service.log shows no restarts after:

2016-12-05T19:32:01.791443 No running webservice job found, attempting to start it

so this was apparently indeed caused by grid jobs being executed on hosts in the webgrid-lighttpd queue.

I have updated /data/project/suggestbot/crontab:

--- crontab     2016-09-05 16:36:27.010493060 +0000
+++ crontab.new 2016-12-06 21:52:39.599496230 +0000
@@ -9,20 +9,20 @@
 # Jobs updating inlink count tables
 # Run every morning for enwiki, nowiki, svwiki, ptwiki, fawiki, huwiki,
 # and ruwiki
-14 1 * * * qsub $HOME/link-rec/update_linkcounts.sh no 2>$HOME/crontab-errors.txt >/dev/null
-14 2 * * * qsub $HOME/link-rec/update_linkcounts.sh sv 2>$HOME/crontab-errors.txt >/dev/null
-14 3 * * * qsub $HOME/link-rec/update_linkcounts.sh pt 2>$HOME/crontab-errors.txt >/dev/null
-14 4 * * * qsub $HOME/link-rec/update_linkcounts.sh fa 2>$HOME/crontab-errors.txt >/dev/null
-14 5 * * * qsub $HOME/link-rec/update_linkcounts.sh hu 2>$HOME/crontab-errors.txt >/dev/null
-14 6 * * * qsub $HOME/link-rec/update_linkcounts.sh en 2>$HOME/crontab-errors.txt >/dev/null
-14 7 * * * qsub $HOME/link-rec/update_linkcounts.sh ru 2>$HOME/crontab-errors.txt >/dev/null
-14 8 * * * qsub $HOME/link-rec/update_linkcounts.sh fr 2>$HOME/crontab-errors.txt >/dev/null
+14 1 * * * qsub -q task $HOME/link-rec/update_linkcounts.sh no 2>$HOME/crontab-errors.txt >/dev/null
+14 2 * * * qsub -q task $HOME/link-rec/update_linkcounts.sh sv 2>$HOME/crontab-errors.txt >/dev/null
+14 3 * * * qsub -q task $HOME/link-rec/update_linkcounts.sh pt 2>$HOME/crontab-errors.txt >/dev/null
+14 4 * * * qsub -q task $HOME/link-rec/update_linkcounts.sh fa 2>$HOME/crontab-errors.txt >/dev/null
+14 5 * * * qsub -q task $HOME/link-rec/update_linkcounts.sh hu 2>$HOME/crontab-errors.txt >/dev/null
+14 6 * * * qsub -q task $HOME/link-rec/update_linkcounts.sh en 2>$HOME/crontab-errors.txt >/dev/null
+14 7 * * * qsub -q task $HOME/link-rec/update_linkcounts.sh ru 2>$HOME/crontab-errors.txt >/dev/null
+14 8 * * * qsub -q task $HOME/link-rec/update_linkcounts.sh fr 2>$HOME/crontab-errors.txt >/dev/null
 
 # Job(s) updating en:Wikipedia:Community portal/Opentask
-15,45 * * * * qsub $HOME/projects/opentask/opentask.sh 2>$HOME/crontab-errors.txt >/dev/null
+15,45 * * * * qsub -q task $HOME/projects/opentask/opentask.sh 2>$HOME/crontab-errors.txt >/dev/null
 
 # Job(s) updating lists of open tasks on other Wikis
-31 * * * * qsub $HOME/projects/opentask/launch-lang.sh pl "Wikipedia:Sprzątanie Wikipedii/Sugerowane listy" $HOME/projects/opentask/plwiki-taskdef.json 2>$HOME/crontab-errors.txt >/dev/null
+31 * * * * qsub -q task $HOME/projects/opentask/launch-lang.sh pl "Wikipedia:Sprzątanie Wikipedii/Sugerowane listy" $HOME/projects/opentask/plwiki-taskdef.json 2>$HOME/crontab-errors.txt >/dev/null
 
 # Job(s) publishing inlink counts for Wikis
-35 7 * * 7 qsub $HOME/projects/inlink-counts/publish-inlink-counts.sh en 2>$HOME/crontab-errors.txt >/dev/null
+35 7 * * 7 qsub -q task $HOME/projects/inlink-counts/publish-inlink-counts.sh en 2>$HOME/crontab-errors.txt >/dev/null

I would suggest replacing the calls to qsub -q task with jsub -quiet.
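That rewrite can be done mechanically across the crontab. A hypothetical sketch (the jsub -quiet form comes from the suggestion above; the regex handles both plain qsub calls and ones already carrying -q task):

```python
import re

def to_jsub(line: str) -> str:
    """Rewrite a crontab line from 'qsub [-q task]' to 'jsub -quiet'."""
    return re.sub(r"\bqsub( -q task)?\b", "jsub -quiet", line)

# One of the crontab lines from the diff above:
line = ("15,45 * * * * qsub -q task $HOME/projects/opentask/opentask.sh "
        "2>$HOME/crontab-errors.txt >/dev/null")
print(to_jsub(line))
```

jsub submits to the task queue by default, which is why the explicit -q task argument becomes unnecessary.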

I just want the record to show my appreciation for your work on this, @scfc; thank you so much for figuring this out! I'll look into replacing the qsub calls with jsub as you suggest. Thanks again for the work you've done on this!