After T99027, I dove a bit into the SGE accounting logs, and it seems the same issue happened again today, this time on tools-exec-1212. The following entries are in the log:
end_time             host                           user                      task           taskid   error_status
2015-08-17 15:04:48  tools-exec-1212.eqiad.wmflabs  tools.lrbot               lrbot          1493743  100
2015-08-17 15:04:48  tools-exec-1212.eqiad.wmflabs  tools.admin               toolhistory    257      100
2015-08-17 15:04:48  tools-exec-1212.eqiad.wmflabs  tools.hewiki-tools        webServ        1757583  100
2015-08-17 15:04:48  tools-exec-1212.eqiad.wmflabs  tools.commons-delinquent  demon          108502   100
2015-08-17 15:04:58  tools-exec-1212.eqiad.wmflabs  tools.cluestuff           recent_huggle  5586     100
2015-08-17 15:04:58  tools-exec-1212.eqiad.wmflabs  tools.loltrs              loltrs         14927    100
2015-08-17 15:04:58  tools-exec-1212.eqiad.wmflabs  tools.ralgisbot           tes-isbn       331267   100
2015-08-17 15:04:58  tools-exec-1212.eqiad.wmflabs  tools.ecmabot             ecmabot-wm     521093   100
2015-08-17 15:04:58  tools-exec-1212.eqiad.wmflabs  tools.ralgisbot           tes-links      1738078  100
2015-08-17 15:04:58  tools-exec-1212.eqiad.wmflabs  tools.sigma               sandbot3       503      100
2015-08-17 15:04:58  tools-exec-1212.eqiad.wmflabs  tools.yifeibot            rmiw.w5        1702726  100
2015-08-17 15:04:58  tools-exec-1212.eqiad.wmflabs  tools.theoslittlebot      wpstubs        414321   100
2015-08-17 15:04:48  tools-exec-1212.eqiad.wmflabs  tools.ralgisbot           bes-zdp        331293   100
2015-08-17 15:04:48  tools-exec-1212.eqiad.wmflabs  tools.ralgisbot           ves-redir      331279   100
error_status = 100 = 'assumedly failed after job'.
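To make the clustering of end_time values in the entries above explicit, the failures can be counted per timestamp. This is only a rough sketch: the function name count_failures_by_time is made up for illustration, and it assumes the whitespace-separated layout quoted above (date and time as the first two fields, error_status as the last), not the raw accounting file format.

```shell
#!/bin/sh
# Count jobs with error_status 100 per end_time.
# Assumed input layout (as quoted above, without the header line):
#   date time host user task taskid error_status
count_failures_by_time() {
    awk '$NF == 100 { n[$1 " " $2]++ }
         END { for (t in n) print t, n[t] }' | sort
}

count_failures_by_time <<'EOF'
2015-08-17 15:04:48 tools-exec-1212.eqiad.wmflabs tools.lrbot lrbot 1493743 100
2015-08-17 15:04:48 tools-exec-1212.eqiad.wmflabs tools.admin toolhistory 257 100
2015-08-17 15:04:48 tools-exec-1212.eqiad.wmflabs tools.hewiki-tools webServ 1757583 100
2015-08-17 15:04:48 tools-exec-1212.eqiad.wmflabs tools.commons-delinquent demon 108502 100
2015-08-17 15:04:58 tools-exec-1212.eqiad.wmflabs tools.cluestuff recent_huggle 5586 100
2015-08-17 15:04:58 tools-exec-1212.eqiad.wmflabs tools.loltrs loltrs 14927 100
2015-08-17 15:04:58 tools-exec-1212.eqiad.wmflabs tools.ralgisbot tes-isbn 331267 100
2015-08-17 15:04:58 tools-exec-1212.eqiad.wmflabs tools.ecmabot ecmabot-wm 521093 100
2015-08-17 15:04:58 tools-exec-1212.eqiad.wmflabs tools.ralgisbot tes-links 1738078 100
2015-08-17 15:04:58 tools-exec-1212.eqiad.wmflabs tools.sigma sandbot3 503 100
2015-08-17 15:04:58 tools-exec-1212.eqiad.wmflabs tools.yifeibot rmiw.w5 1702726 100
2015-08-17 15:04:58 tools-exec-1212.eqiad.wmflabs tools.theoslittlebot wpstubs 414321 100
2015-08-17 15:04:48 tools-exec-1212.eqiad.wmflabs tools.ralgisbot bes-zdp 331293 100
2015-08-17 15:04:48 tools-exec-1212.eqiad.wmflabs tools.ralgisbot ves-redir 331279 100
EOF
```

For the fourteen entries above this gives two bursts: six failures at 15:04:48 and eight at 15:04:58.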
The timing suggests this is related to the qmod -rj rather than the actual reboot:
[15:04:46] <valhallasw`cloud> ok, I now force-rescheduled a whole batch of jobs
[15:04:53] <valhallasw`cloud> with sudo qmod -rj $(qhost -j -h $HOSTS | sed -e 's/^\s*//' | cut -d ' ' -f 1|egrep ^[0-9])
[15:05:07] <valhallasw`cloud> and qhost -j -h $HOSTS is now clean
[15:05:17] <andrewbogott> ok, so… can I reboot now? :)
[15:05:20] <valhallasw`cloud> yep!
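For reference, the job-ID extraction in that qmod -rj command can be written a little more defensively. The sketch below is an assumption about the intent of the sed | cut | egrep chain (keep the first field of every line whose first field starts with a digit), not the command that was actually run; extract_job_ids is a hypothetical helper name. It avoids the unquoted ^[0-9] pattern, which the shell may expand as a glob, and the \s escape, which is a GNU sed extension.

```shell
#!/bin/sh
# Print the first whitespace-separated field of every input line whose
# first field starts with a digit. Assumption: on `qhost -j` output these
# are the job-ID lines, matching the intent of the pipeline quoted above.
extract_job_ids() {
    awk '$1 ~ /^[0-9]/ { print $1 }'
}

# Hypothetical usage on the grid master; HOSTS is the variable from the
# chat log. Inspect the list first, then reschedule:
#   qhost -j -h $HOSTS | extract_job_ids
#   sudo qmod -rj $(qhost -j -h $HOSTS | extract_job_ids)
```

Printing the list before passing it to qmod -rj makes it easier to check afterwards which jobs were force-rescheduled and to compare them against the accounting entries.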