
continuous jobs killed during restart despite rescheduling
Closed, Resolved · Public

Description

After T99027, I dove a bit into the sge accounting logs, and it seems the same issue happened again today, but with tools-exec-1212. The following entries are in the log:

end_time host user task taskid error_status
2015-08-17 15:04:48 tools-exec-1212.eqiad.wmflabs tools.lrbot lrbot 1493743 100
2015-08-17 15:04:48 tools-exec-1212.eqiad.wmflabs tools.admin toolhistory 257 100
2015-08-17 15:04:48 tools-exec-1212.eqiad.wmflabs tools.hewiki-tools webServ 1757583 100
2015-08-17 15:04:48 tools-exec-1212.eqiad.wmflabs tools.commons-delinquent demon 108502 100
2015-08-17 15:04:58 tools-exec-1212.eqiad.wmflabs tools.cluestuff recent_huggle 5586 100
2015-08-17 15:04:58 tools-exec-1212.eqiad.wmflabs tools.loltrs loltrs 14927 100
2015-08-17 15:04:58 tools-exec-1212.eqiad.wmflabs tools.ralgisbot tes-isbn 331267 100
2015-08-17 15:04:58 tools-exec-1212.eqiad.wmflabs tools.ecmabot ecmabot-wm 521093 100
2015-08-17 15:04:58 tools-exec-1212.eqiad.wmflabs tools.ralgisbot tes-links 1738078 100
2015-08-17 15:04:58 tools-exec-1212.eqiad.wmflabs tools.sigma sandbot3 503 100
2015-08-17 15:04:58 tools-exec-1212.eqiad.wmflabs tools.yifeibot rmiw.w5 1702726 100
2015-08-17 15:04:58 tools-exec-1212.eqiad.wmflabs tools.theoslittlebot wpstubs 414321 100
2015-08-17 15:04:48 tools-exec-1212.eqiad.wmflabs tools.ralgisbot bes-zdp 331293 100
2015-08-17 15:04:48 tools-exec-1212.eqiad.wmflabs tools.ralgisbot ves-redir 331279 100

error_status = 100 = 'assumedly failed after job'.

The timing suggests this is related to the qmod -rj rather than the actual reboot:

[15:04:46] <valhallasw`cloud>	 ok, I now force-rescheduled a whole batch of jobs
[15:04:53] <valhallasw`cloud>	 with sudo qmod -rj $(qhost -j -h $HOSTS | sed -e 's/^\s*//' | cut -d ' ' -f 1|egrep ^[0-9])
[15:05:07] <valhallasw`cloud>	 and qhost -j -h $HOSTS is now clean
[15:05:17] <andrewbogott>	 ok, so… can I reboot now?  :)
[15:05:20] <valhallasw`cloud>	 yep!

Event Timeline

valhallasw raised the priority of this task from to Needs Triage.
valhallasw updated the task description.
valhallasw added a project: Toolforge.
valhallasw added subscribers: valhallasw, Andrew.
Restricted Application added a subscriber: Aklapper.

Affected tasks on the 13th (T99027):

2015-08-13 15:01:18 tools-exec-1214.eqiad.wmflabs tools.defconbot defconbot 679417 100
2015-08-13 15:01:18 tools-exec-1214.eqiad.wmflabs tools.dexbot ghaher69 4567 100
2015-08-13 15:01:18 tools-exec-1214.eqiad.wmflabs tools.bracketbot run 5543 100
2015-08-13 15:01:18 tools-exec-1214.eqiad.wmflabs tools.ralgisbot wes-isbn 331285 100
2015-08-13 15:01:18 tools-exec-1214.eqiad.wmflabs tools.cluebot cbng_redis_relay 493967 100
2015-08-13 15:01:18 tools-exec-1214.eqiad.wmflabs tools.cluebot cbng_relay 493977 100
2015-08-13 15:01:08 tools-exec-1214.eqiad.wmflabs tools.hewiki-tools webServ 258 100
2015-08-13 15:01:08 tools-exec-1214.eqiad.wmflabs tools.wpv cmd_perl 1708628 100
2015-08-13 15:01:08 tools-exec-1214.eqiad.wmflabs tools.giftbot sga 1565 100

Possibly affected tasks on the 12th (not completely sure about this set):

2015-08-12 14:48:11 tools-exec-1206.eqiad.wmflabs tools.ralgisbot qes-zdp 331282 100
2015-08-12 14:48:11 tools-exec-1206.eqiad.wmflabs tools.wikihistory dewiki_update2 1053851 100
2015-08-12 14:48:21 tools-exec-1206.eqiad.wmflabs tools.ralgisbot wes-linksan 331280 100
2015-08-12 14:48:27 tools-exec-1218.eqiad.wmflabs tools.ralgisbot ves-bienv 331274 100
2015-08-12 14:48:27 tools-exec-1218.eqiad.wmflabs tools.abibot abibot 1248683 100
2015-08-12 14:48:30 tools-exec-1202.eqiad.wmflabs tools.cluestuff recent_huggle3 5587 100
2015-08-12 14:48:30 tools-exec-1202.eqiad.wmflabs tools.ralgisbot tes-links 331262 100
2015-08-12 14:48:36 tools-exec-1209.eqiad.wmflabs tools.betacommand-dev stalkboten 1288487 100
2015-08-12 14:48:11 tools-exec-1206.eqiad.wmflabs tools.coibot coibot 40178 100
2015-08-12 14:48:46 tools-exec-1209.eqiad.wmflabs tools.anon anon 66978 100
2015-08-12 14:48:46 tools-exec-1209.eqiad.wmflabs tools.congressedits congresseditors 124436 100
2015-08-12 14:48:46 tools-exec-1209.eqiad.wmflabs tools.ralgisbot wes-isbnanx 331288 100
2015-08-12 14:48:46 tools-exec-1209.eqiad.wmflabs tools.tc-rc start 1082895 100
2015-08-12 14:48:46 tools-exec-1209.eqiad.wmflabs tools.phetools ws_ocr_daemon 1578824 100
2015-08-12 14:48:46 tools-exec-1209.eqiad.wmflabs tools.morebots production-logbot 674678 100
2015-08-12 14:48:46 tools-exec-1209.eqiad.wmflabs tools.morebots mostbots 674710 100
2015-08-12 14:48:17 tools-exec-1218.eqiad.wmflabs tools.betacommand-dev stalkbotalerts 1288488 100

The list of jobs on the 13th was caused by servers being rebooted without rescheduling jobs first. I'm not sure why not all jobs died, though -- apparently SGE did a fairly good job of rescheduling when the hosts went down.

Strangely enough, the same pattern happened today anyway: jobs are /first/ rescheduled, and the failed-job entry appears only afterwards:

error_status end_time host user job jobid

25 1970-01-01 00:00:00 tools-exec-1212.eqiad.wmflabs tools.lrbot lrbot 1493743
100 2015-08-17 15:04:48 tools-exec-1212.eqiad.wmflabs tools.lrbot lrbot 1493743

error_status 25 = rescheduled, 100 = failed.

On the 12th, e.g.:

25 1970-01-01 00:00:00 tools-exec-1206.eqiad.wmflabs tools.ralgisbot qes-zdp 331282
100 2015-08-12 14:48:11 tools-exec-1206.eqiad.wmflabs tools.ralgisbot qes-zdp 331282
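For reference, a minimal sketch of how such reschedule-then-fail pairs can be spotted in the readable-format entries; the column layout is assumed from the examples above, and this is not an actual tool in use:

```python
from collections import defaultdict

def find_reschedule_then_fail(entries):
    """Collect error statuses per job id and flag jobs that were both
    rescheduled (status 25) and reported failed (status 100)."""
    statuses = defaultdict(list)
    for line in entries:
        fields = line.split()
        # first column is error_status, last column is the job id
        error_status, jobid = int(fields[0]), fields[-1]
        statuses[jobid].append(error_status)
    return [jid for jid, st in statuses.items() if 25 in st and 100 in st]

entries = [
    "25 1970-01-01 00:00:00 tools-exec-1206.eqiad.wmflabs tools.ralgisbot qes-zdp 331282",
    "100 2015-08-12 14:48:11 tools-exec-1206.eqiad.wmflabs tools.ralgisbot qes-zdp 331282",
]
print(find_reschedule_then_fail(entries))  # ['331282']
```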

The timing again coincides with the qmod -rj:

[14:41:10] <andrewbogott>	 !log tools forcing reschedule of jobs on tools-exec-1201 tools-exec-1202 tools-exec-1204 tools-exec-1206 tools-exec-1209 tools-exec-1213 tools-exec-1217 tools-exec-1218 tools-exec-1408 tools-webgrid-generic-1404 tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1410
[14:41:13] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, dummy
[14:48:54] <andrewbogott>	 valhallasw`cloud: I just ran your suggested qmod commands but I still see jobs as running on those exec nodes.  Can you advise?
[14:50:14] <valhallasw`cloud>	 andrewbogott: mmm
[14:51:24] <andrewbogott>	 valhallasw`cloud: this is what I’m running (with all node names as args)  https://dpaste.de/8EML
[14:51:42] <andrewbogott>	 should the second step just be qdel instead?

(The confusion is due to qmod -rj not killing jobs in the 'task' queue.) The reboot happened much later, around:

[15:03:23] <andrewbogott>	 valhallasw`cloud: anyway, ready for me to reboot, yes?

With the rescheduling of all webservices today, we got a few more data points. A few jobs died during rescheduling, according to the accounting log:

100 2015-08-31 18:38:03 tools-webgrid-lighttpd-1201.eqiad.wmflabs tools.jembot lighttpd-jembot 498511
100 2015-08-31 19:33:21 tools-webgrid-generic-1403.eqiad.wmflabs tools.clickstream-api uwsgi-python-clickstream-api 1719409
100 2015-08-31 19:33:23 tools-webgrid-generic-1403.eqiad.wmflabs tools.dimensioner uwsgi-python-dimensioner 596
100 2015-08-31 19:33:26 tools-webgrid-generic-1403.eqiad.wmflabs tools.faces uwsgi-python-faces 821
100 2015-08-31 19:33:38 tools-webgrid-generic-1404.eqiad.wmflabs tools.checker uwsgi-python-checker 1719330
100 2015-08-31 19:33:25 tools-webgrid-generic-1404.eqiad.wmflabs tools.wp-signpost uwsgi-python-wp-signpost 1219960
100 2015-08-31 19:33:25 tools-webgrid-generic-1404.eqiad.wmflabs tools.outofband uwsgi-python-outofband 1219966
100 2015-08-31 19:34:26 tools-webgrid-generic-1404.eqiad.wmflabs tools.ifttt-dev uwsgi-python-ifttt-dev 1423661
100 2015-08-31 20:22:11 tools-webgrid-generic-1404.eqiad.wmflabs tools.fab-proxy uwsgi-python-fab-proxy 1757803
100 2015-08-31 20:22:20 tools-webgrid-generic-1404.eqiad.wmflabs tools.cobot uwsgi-python-cobot 1757902
100 2015-08-31 20:22:05 tools-webgrid-generic-1404.eqiad.wmflabs tools.phabricator-bug-status uwsgi-python-phabricator-bug-status 1757901

However, according to webservicewatcher, 335 (!) of the ~550 jobs had to be restarted! So not only is SGE losing jobs, it's losing them without telling us :-(

The webservicelog file is in /home/valhallasw/webservicelog-150832.txt.

@valhallasw: Possibly silly question, but does webservicewatcher actually handle the case of a job being queued for restart properly? If a bunch of jobs land in the restart queue at once (as would happen if you restarted many at once), they'll spend some time queued, because gridengine refuses to start jobs too quickly so as to avoid flooding any node.

If webservicewatcher sees the jobs not running (even though they are queued for restart) it may think them dead and forcibly restart them needlessly?

Ah. Yes, that might be true. Checking one example:

tools.gerrit-patch-uploader@tools-bastion-01:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 501752 0.39337 lighttpd-g tools.gerrit r     08/31/2015 20:58:31 webgrid-lighttpd@tools-webgrid     1

which suggests the job /was/ restarted by WSW (there's no 'R' for 'restarted by SGE', and the submit/start time corresponds to the times mentioned above). The accounting file contains:

webgrid-lighttpd:tools-webgrid-lighttpd-1202.eqiad.wmflabs:tools.gerrit-patch-uploader:tools.gerrit-patch-uploader:lighttpd-gerrit-patch-uploader:37383:sge:0:0:0:0:25:0:0:0.000000:0.000000:0.000000:0:0:0:0:0:0:0:0.000000:0:0:0:0:0:0:NONE:defaultdepartment:NONE:1:0:0.000000:0.000000:0.000000:-u tools.gerrit-patch-uploader -q webgrid-lighttpd -l h_vmem=4g,release=precise:0.000000:NONE:0.000000:0:0
webgrid-lighttpd:tools-webgrid-lighttpd-1209.eqiad.wmflabs:tools.gerrit-patch-uploader:tools.gerrit-patch-uploader:lighttpd-gerrit-patch-uploader:37383:sge:0:1439909884:1441054241:1441054707:0:0:466:0.056003:0.036002:6132.000000:0:0:0:0:10540:0:0:16.000000:72:0:0:0:606:19:NONE:defaultdepartment:NONE:1:0:0.480000:0.050081:0.003333:-u tools.gerrit-patch-uploader -q webgrid-lighttpd -l h_vmem=4g,release=precise:0.000000:NONE:2409873408.000000:0:0

or in a more readable format:
exit status, timestamp, host, user, job, jobid

25 1970-01-01 00:00:00 tools-webgrid-lighttpd-1202.eqiad.wmflabs tools.gerrit-patch-uploader lighttpd-gerrit-patch-uploader 37383
0 2015-08-31 20:58:27 tools-webgrid-lighttpd-1209.eqiad.wmflabs tools.gerrit-patch-uploader lighttpd-gerrit-patch-uploader 37383

which looks like a graceful shutdown, consistent with WSW restarting the job.
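For reference, the raw accounting lines above can be decoded with a small sketch; the field positions are assumed from the gridengine accounting(5) format (colon-separated: qname, hostname, group, owner, job_name, job_number, account, priority, submission_time, start_time, end_time, failed, exit_status, ...), and the helper name is made up:

```python
from datetime import datetime, timezone

def parse_accounting_line(line):
    """Decode the first 13 colon-separated fields of an SGE accounting
    record (naive split; a real parser would handle colons in the
    trailing category field)."""
    f = line.split(":")
    return {
        "qname": f[0],
        "hostname": f[1],
        "owner": f[3],
        "job_name": f[4],
        "job_number": f[5],
        "end_time": datetime.fromtimestamp(int(f[10]), tz=timezone.utc),
        "failed": int(f[11]),
        "exit_status": int(f[12]),
    }

# First 13 fields of the second accounting line quoted above
# (the graceful shutdown on tools-webgrid-lighttpd-1209):
line = ("webgrid-lighttpd:tools-webgrid-lighttpd-1209.eqiad.wmflabs:"
        "tools.gerrit-patch-uploader:tools.gerrit-patch-uploader:"
        "lighttpd-gerrit-patch-uploader:37383:sge:0:1439909884:"
        "1441054241:1441054707:0:0")
rec = parse_accounting_line(line)
print(rec["failed"], rec["end_time"].strftime("%Y-%m-%d %H:%M:%S"))
# 0 2015-08-31 20:58:27
```

This reproduces the readable-format entry above: failed = 0 and an end time of 2015-08-31 20:58:27 UTC.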

That feels like wrong behaviour of WebServiceMonitor to me. If there already is a job in any state, IMHO it shouldn't start another one.

It's incorrect in that case, certainly, but not all states should prevent restarting - only those that indicate the job is already on its way back up (any variation on queued, basically).

The problem with "all states but any variation on queued" is T107878. When a restart fails, WebServiceMonitor just keeps on trying every 15 s.

IMO that's a design flaw. It should:
(a) consider anything queued 'started enough' and not try to restart it; and
(b) only attempt to restart a job N times within a window of X (where N and X are configurable)
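A sliding-window throttle along the lines of (a) and (b) could be sketched as follows; the class and method names are hypothetical, and the queued-state check is an assumption about qstat state strings, not the actual WebServiceMonitor code:

```python
import time
from collections import defaultdict, deque

class RestartThrottle:
    """Sketch of (a)+(b): skip jobs that are already queued, and allow
    at most max_restarts restarts per job within a sliding window of
    window_s seconds.  Hypothetical helper, not real Toolforge code."""

    def __init__(self, max_restarts=3, window_s=3600, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self.clock = clock
        self.history = defaultdict(deque)  # job -> restart timestamps

    def should_restart(self, job, state):
        # (a) any queued variant ('qw', 'hqw', ...) counts as
        # 'started enough'; don't pile another restart on top of it
        if "qw" in state:
            return False
        # (b) drop attempts that have aged out of the window,
        # then refuse if the job has hit its restart budget
        now = self.clock()
        hist = self.history[job]
        while hist and now - hist[0] > self.window_s:
            hist.popleft()
        if len(hist) >= self.max_restarts:
            return False
        hist.append(now)
        return True
```

With, say, max_restarts=2 and window_s=100, a third restart attempt within the window is refused until earlier attempts age out, which would stop the retry-every-15-s loop from T107878.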

Change 239377 had a related patch set uploaded (by coren):
webservicemonitor: some improvements

https://gerrit.wikimedia.org/r/239377

Change 239377 merged by coren:
webservicemonitor: some improvements

https://gerrit.wikimedia.org/r/239377

bd808 claimed this task.