Page MenuHomePhabricator

mediawiki-mirror jobs generate grid errors (?)
Closed, InvalidPublic

Description

As a side-effect of T122638, the roots received the "shepherd traces" for job(s?) #501129:

12/21/2015 20:24:06 [600:7867]: wait3 returned -1
12/21/2015 20:24:06 [600:7867]: forward_signal_to_job(): mapping signal 20 TSTP
12/21/2015 20:24:06 [600:7867]: mapped signal TSTP to signal KILL
12/21/2015 20:24:06 [600:7867]: queued signal KILL
12/21/2015 20:24:06 [600:7867]: kill(-7869, TERM) -> overriddes kill(-7869, KILL)
12/21/2015 20:24:06 [600:7867]: now sending signal TERM to pid -7869
12/21/2015 20:24:09 [600:7867]: wait3 returned -1
12/21/2015 20:24:09 [600:7867]: forward_signal_to_job(): mapping signal 20 TSTP
12/21/2015 20:24:09 [600:7867]: mapped signal TSTP to signal KILL
12/21/2015 20:24:09 [600:7867]: queued signal KILL
12/21/2015 20:24:09 [600:7867]: kill(-7869, TERM) -> overriddes kill(-7869, KILL)
12/21/2015 20:24:09 [600:7867]: now sending signal TERM to pid -7869
12/21/2015 20:24:49 [600:7867]: wait3 returned -1
12/21/2015 20:24:49 [600:7867]: forward_signal_to_job(): mapping signal 20 TSTP
12/21/2015 20:24:49 [600:7867]: mapped signal TSTP to signal KILL
12/21/2015 20:24:49 [600:7867]: queued signal KILL
12/21/2015 20:24:49 [600:7867]: kill(-7869, TERM) -> overriddes kill(-7869, KILL)
12/21/2015 20:24:49 [600:7867]: now sending signal TERM to pid -7869
[…]

going on for 100k lines twice. I don't know what that means, but – once the dust of T122638 has settled – it would be nice to see if it can be avoided.

Event Timeline

scfc raised the priority of this task from to Medium.
scfc updated the task description. (Show Details)
scfc added a project: Toolforge.
scfc moved this task to Ready to be worked on on the Toolforge board.
scfc added subscribers: scfc, Anomie, Nemo_bis.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

qacct -j 501129 indicates that the job was the lighttpd job. The rest of the stats are oddly empty, no idea whether that's due to T122638 blowing everything up or whatever was causing the huge log file.

qname        webgrid-lighttpd
hostname     tools-webgrid-lighttpd-1409.eqiad.wmflabs
group        tools.mediawiki-mirror
owner        tools.mediawiki-mirror
project      NONE
department   defaultdepartment
jobname      lighttpd-mediawiki-mirror
jobnumber    501129
taskid       undefined
account      sge
priority     0
qsub_time    Thu Jan  1 00:00:00 1970
start_time   -/-
end_time     -/-
granted_pe   NONE
slots        1
failed       25  : rescheduling
exit_status  0
ru_wallclock 0
ru_utime     0.000
ru_stime     0.000
ru_maxrss    0
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    0
ru_majflt    0
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     0
ru_nivcsw    0
cpu          0.000
mem          0.000
io           0.000
iow          0.000
maxvmem      0.000
arid         undefined
==============================================================
qname        webgrid-lighttpd
hostname     tools-webgrid-lighttpd-1409.eqiad.wmflabs
group        tools.mediawiki-mirror
owner        tools.mediawiki-mirror
project      NONE
department   defaultdepartment
jobname      lighttpd-mediawiki-mirror
jobnumber    501129
taskid       undefined
account      sge
priority     0
qsub_time    Mon Aug 31 20:33:34 2015
start_time   -/-
end_time     -/-
granted_pe   NONE
slots        1
failed       19  : before writing exit_status
exit_status  0
ru_wallclock 0
ru_utime     0.000
ru_stime     0.000
ru_maxrss    0
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    0
ru_majflt    0
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     0
ru_nivcsw    0
cpu          0.000
mem          0.000
io           0.000
iow          0.000
maxvmem      0.000
arid         undefined
==============================================================
qname        webgrid-lighttpd
hostname     tools-webgrid-lighttpd-1403.eqiad.wmflabs
group        tools.mediawiki-mirror
owner        tools.mediawiki-mirror
project      NONE
department   defaultdepartment
jobname      lighttpd-mediawiki-mirror
jobnumber    501129
taskid       undefined
account      sge
priority     0
qsub_time    Mon Aug 31 20:33:34 2015
start_time   -/-
end_time     -/-
granted_pe   NONE
slots        1
failed       19  : before writing exit_status
exit_status  0
ru_wallclock 0
ru_utime     0.000
ru_stime     0.000
ru_maxrss    0
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    0
ru_majflt    0
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     0
ru_nivcsw    0
cpu          0.000
mem          0.000
io           0.000
iow          0.000
maxvmem      0.000
arid         undefined
==============================================================
qname        webgrid-lighttpd
hostname     tools-webgrid-lighttpd-1405.eqiad.wmflabs
group        tools.mediawiki-mirror
owner        tools.mediawiki-mirror
project      NONE
department   defaultdepartment
jobname      lighttpd-mediawiki-mirror
jobnumber    501129
taskid       undefined
account      sge
priority     0
qsub_time    Mon Aug 31 20:33:34 2015
start_time   -/-
end_time     -/-
granted_pe   NONE
slots        1
failed       19  : before writing exit_status
exit_status  0
ru_wallclock 0
ru_utime     0.000
ru_stime     0.000
ru_maxrss    0
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    0
ru_majflt    0
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     0
ru_nivcsw    0
cpu          0.000
mem          0.000
io           0.000
iow          0.000
maxvmem      0.000
arid         undefined

Some of the HTML pages generated by doxygen are quite huge, but static HTML oughtn't crash the webserver, right?

scfc claimed this task.

Another instance occured on Wednesday with tools.ifttt:

Job 2551539 caused action: none
 User        = tools.ifttt
 Queue       = webgrid-generic@tools-webgrid-generic-1405.tools.eqiad.wmflabs
 Start Time  = <unknown>
 End Time    = <unknown>
failed before writing exit_status:shepherd exited with exit status 19: before writing exit_status
Shepherd trace:
01/21/2016 18:51:02 [600:8177]: shepherd called with uid = 0, euid = 600
01/21/2016 18:51:02 [600:8177]: starting up 6.2u5
01/21/2016 18:51:02 [600:8177]: setpgid(8177, 8177) returned 0
01/21/2016 18:51:02 [600:8177]: no prolog script to start
01/21/2016 18:51:02 [600:8177]: parent: forked "job" with pid 8179
01/21/2016 18:51:02 [600:8177]: parent: job-pid: 8179
01/21/2016 18:51:02 [600:8179]: child: starting son(job, /usr/local/bin/tool-uwsgi-python, 0);
01/21/2016 18:51:02 [600:8179]: pid=8179 pgrp=8179 sid=8179 old pgrp=8177 getlogin()=<no login set>
01/21/2016 18:51:02 [600:8179]: reading passwd information for user 'tools.ifttt'
01/21/2016 18:51:02 [600:8179]: setosjobid: uid = 0, euid = 600
01/21/2016 18:51:02 [600:8179]: setting limits
01/21/2016 18:51:02 [600:8179]: RLIMIT_CPU setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
01/21/2016 18:51:02 [600:8179]: RLIMIT_FSIZE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
01/21/2016 18:51:02 [600:8179]: RLIMIT_DATA setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
01/21/2016 18:51:02 [600:8179]: RLIMIT_STACK setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
01/21/2016 18:51:02 [600:8179]: RLIMIT_CORE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
01/21/2016 18:51:02 [600:8179]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 4000000000 hard 4000000000) resulting: (soft 4000000000 hard 4000000000)
01/21/2016 18:51:02 [600:8179]: RLIMIT_RSS setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
01/21/2016 18:51:02 [600:8179]: setting environment
01/21/2016 18:51:02 [600:8179]: Initializing error file
01/21/2016 18:51:02 [600:8179]: switching to intermediate/target user
01/21/2016 18:51:02 [52490:8179]: closing all filedescriptors
01/21/2016 18:51:02 [52490:8179]: further messages are in "error" and "trace"
01/21/2016 18:51:02 [52490:8179]: now running with uid=52490, euid=52490
01/21/2016 18:51:02 [52490:8179]: execvlp(/bin/bash, "-bash" "-c" "/usr/local/bin/tool-uwsgi-python ")
01/26/2016 21:12:25 [600:8177]: wait3 returned -1
01/26/2016 21:12:25 [600:8177]: forward_signal_to_job(): mapping signal 20 TSTP
01/26/2016 21:12:25 [600:8177]: mapped signal TSTP to signal KILL
01/26/2016 21:12:25 [600:8177]: queued signal KILL
01/26/2016 21:12:25 [600:8177]: kill(-8179, TERM) -> overriddes kill(-8179, KILL)
01/26/2016 21:12:25 [600:8177]: now sending signal TERM to pid -8179
01/26/2016 21:12:32 [600:8177]: wait3 returned -1
01/26/2016 21:12:32 [600:8177]: forward_signal_to_job(): mapping signal 20 TSTP
01/26/2016 21:12:32 [600:8177]: mapped signal TSTP to signal KILL
01/26/2016 21:12:32 [600:8177]: queued signal KILL
01/26/2016 21:12:32 [600:8177]: kill(-8179, TERM) -> overriddes kill(-8179, KILL)
01/26/2016 21:12:32 [600:8177]: now sending signal TERM to pid -8179
[…]
01/27/2016 18:28:57 [600:8177]: wait3 returned -1
01/27/2016 18:28:57 [600:8177]: forward_signal_to_job(): mapping signal 20 TSTP
01/27/2016 18:28:57 [600:8177]: mapped signal TSTP to signal KILL
01/27/2016 18:28:57 [600:8177]: queued signal KILL
01/27/2016 18:28:57 [600:8177]: kill(-8179, TERM) -> overriddes kill(-8179, KILL)
01/27/2016 18:28:57 [600:8177]: now sending signal TERM to pid -8179
01/27/2016 18:29:37 [600:8177]: wait3 returned -1
01/27/2016 18:29:37 [600:8177]: forward_signal_to_job(): mapping signal 20 TSTP
01/27/2016 18:29:37 [600:8177]: mapped signal TSTP to signal KILL
01/27/2016 18:29:37 [600:8177]: queued signal KILL
01/27/2016 18:29:37 [600:8177]: kill(-8179, TERM) -> overriddes kill(-8179, KILL)
01/27/2016 18:29:37 [600:8177]: now sending signal TERM to pid -8179

Shepherd pe_hostfile:
tools-webgrid-generic-1405.tools.eqiad.wmflabs 1 webgrid-generic@tools-webgrid-generic-1405.tools.eqiad.wmflabs UNDEFINED

In this case, on tools-webgrid-generic-1405 load is >> 100 and lots of processes are stuck in D, so maybe that all are symptoms of the NFS lock-ups that make those processes unkillably block on I/O. So I'm closing this as invalid until I find evidence that the specific job has anything to do with it.