Webservice stuck, won't stop, can't restart
Closed, Resolved, Public

Description

On my tools autolist, catscan2 and file-siblings, the webservice has become unresponsive. Trying to stop or restart the service does not work.

tools.file-siblings@tools-bastion-01:~$ webservice stop
Stopping web service..............................Timeout: could not stop job in 30s
tools.file-siblings@tools-bastion-01:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
    771 0.34763 lighttpd-f tools.file-s dr    12/30/2015 04:10:42 webgrid-lighttpd@tools-webgrid     1
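The `dr` in the state column is the interesting part: in Grid Engine's `qstat` output, `d` marks a job for which deletion has been requested and `r` marks it as still running, so the job has been told to stop but has not gone away. A minimal sketch of picking such jobs out of `qstat` output follows; the function name `stuck_jobs` and the column positions are assumptions based on the listing above, not part of any Tool Labs tooling.

```python
def stuck_jobs(qstat_output):
    """Return (job_id, name, state) for jobs whose state includes 'd'.

    Assumes the default qstat column order shown in the task description:
    job-ID, prior, name, user, state, ...
    """
    jobs = []
    for line in qstat_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric job ID; this skips the header
        # line and the dashed separator.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        job_id, _prior, name, _user, state = fields[:5]
        if "d" in state:  # deletion requested but job still listed
            jobs.append((int(job_id), name, state))
    return jobs


SAMPLE = """\
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
    771 0.34763 lighttpd-f tools.file-s dr    12/30/2015 04:10:42 webgrid-lighttpd@tools-webgrid     1
"""

print(stuck_jobs(SAMPLE))
```

A job that stays in `dr` long after `webservice stop` usually points at the execution host rather than at the tool itself.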

Event Timeline

Magnus created this task. Jan 19 2016, 1:12 PM
Magnus raised the priority of this task to Unbreak Now!.
Magnus updated the task description.
Magnus added a project: Cloud-Services.
Magnus added a subscriber: Magnus.
Restricted Application added a subscriber: Aklapper. Jan 19 2016, 1:12 PM
scfc added a subscriber: scfc. Jan 19 2016, 2:18 PM

All three tools' webservices run on tools-webgrid-lighttpd-1412, which seems to be overloaded: on login it hangs after "The last Puppet run was at Mon Jan 18 17:09:23 UTC 2016 (1260 minutes ago)." That line also means that Puppet has not run successfully for about 21 hours, probably due to excessive load or insufficient memory, which in turn probably means that the SGE shepherd there can't be notified by the master to stop those webservices and report back. So if I don't have a better idea in the next few minutes, I'll reboot tools-webgrid-lighttpd-1412.
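The reasoning here rests on the age reported in the login banner. A rough sketch of that check is below: it pulls the minute count out of the banner line and compares it with a threshold. The 60-minute threshold and the function names are assumptions chosen for illustration; the actual Puppet run interval on these instances is not stated in this task.

```python
import re

# Matches the login banner line quoted in the comment above.
BANNER_RE = re.compile(r"The last Puppet run was at (.+?) \((\d+) minutes ago\)")


def puppet_age_minutes(banner):
    """Return the reported age of the last Puppet run in minutes, or None."""
    match = BANNER_RE.search(banner)
    return int(match.group(2)) if match else None


def puppet_is_stale(banner, max_age_minutes=60):
    """True if the banner reports a run older than max_age_minutes.

    max_age_minutes=60 is a hypothetical threshold, not a documented value.
    """
    age = puppet_age_minutes(banner)
    return age is not None and age > max_age_minutes


BANNER = "The last Puppet run was at Mon Jan 18 17:09:23 UTC 2016 (1260 minutes ago)."

print(puppet_age_minutes(BANNER))       # 1260 minutes, i.e. 21 hours
print(puppet_is_stale(BANNER))
```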

I've rebooted the instance. The console log on wikitech shows:

[1691760.896155] INFO: task jbd2/vda1-8:175 blocked for more than 120 seconds.
[1691760.902121]       Not tainted 3.13.0-62-generic #102-Ubuntu
[1691760.903224] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1691760.905020] INFO: task kworker/u8:2:7054 blocked for more than 120 seconds.
[1691760.906295]       Not tainted 3.13.0-62-generic #102-Ubuntu
[1691760.907272] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1691880.908100] INFO: task init:1 blocked for more than 120 seconds.
[1691880.911512]       Not tainted 3.13.0-62-generic #102-Ubuntu
[1691880.911959] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1691880.912751] INFO: task jbd2/vda1-8:175 blocked for more than 120 seconds.
[1691880.913308]       Not tainted 3.13.0-62-generic #102-Ubuntu
[1691880.913756] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1691880.914463] INFO: task jbd2/dm-1-8:334 blocked for more than 120 seconds.
[1691880.915002]       Not tainted 3.13.0-62-generic #102-Ubuntu
[1691880.915456] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1691880.916200] INFO: task kworker/u8:2:7054 blocked for more than 120 seconds.
[1691880.916757]       Not tainted 3.13.0-62-generic #102-Ubuntu
[1691880.917204] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1691880.917964] INFO: task cron:18501 blocked for more than 120 seconds.
[1691880.918478]       Not tainted 3.13.0-62-generic #102-Ubuntu
[1691880.918924] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1691880.919642] INFO: task sort:18520 blocked for more than 120 seconds.
[1691880.920159]       Not tainted 3.13.0-62-generic #102-Ubuntu
[1691880.920606] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1691880.921276] INFO: task sort:18540 blocked for more than 120 seconds.
[1691880.921781]       Not tainted 3.13.0-62-generic #102-Ubuntu
[1691880.922686] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1691880.923361] INFO: task sort:18544 blocked for more than 120 seconds.
[1691880.923869]       Not tainted 3.13.0-62-generic #102-Ubuntu
[1691880.924358] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

This looks like the same issue as @valhallasw's T123835.
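To compare console logs across instances, it can help to reduce the hung-task messages to a list of affected tasks. The sketch below does that for lines in the format shown above; `hung_tasks` is a hypothetical helper, and the sample is a short excerpt of the log in this task. That the `jbd2/...` entries are ext4 journal threads is general kernel knowledge; reading their presence as a sign of stalled disk I/O is an interpretation, not something this task confirms.

```python
import re
from collections import Counter

# Kernel hung-task detector line: "INFO: task <name>:<pid> blocked for more than N seconds."
# Task names may themselves contain colons (e.g. kworker/u8:2), so the
# greedy \S+ leaves only the final ":<digits>" as the PID.
HUNG_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")


def hung_tasks(console_log):
    """Count hung-task reports per (task name, pid)."""
    counts = Counter()
    for match in HUNG_RE.finditer(console_log):
        counts[(match.group(1), int(match.group(2)))] += 1
    return counts


SAMPLE_LOG = """\
[1691760.896155] INFO: task jbd2/vda1-8:175 blocked for more than 120 seconds.
[1691760.905020] INFO: task kworker/u8:2:7054 blocked for more than 120 seconds.
[1691880.908100] INFO: task init:1 blocked for more than 120 seconds.
[1691880.912751] INFO: task jbd2/vda1-8:175 blocked for more than 120 seconds.
"""

for (name, pid), count in sorted(hung_tasks(SAMPLE_LOG).items()):
    print(f"{name} (pid {pid}): reported {count} time(s)")
```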

scfc closed this task as Resolved. Jan 19 2016, 2:39 PM
scfc claimed this task.

As far as I can see, all webservices are responding again now that the webservice watcher has restarted them. I'll leave investigating the underlying issue to T123835.

Another "immortal" web service, this time for tool "catfood".

Magnus reopened this task as Open. Jan 20 2016, 10:56 AM
scfc added a comment. Jan 20 2016, 2:32 PM

catfood is "running" on tools-webgrid-lighttpd-1209; that instance is handled by T124162, so the (initial) scope of this task was resolved.

scfc closed this task as Resolved. Jan 20 2016, 2:33 PM